graphql-hive / gateway

GraphQL gateway that can act as a Federation Gateway or a Proxy Gateway for any GraphQL service.
https://the-guild.dev/graphql/hive/docs/gateway
MIT License

Memory spikes when using plugins with `@graphql-hive/gateway@^1.0.8` #2

Closed · jaffemd closed this 1 week ago

jaffemd commented 2 weeks ago

We recently upgraded from graphql-mesh v0 to @graphql-mesh/compose-cli v1 + Hive Gateway as recommended by the migration guide.
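
The composition side of that setup looks roughly like the sketch below (the subgraph name and endpoint are placeholders, not our actual services); running the mesh-compose CLI against it produces the supergraph.graphql that the gateway config further down points at.

import { defineConfig, loadGraphQLHTTPSubgraph } from "@graphql-mesh/compose-cli";

// mesh.config.ts consumed by the mesh-compose CLI; "accounts" and the
// endpoint are illustrative placeholders.
export const composeConfig = defineConfig({
  subgraphs: [
    {
      sourceHandler: loadGraphQLHTTPSubgraph("accounts", {
        endpoint: "http://accounts:4001/graphql",
      }),
    },
  ],
});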

Here are our relevant dependencies and versions:

"dependencies": {
    "@envelop/core": "^5.0.1",
    "@graphql-hive/gateway": "^1.0.8",
    "@graphql-mesh/compose-cli": "^1.0.2",
    "@graphql-mesh/supergraph": "^0.8.6",
    "graphql": "^16.8.1",
    "graphql-yoga": "^5.6.0",
  },

Here's our config:

import { defineConfig } from "@graphql-hive/gateway";
import plugins from "@app/plugins/init";

// Effectively disables supergraph polling.
const oneYearInMs = 365 * 24 * 60 * 60 * 1000;

export const gatewayConfig = defineConfig({
  supergraph: "supergraph.graphql",
  port: 8000,
  plugins: () => [...plugins],
  executionCancellation: true,
  upstreamCancellation: true,
  pollingInterval: oneYearInMs,
});

Before v1, memory usage plateaued and stayed stable. After upgrading to Hive Gateway, we immediately observed unstable memory utilization.

(screenshot: memory utilization after the upgrade)

Zooming in, every 15 to 30 minutes, there is a sharp spike in memory.

(screenshot: zoomed-in view of the recurring memory spikes)

Our only clue was a release note in Mesh v0.98.7 that referenced memory leaks caused by plugins.

We're using a mix of homegrown plugins that handle things like Datadog tracing, plus vendor plugins such as graphql-armor. We ran a short experiment with all of them turned off and didn't observe any memory spikes:

(screenshot: memory usage with all plugins disabled)
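
The kind of homegrown plugin involved looks roughly like this sketch (the hook names come from the Envelop plugin API that gateway plugins can implement; the dd-trace span is an illustrative placeholder, not our actual code):

import tracer from "dd-trace";

// Sketch: wrap every GraphQL execution in a Datadog span via the onExecute hook.
export const tracingPlugin = {
  onExecute() {
    const span = tracer.startSpan("graphql.execute");
    return {
      onExecuteDone() {
        span.finish();
      },
    };
  },
};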

To rule out the contents of our plugins as the cause, we created a barebones empty plugin and checked whether we still saw a memory spike with just that. We did: using the config below, with only a plugin that hooks into onFetch and onExecute, we still saw a memory spike after around 20 minutes.

export const gatewayConfig = defineConfig({
  supergraph: "supergraph.graphql",
  port: 8000,
  plugins: () => [
    {
      onFetch: () => {},
      onExecute: () => {},
    }
  ],
  executionCancellation: true,
  upstreamCancellation: true,
  pollingInterval: oneYearInMs,
});

(screenshot: memory usage with only the barebones plugin enabled)

This made it seem clear to us that there is some memory leak within the plugin infrastructure, possibly similar to the issue referenced in the graphql-mesh release notes.
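
A minimal in-process way to watch this kind of heap growth (a sketch only, not part of our actual setup; the interval is arbitrary) is to log process.memoryUsage() periodically alongside the gateway:

// Log Node heap/RSS once a minute so spikes show up in plain logs.
setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(
    JSON.stringify({ t: new Date().toISOString(), rss, heapTotal, heapUsed, external })
  );
}, 60_000);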

enisdenjo commented 2 weeks ago

Hey there! Thanks for reporting. To debug this in depth I'd need to know a bit more about your test environment and build a benchmark that replicates the behaviour, so I can pinpoint the issue.

Can you tell me:

  1. Which Node version are you using?
  2. Does the traffic change around the 15-20 min spikes? What is happening during that time?
  3. Is there consistently a spike every 15-20 mins, or do they come at random times? Do they also happen during low traffic?
  4. How are you performing the test? Constant VUs over time (in the sense of the sketch below), or some sort of real traffic?
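
(By constant VUs I mean something along the lines of the k6 sketch below; the endpoint and query are placeholders and assume a plain k6 setup.)

import http from "k6/http";

// Sketch: constant virtual users hammering the gateway for 30 minutes.
export const options = { vus: 50, duration: "30m" };

export default function () {
  http.post(
    "http://localhost:8000/graphql",
    JSON.stringify({ query: "{ __typename }" }),
    { headers: { "Content-Type": "application/json" } }
  );
}
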
jaffemd commented 2 weeks ago

@enisdenjo Thank you for the quick reply!

  1. Which Node version are you using?

We're running the gateway in a Docker container in a Kubernetes pod. The container is running Node 20.14.0.

  2. Does the traffic change around the 15-20 min spikes? What is happening during that time?
  3. Is there consistently a spike every 15-20 mins, or do they come at random times? Do they also happen during low traffic?
  4. How are you performing the test? Constant VUs over time, or some sort of real traffic?

This is with real traffic. It's overall fairly constant, though we generally get lower traffic overnight. The spikes come consistently every 15-20 minutes, but not at an exact interval.

(screenshots: traffic and memory graphs over the same period)

enisdenjo commented 1 week ago

We're running the gateway in a Docker container in a Kubernetes pod. The container is running Node 20.14.0.

We had some fights with Node and memory spikes in the past. I'm wondering whether that's the case here too. Can we start by updating Node in the container to the upcoming LTS (starting tomorrow), v22.10.0?

It's overall fairly constant, though we generally get lower traffic overnight. The spikes come consistently every 15-20 minutes, but not at an exact interval.

Do the spikes also happen during lower traffic?

jaffemd commented 1 week ago

Upgrading Node to 22.10.0 fixed our issue. Thank you!