graphql-hive / platform

GraphQL platform - Schema registry, analytics and gateway for GraphQL federation and other GraphQL APIs
https://the-guild.dev/graphql/hive
MIT License

Usage/Usage-ingestor container can't deal with restart of 'broker' #3808

Open rickbijkerk opened 9 months ago

rickbijkerk commented 9 months ago

We're self-hosting Hive through Kubernetes. In Kubernetes each pod (container) can be restarted at any time; the concept of 'depends_on:' as defined in docker-compose doesn't exist in the Kubernetes landscape. Each container should be able to deal with a restart of its dependencies. This works nicely for all containers except for the aforementioned usage/usage-ingestor.

For the usage pod this leads to the following logs:

"msg":"[503] (::ffff:127.0.0.6) POST / (reqId=f74caede-fb53-4c47-a0fb-d2c7c720b657)"}
"msg":"Not ready to collect report (token=989••••••••••••••••••••••••••fd7)"}

For the usage-ingestor pod the logs look a bit more extensive:

"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":4,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401577124}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":5,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401577507}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":6,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401578130}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":7,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401579163}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":8,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401581205}
"broker":"hive-broker-svce:29092","clientId":"usage-ingestor","error":"The group coordinator is not available","correlationId":9,"size":55,"msg":"[Connection] Response GroupCoordinator(key: 10, version: 2)","time":1705401584505}
"groupId":"usage-ingestor-v2","stack":"KafkaJSGroupCoordinatorNotFound: Failed to find group coordinator\n    at Cluster.findGroupCoordinatorMetadata (file:///usr/src/app/index.js:74709:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async file:///usr/src/app/index.js:74644:37\n    at async [private:ConsumerGroup:join] (file:///usr/src/app/index.js:76901:28)\n    at async file:///usr/src/app/index.js:77034:13\n    at async Runner.start (file:///usr/src/app/index.js:77694:11)\n    at async start (file:///usr/src/app/in.js:78348:11)","msg":"[Consumer] Crash: KafkaJSGroupCoordinatorNotFound: Failed to find group coordinator","time":1705401584506}

Consumer stopped
[Consumer] Stopped","time":170540158450
Consumer disconnected
Consumer crashed (restart=false, error=KafkaJSGroupCoordinatorNotFound: Failed to find group coordinator)
Restarting consumer...
Starting Usage Ingestor...
Connecting Kafka Consumer
Subscribing to Kafka topic: usage_reports_v2

broker":"hive-broker-svce:29092","clientId":"usage-ingestor","stack":"Error [ERR_STREAM_WRITE_AFTER_END]: write after end\n    at new NodeError (node:internal/errors:405:5)\n    at _write (node:internal/streams/writable:322:11)\n    at Writable.write (node:internal/streams/writable:337:10)\n    at Object.sendRequest (file:///usr/src/app/index.js:74088:31)\n    at SocketRequest.send [as sendRequest] (file:///usr/src/app/index.js:72825:27)\n    at SocketRequest.send (file:///usr/src/app/index.js:72644:14)\n    at RequestQueue.sendSocketRequest (file:///usr/src/app/index.js:72865:23)\n    at RequestQueue.push (file:///usr/src/app/index.js:72849:16)\n    at file:///usr/src/app/index.js:74083:33\n    at new Promise (<anonymous>)","msg":"[Connection] Connection error: write after end","time":1705401584509}

{"level":50,"time":1705401584510,"pid":14,"hostname":"hive-usage-deployment-6d75fc64b-7qqzz","logger":"kafkajs","eventName":"consumer.crash","stack":"KafkaJSConnectionError: Connection error: write after end\n    at Socket.onError (file:///usr/src/app/index.js:73919:27)\n    at Socket.emit (node:events:517:28)\n    at Socket.emit (node:domain:489:12)\n    at emitErrorNT (node:internal/streams/destroy:151:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:116:3)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)","msg":"[Consumer] Failed to execute listener: Connection error: write after end","time":1705401584510}

Reproduction:

Expected solution:

kamilkisiela commented 7 months ago

The depends_on is really only for the docker-compose setup. We deploy Hive on k8s ourselves and point the startupProbe, livenessProbe and readinessProbe to two endpoints.

These two services are independent of each other; you can set up self-hosted Hive in a similar fashion and it should work for you as well.
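A minimal Pulumi sketch of what such a probe setup could look like. The /_readiness path comes from later in this thread; the /_health path, image name and port are assumptions and may not match the actual Hive deployment:

```typescript
import * as k8s from "@pulumi/kubernetes";

const labels = { app: "usage-ingestor" };

const usageIngestor = new k8s.apps.v1.Deployment("usage-ingestor", {
  spec: {
    selector: { matchLabels: labels },
    template: {
      metadata: { labels },
      spec: {
        containers: [
          {
            name: "usage-ingestor",
            image: "ghcr.io/graphql-hive/usage-ingestor:latest", // assumed image name
            ports: [{ containerPort: 3000 }], // assumed service port
            // Give the service time to come up and reach Kafka before liveness kicks in.
            startupProbe: {
              httpGet: { path: "/_health", port: 3000 },
              failureThreshold: 30,
              periodSeconds: 5,
            },
            // Restart the container only when the process itself is unhealthy.
            livenessProbe: {
              httpGet: { path: "/_health", port: 3000 },
              periodSeconds: 10,
            },
            // Take the pod out of rotation while the Kafka connection is not ready.
            readinessProbe: {
              httpGet: { path: "/_readiness", port: 3000 },
              periodSeconds: 10,
            },
          },
        ],
      },
    },
  },
});
```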

rickbijkerk commented 6 months ago

I have done some more playing around with it, and after configuring the probes I still ran into some issues. Specifically: "KafkaJSNonRetriableError" errors straight after startup in both usage and usage-ingestor.

What in the end worked for me was pointing the livenessProbe to /_readiness, which made sense after I looked into the code and found out that only the readiness endpoint looks at the state of the Kafka connection. And since the livenessProbe is the one Kubernetes uses to determine whether or not to restart a container, this fixed it.

Now when all the pods/containers start, usage/usage-ingestor will start, fail to become 'live', and then do a single restart, which does work because by then the broker/zookeeper pods have started.
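In probe terms, a minimal sketch of that workaround, assuming the same /_readiness path and hypothetical port 3000 as above:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Workaround sketch: point the livenessProbe at the readiness endpoint so
// Kubernetes restarts the pod when the Kafka connection never becomes ready.
const livenessProbe: k8s.types.input.core.v1.Probe = {
  httpGet: { path: "/_readiness", port: 3000 },
  initialDelaySeconds: 15,
  periodSeconds: 10,
  failureThreshold: 3,
};
```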

And last but not least, I think since you're using Pulumi you might not run into this, as Pulumi has the concept of dependencies, which we sadly don't.

Elyytscha commented 3 weeks ago

Hit the same issue today. Configuring the liveness probe to listen on the readiness endpoint does not seem like the best idea imo.

saihaj commented 3 weeks ago

Chatted with @Elyytscha to get more logs and understand what is going on. What I think we can do is add a retry limit here, so once we exceed the limit it kills the service; this way the orchestrator can re-create the pods.

https://github.com/kamilkisiela/graphql-hive/blob/a0ee93f884c97b7f02fddc5395b9bfbc1d3f860a/packages/services/usage-ingestor/src/ingestor.ts#L114-L124

Right now this logic can keep retrying forever, but what I learned from the logs is that Kafka has a limit on how much you can retry.
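A rough sketch of the kind of cap being proposed, not the actual ingestor code; the MAX_CRASHES threshold is made up, while the broker, client id, group id and topic are taken from the logs above. It uses the kafkajs crash event to exit the process once the limit is exceeded, so the orchestrator recreates the pod:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "usage-ingestor",
  brokers: ["hive-broker-svce:29092"],
});
const consumer = kafka.consumer({ groupId: "usage-ingestor-v2" });

const MAX_CRASHES = 5; // hypothetical limit
let crashCount = 0;

// kafkajs emits consumer.crash for both retriable and non-retriable errors;
// payload.restart tells us whether kafkajs will restart the consumer itself.
consumer.on(consumer.events.CRASH, ({ payload }) => {
  crashCount += 1;
  if (!payload.restart || crashCount > MAX_CRASHES) {
    // Instead of retrying forever, exit so the orchestrator (Kubernetes)
    // recreates the pod and the service reconnects to Kafka from scratch.
    console.error("Kafka consumer crashed too many times, exiting", payload.error);
    process.exit(1);
  }
});

async function main() {
  await consumer.connect();
  await consumer.subscribe({ topics: ["usage_reports_v2"] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      // process usage report ...
    },
  });
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```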

Internal slack reference: https://guild-oss.slack.com/archives/C040PLJJJ02/p1727793746436379