Open rickbijkerk opened 10 months ago
The `depends_on` is really only for the docker-compose setup.

We do deploy Hive on k8s and point `startupProbe`, `livenessProbe` and `readinessProbe` to 2 endpoints:

- `/_health` - I'm alive!
- `/_readiness` - I can do work!

These two services are independent of each other; you can set up self-hosted Hive in a similar fashion and it should work for you as well.
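A minimal sketch of what that could look like in a container spec (the port and timing values below are placeholders, not Hive defaults):

```yaml
# Sketch only: port and thresholds are assumptions, adjust to your deployment.
startupProbe:
  httpGet:
    path: /_health
    port: 4000        # assumed container port
  failureThreshold: 30
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /_health
    port: 4000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /_readiness
    port: 4000
  periodSeconds: 10
```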
I have done some more playing around with it, and after configuring the probes I still ran into some issues. Specifically: `KafkaJSNonRetriableError` errors straight after start-up in both usage and usage-ingestor.

What in the end worked for me was pointing the `livenessProbe` to `/_readiness` (see the sketch below), which made sense after I looked into the code and found out that only the readiness endpoint looks at the state of the Kafka connection. And since the `livenessProbe` is the one Kubernetes uses to determine whether or not to restart a container, this fixed it.

Now when all the pods/containers start, `usage`/`usage-ingestor` will start, fail to become 'live', and then do a single restart, which does work because by then the broker/zookeeper pods have started.
And last but not least: I think since you're using Pulumi you might not run into this, as Pulumi has the concept of dependencies, which we sadly don't.
Hit the same issue today; configuring the liveness probe to listen on the readiness endpoint does not seem like the best idea IMO.
Chatted with @Elyytscha to get more logs and understand what is going on. What I think we can do is add a retry limit here, so once we exceed the limit it kills the service; this way the orchestrator can re-create the pods.
Right now the logic we have can keep retrying forever, but what I learned from the logs is that Kafka has a limit on how much you can retry.
Internal slack reference: https://guild-oss.slack.com/archives/C040PLJJJ02/p1727793746436379
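Roughly what I have in mind, sketched with kafkajs client options (names and values here are illustrative, not the actual Hive code):

```ts
import { Kafka } from 'kafkajs';

// Illustrative values only; not the actual Hive configuration.
const kafka = new Kafka({
  clientId: 'usage-ingestor',
  brokers: [process.env.KAFKA_BROKER ?? 'kafka:9092'],
  retry: {
    initialRetryTime: 300, // ms before the first retry
    retries: 8,            // stop retrying after this many attempts
  },
});

const consumer = kafka.consumer({ groupId: 'usage-ingestor' });

// When kafkajs gives up, it emits CRASH and will not restart the consumer on
// its own; exit the process so the orchestrator can re-create the pod.
consumer.on(consumer.events.CRASH, event => {
  if (!event.payload.restart) {
    console.error('Kafka consumer crashed permanently', event.payload.error);
    process.exit(1);
  }
});
```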
We're self-hosting Hive on Kubernetes. In Kubernetes each pod (container) can be restarted at any time. The concept of `depends_on:` as defined in docker-compose doesn't exist in the Kubernetes landscape; each container should be able to deal with a restart of its dependencies. This works nicely for all containers except for the aforementioned usage/usage-ingestor.
For usage this leads to the following logs:
For the usage-ingestor pod the logs look a bit more extensive:
Reproduction:
Expected solution: