trieb-work / helm-charts

This repository hosts the Helm charts needed to deploy a production-ready Saleor backend service, consisting of a Celery task runner, a Redis DB, a Postgres DB, and the Saleor core server exposing the GraphQL API.
Apache License 2.0

Error: No nodes replied within time constraint #12

Open rrrnld opened 2 months ago

rrrnld commented 2 months ago

We're self-hosting saleor and running into issues with our celery deployment, where the worker appears to get stuck after a while. We're deploying to k8s and run celery workers like this:

celery -A saleor --app=saleor.celeryconf:app worker --loglevel=info --beat

This is taken from the config that was removed here: https://github.com/saleor/saleor/pull/13777

I can see the worker processes are running. It's also what this repo uses to deploy saleor: https://github.com/trieb-work/helm-charts/blob/fbe6ce6748c449f4a8889fa653063cafad3a4303/charts/saleor/templates/celery_deployment.yaml#L26-L52

Is this the correct way to run it? I'm asking because, for example, celery -A saleor --app=saleor.celeryconf:app is redundant (-A and --app set the same thing). Also, shelling into the container and trying to inspect it via celery -A saleor --app=saleor.celeryconf:app inspect active or celery -A saleor --app=saleor.celeryconf:app status both fail, and the liveness check in this repo does not seem to be working at all.

Error: No nodes replied within time constraint
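For what it's worth, a plain broker ping might be a simpler thing to check than active/status, in case the workers are just slow to reply rather than dead (same app path as above; --timeout only gives the nodes more time to answer):

celery -A saleor --app=saleor.celeryconf:app inspect ping --timeout 10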

Any idea what might be wrong with our health checks / liveness checks?

JannikZed commented 2 months ago

@rrrnld honestly, we haven't used the Helm chart with the most recent Saleor versions, as we moved to the cloud deployment, but it did work before. I currently don't have the capacity to test it again, though we will most likely try the self-hosted deployment again in the future. We added these liveness checks to make really sure that the workers are alive and the Redis connection is still active, and that used to work fine. What does the worker getting stuck look like on your end?
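Conceptually, that kind of check boils down to asking the worker to answer a ping over the broker connection, roughly along these lines (the exact invocation in the chart's probe may differ; this assumes the default node name celery@<hostname>):

celery -A saleor --app=saleor.celeryconf:app inspect ping -d "celery@$HOSTNAME" --timeout 10

If that returns a pong, the worker process is alive and can reach Redis; if it times out, the probe fails.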