klee / klee-web

KLEE in the browser

Restart Worker when Celery fails #135

Open · Denis-Gavrielov opened this issue 5 years ago

Denis-Gavrielov commented 5 years ago

There seems to be an issue with celery where the worker stops listening to requests if a celery heartbeat is missed. These are some threads I found where people seem to have similar issues: celery/celery#4997 celery/celery#4185 analyseether/ether_sql#42 celery/celery#2296

Right now, there is a cronjob that checks every hour whether the worker's log shows a missed heartbeat and restarts the container in that case. This works fine for now, but there might be more elegant solutions in the future.
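A minimal sketch of what such an hourly check might look like as a plain crontab entry (the container name klee_worker is a placeholder; the actual setup in this thread uses Ansible, see below):

# Hypothetical crontab line: once an hour, restart the worker container
# if its recent logs mention a missed heartbeat (names are placeholders).
0 * * * * if docker logs --tail=50 klee_worker 2>&1 | grep -q "missed heartbeat"; then docker restart klee_worker; fi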

waqas-ali-pk commented 3 years ago

@Denis-Gavrielov That's a great idea indeed, running a cronjob every hour to check the status of the workers. Are you restarting that specific worker/container, or restarting everything?

Can you share the cronjob details/code and how you implemented it for your case? I want to implement the same solution; if you share it, it would be a great help.

Denis-Gavrielov commented 3 years ago

Hi @waqas-ali-pk It has been a while since I worked on this project. If there is still a cronjob configured, it is probably set up via Ansible. Let me know if that helps.

andronat commented 3 years ago

Hey @waqas-ali-pk I think this is what you are looking for. Honestly, there is another cronjob that I've been using recently, but I'm finding it very difficult to find the time to make a proper PR.

Nevertheless, I'll just paste what I'm using, in a quick and dirty way, here:

- name: "Restart {{ kleeweb_worker_container }} if celery heartbeat missed"
  cron:
    name: "Restart {{ kleeweb_worker_container }} if celery heartbeat missed"
    minute: "30"
    job: 'if [ $(sudo docker logs --tail=1 {{ kleeweb_worker_container }} | grep "missed heartbeat from celery" | wc -l) -eq 1 ]; then sudo docker restart {{ kleeweb_worker_container }}; fi'
    user: klee
  when: not ci

- name: "Kill all remaining klee containers every day"
  cron:
    name: "Kill all remaining klee containers every day"
    minute: "0"
    hour: "0"
    job: "sudo docker ps --filter ancestor=klee/klee -q | xargs sudo docker kill"
    user: klee
  when: not ci

- name: "Restart {{ kleeweb_worker_container }} every day"
  cron:
    name: "Restart {{ kleeweb_worker_container }} every day"
    minute: "0"
    hour: "0"
    job: "sudo docker restart {{ kleeweb_worker_container }}"
    user: klee
  when: not ci

waqas-ali-pk commented 3 years ago

Hi @andronat Thanks for sharing this!! I also have another scenario: sometimes the celery worker stops without showing any error message. How can that case be handled? I really appreciate your help on this.

How can we restart only that specific worker if we do not use Docker?

andronat commented 3 years ago

> Hi @andronat Thanks for sharing this!! I also have another scenario: sometimes the celery worker stops without showing any error message. How can that case be handled? I really appreciate your help on this.

Hm, well, in general I was hoping to find time to upgrade to the latest Celery, as it seems to be more robust (e.g. heartbeats), but I never managed. PRs are always welcomed 😃. So there are definitely two things I can think of: 1) you could consider putting in the time to upgrade to the latest Celery; 2) you could pick a standard point in time at which you just blindly restart the workers. Not all of them together, though; maybe with a rolling strategy.
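A minimal sketch of such a rolling restart, assuming the workers run as Docker containers (the names worker1, worker2, worker3 are placeholders, not part of this repo):

#!/bin/sh
# Restart the celery worker containers one at a time, so the
# queue always has at least one consumer attached.
for w in worker1 worker2 worker3; do
    sudo docker restart "$w"
    # give the restarted worker time to reconnect to the broker
    sleep 60
done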

> How can we restart only that specific worker if we do not use Docker?

Well, that depends on how you run the celery project on your infrastructure. Docker is in general the easy way out, and I highly recommend it.
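For completeness, a sketch of restarting a single worker without Docker, assuming it runs under systemd (the unit names here are assumptions, not part of this repo):

# Restart one celery worker managed by systemd
sudo systemctl restart celery-worker.service

# With templated units, a single instance can be targeted on its own:
sudo systemctl restart celery@worker1.service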

waqas-ali-pk commented 3 years ago

@andronat Thanks!! This is helpful.