onnela-lab / beiwe-backend

Beiwe is a smartphone-based digital phenotyping research platform. This is the Beiwe backend code
https://www.beiwe.org/
BSD 3-Clause "New" or "Revised" License

No new processing tasks after redeploy/restart #392

Closed russellkgw closed 2 months ago

russellkgw commented 3 months ago

Hi Eli/Team

I hope you have had a good weekend.

We are at a bit of an impasse at the moment: I came across an unexpected error in the processing of the celery push notification task. From the log file (celery_push_send.log):

The full contents of the message body was:
'[["e7Yv-97s-0y-uePXrzxu-b:APA91bHVmlVEGOPn1NBTCS5Gsh-jVi10iqrOWIU6zMTsiCvnA28nvFFnMdEwkKn-7FJCLSe_Ag-QvF6iHzfnLL52Qy52-Naof_Y6HFk15gZ9E7n2m2HDKrqIpgqzbXW5w8MVEaZVagn6", ["cVAm4gNjmf3zJYbGBva7WBVv"], [2314]], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (280b)

The full contents of the message headers:
{'lang': 'py', 'task': 'services.celery_push_notifications.celery_send_push_notification', 'id': '30015958-8e85-43b3-986b-9434c9e10837', 'shadow': None, 'eta': None, 'expires': '2024-05-26T20:47:30', 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '30015958-8e85-43b3-986b-9434c9e10837', 'parent_id': None, 'argsrepr': "['e7Yv-97s-0y-uePXrzxu-b:APA91bHVmlVEGOPn1NBTCS5Gsh-jVi10iqrOWIU6zMTsiCvnA28nvFFnMdEwkKn-7FJCLSe_Ag-QvF6iHzfnLL52Qy52-Naof_Y6HFk15gZ9E7n2m2HDKrqIpgqzbXW5w8MVEaZVagn6', ['cVAm4gNjmf3zJYbGBva7WBVv'], [2314]]", 'kwargsrepr': '{}', 'origin': 'gen16321@ip-172-...-210'}

The delivery info for this task is:
{'consumer_tag': 'None4', 'delivery_tag': 104, 'redelivered': False, 'exchange': '', 'routing_key': 'push_notifications'}
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/beiwe/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 658, in on_task_received
    strategy = strategies[type_]
KeyError: 'services.celery_push_notifications.celery_send_push_notification'
[2024-05-26 20:43:34,651: ERROR/MainProcess] Received unregistered task of type 'services.celery_push_notifications.celery_send_push_notification'.
The message has been ignored and discarded.

Did you remember to import the module containing this task?
Or maybe you're using relative imports?.
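For context, a minimal sketch (a hypothetical handler dict, not Celery's actual internals verbatim; the task names are taken from the logs in this issue) of what that KeyError means: the consumer resolves each incoming message's task name against a dict of registered handlers, so a message enqueued under a name the restarted worker no longer registers fails the lookup and is discarded.

```python
# Hypothetical stand-in for the worker's registered-task mapping; the two
# names here come from the worker's [tasks] banner later in this issue.
registered = {
    "services.celery_push_notifications.celery_heartbeat_send_push_notification": object(),
    "services.celery_push_notifications.celery_send_survey_push_notification": object(),
}

# The task name from the discarded message body/headers above.
incoming = "services.celery_push_notifications.celery_send_push_notification"

try:
    strategy = registered[incoming]  # mirrors `strategies[type_]` in the traceback
except KeyError:
    strategy = None  # Celery logs "Received unregistered task" and drops the message

print(strategy is None)  # → True: the queued name is absent from the worker's tasks
```

If that is what is happening, messages enqueued under the old task name before a redeploy will keep failing this lookup until they expire or the queue is drained.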

Initially I thought this was just down to an old version of the code, so I redeployed the latest version of main (without error). Our setup is the “scalable deployment”: 1 EB webserver, 1 data processing manager, and 1 data processing server. With that I restarted rabbit (service rabbitmq-server restart) and the data processing (processing-restart) for good measure.

But over the last few days we have not seen any processing (data and push notifications).

According to rabbitmqctl status the service is running:

[{pid,7819},
 {running_applications,
    [{rabbit,"RabbitMQ","3.6.10"},
…

And on the data processing server the celery workers are up, per ps -ef | grep celery:

… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.celery_data_processing worker -Q data_processing --loglevel=info -Ofair --hostname=%h_processing --autoscale=10,2
… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.celery_push_notifications worker -Q push_notifications --loglevel=info -Ofair --hostname=%h_notifications --concurrency=20 --pool=threads
… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.scripts_runner worker -Q scripts_queue --loglevel=info -Ofair --hostname=%h_scripts --autoscale=10,2
… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.scripts_runner worker -Q scripts_queue --loglevel=info -Ofair --hostname=%h_scripts --autoscale=10,2
… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.scripts_runner worker -Q scripts_queue --loglevel=info -Ofair --hostname=%h_scripts --autoscale=10,2
… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.celery_data_processing worker -Q data_processing --loglevel=info -Ofair --hostname=%h_processing --autoscale=10,2
… /home/ubuntu/.pyenv/versions/beiwe/bin/python -m celery -A services.celery_data_processing worker -Q data_processing --loglevel=info -Ofair --hostname=%h_processing --autoscale=10,2
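(A note on the repeated lines above, which might otherwise look like stale duplicates: assuming the usual prefork model, a worker started with --autoscale=10,2 keeps a master process plus at least two pool children, while --pool=threads runs everything in one process. A rough sketch of the expected ps line counts under that assumption:)

```python
# Rough sketch, assuming one prefork master process plus the autoscale
# minimum of pool-child processes, and a single process for the threads pool.
def expected_ps_lines(pool: str, autoscale_min: int = 2) -> int:
    if pool == "threads":
        return 1  # worker threads all live inside one process
    return 1 + autoscale_min  # master process + minimum pool children

print(expected_ps_lines("prefork"))  # → 3 (matches data_processing and scripts_queue)
print(expected_ps_lines("threads"))  # → 1 (matches push_notifications)
```

So three identical-looking data_processing and scripts_queue entries and a single push_notifications entry are consistent with healthy workers rather than leftovers.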

From tail -50 celery_push_send.log:

-- ******* ---- Linux-5.15.0-1062-aws-x86_64-with-glibc2.2.5 2024-05-29 20:05:42
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         services.push_notification_send:0x7fc4ce199460
- ** ---------- .> transport:   amqp://beiwe:**@172…..210:50000//
- ** ---------- .> results:     rpc://
- *** --- * --- .> concurrency: 20 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> push_notifications exchange=push_notifications(direct) key=push_notifications

[tasks]
  . services.celery_push_notifications.celery_heartbeat_send_push_notification
  . services.celery_push_notifications.celery_send_survey_push_notification

…

[2024-05-29 20:05:42,760: INFO/MainProcess] Connected to amqp://beiwe:**@172…..210:50000//
[2024-05-29 20:05:42,760: WARNING/MainProcess] /home/ubuntu/.pyenv/versions/beiwe/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:507: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
  warnings.warn(

[2024-05-29 20:05:42,776: INFO/MainProcess] mingle: searching for neighbors
[2024-05-29 20:05:43,817: INFO/MainProcess] mingle: sync with 1 nodes
[2024-05-29 20:05:43,817: INFO/MainProcess] mingle: sync complete
[2024-05-29 20:05:43,855: INFO/MainProcess] celery@ip-172-...-247_notifications ready.
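(As an aside, the CPendingDeprecationWarning in this tail names its own remedy. A minimal config sketch, assuming `app` is the Celery application object defined in the services module; the app name here is purely illustrative:)

```python
from celery import Celery

app = Celery("services.celery_push_notifications")  # hypothetical app construction

# Opt in now to the pre-6.0 behavior the warning describes: keep retrying
# the broker connection during worker startup.
app.conf.broker_connection_retry_on_startup = True
```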

It would appear that it is connected to Rabbit. Tailing celery_processing.log gives a similar result. There have been no errors in Sentry since deploying/restarting. Disk space on all the machines has not been exhausted.

Have you perhaps seen something like this before? Are there any additional logs I can check?

I have limited Python service hosting experience so perhaps I am missing something trivial in this regard. Any assistance in getting our Beiwe service (and associated studies) back in shape will be much appreciated. Please let me know if I can provide any additional details.

Related doc: https://github.com/onnela-lab/beiwe-backend/wiki/Celery-troubleshooting

Thank you.

Cheers Russell

biblicabeebli commented 3 months ago

Well that seems bad...

Is this a new data processing server running Ubuntu 24.04, the 23.10 stopgap we had for a few months, or the older 18.04 one? (If it's 18.04, definitely deploy a new one using the launch script.)

I'll take a look but I don't think we've had a complete system collapse like that

(Are the supervisord and celery processes/subprocesses running? You might need to run the processing-stop command, wait for them all to die, then processing-start. Probably not, though, given the stack traces.)

russellkgw commented 3 months ago

Hi Eli

Thanks for getting back to me. In December I noticed that the worker OS was out of support, and since I had some extra time I went ahead and updated it to 20.04.6 LTS. This process was not as painless as I would have hoped, due to some of the required libraries not playing ball, but I managed to get the service up and running again.

I left the data-manager on 18.04.6 LTS due to the issue above.

htop reports multiple celery processes running at this time.

Ahh yes the launch script 💡, maybe it would be more straightforward to just bring up a new worker from scratch.

Would it be safe to also spin up a new data processing manager? That is, first remove the existing manager, then add one via python launch_script.py -create-manager?

biblicabeebli commented 3 months ago

(I'll respond to your actual issue in the next response, bear with.)


Oh, apologies for that struggle.

I guess it's not as clear as it should be - that's an important piece of feedback - part of the intent of the platform is to lower the barrier to entry with respect to the system administration load, to make it less intense for people who are not the world's expert in Beiwe. I'm going to make a note that this needs to be surfaced better.


I do periodic updates to the launch script when there are platform level upgrades, and any time there are high level changes like an updated supervisord or celery configuration. (And also when Amazon decides to change the identifier format for their Ubuntu images 🙄.)

This March/April I did a bunch of work updating dependencies and the Ubuntu platform; there was a transient 23.10 version while we waited on 24.04, which should now be the version deployed on the main branch.

The intended pattern-of-work for you is:

I also build launch script commands for major technical and migration hassles. For example, the Python 3.6 -> 3.8 Elastic Beanstalk platform update has a -clone-environment command, for which an issue and an explicit wiki directions article will be posted -- and there will be a similar Python 3.8 -> 3.11 update by October of this year, so watch this space.

I'm going to make a new tag for relevant issues, "Infrastructure" (will tend to be paired with the "ANNOUNCEMENT" tag) for system admin level questions. Maybe that will help some. I also try to pin big items.

It seems we need to find some more ways to make this clear and encourage posts about those tasks to improve the SEO.

biblicabeebli commented 3 months ago

> Ahh yes the launch script 💡, maybe it would be more straightforward to just bring up a new worker from scratch.
>
> Would it be safe to also spin up a new data processing manager ? That is first remove the existing manager then add via: python launch_script.py -create-manager ?

Yes. Follow the pattern I described above; if you run into any immediate problems you are welcome to email me directly - username at gmail - and post an issue. Deployment problems get very high priority, and if something is absolutely unfathomable I am allowed some time for direct debugging outside of platform development.

biblicabeebli commented 3 months ago

I also JUST merged the new Heartbeat feature into main; you can see a description of it on this announcement post:

(so make sure to pull)

russellkgw commented 3 months ago

Hi Eli

Thank you for the details; this insight is valuable. With that, it would seem that our deployment process has not been the best. I followed your instructions above, removed the old manager and worker (terminate-processing-servers), and replaced them with newer versions.

Things seem to be working as expected 🎉. Thank you again for your efforts.

russellkgw commented 3 months ago

Closing this issue, thank you for the help.

biblicabeebli commented 3 months ago

(Reopening because I need to remember to update the documentation/readme.)

biblicabeebli commented 2 months ago

Readme has been updated on a branch; re-closing this issue.