Closed russellkgw closed 2 months ago
Well that seems bad...
Is this a new data processing server running Ubuntu 24.04, the 23.10 stopgap we had for a few months, or the older 18.04 one? (If it's 18.04, definitely deploy a new one using the launch script.)
I'll take a look but I don't think we've had a complete system collapse like that
(Are supervisord and celery processes/subprocesses running? You might need to run the `processing-stop` command, wait for them all to die, then `processing-start`. Probably not, though, given the stack traces.)
Hi Eli
Thanks for getting back to me. In December I noticed that the worker OS was out of support, and since I had some extra time I went ahead and updated it to 20.04.6 LTS. The process was not as painless as I would have hoped, because some of the required libs did not play ball, but I managed to get the service up and running again.
I left the data-manager on 18.04.6 LTS due to the issue above.
`htop` reports multiple celery processes running at this time.
Ahh yes, the launch script 💡. Maybe it would be more straightforward to just bring up a new worker from scratch.
Would it be safe to also spin up a new data processing manager? That is, first remove the existing manager and then add one via `python launch_script.py -create-manager`?
(I'll respond to your actual issue in the next response, bear with.)
Oh, apologies for that struggle.
I guess it's not as clear as it should be - that's an important piece of feedback. Part of the intent of the platform is to lower the barrier of entry with respect to the system administration load, to make it less intense for people who are not the world's expert in Beiwe. I'm going to make a note that this needs to be surfaced better.
I do periodic updates to the launch script when there are platform-level upgrades, and any time there are high-level changes like an updated `supervisord` or `celery` configuration. (And also when Amazon decides to change the identifier format for their Ubuntu images 🙄.)
This March/April I did a bunch of work updating dependencies and the Ubuntu platform; there was a transient 23.10 version while we waited on 24.04, which should now be the version deployed on the main branch.
The intended pattern-of-work for you is:
- Get onto the current `main` branch.
- Run the `eb deploy` command to update the web servers with current `main` code. (`eb deploy` may also update the database schema, so it has to finish before you create a new manager/worker server.)

I also build launch script commands for major technical and migration hassles. For example, the Python 3.6 -> 3.8 Elastic Beanstalk platform update has a `-clone-environment` command, for which there will be an issue posted and a wiki directions article explicitly created -- and there will be a similar Python 3.8 -> 3.11 update by October of this year, so watch this space.
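For concreteness, that pattern might look roughly like the following from a shell. This is only a sketch: the exact launch script flags vary by version, so verify them against the script's own help output (`-create-manager` and the `terminate-processing-servers` operation are the names used elsewhere in this thread).

```shell
# 1. Get onto the current main branch locally.
git checkout main
git pull

# 2. Update the web servers. This may also run database schema
#    migrations, so let it finish before touching processing servers.
eb deploy

# 3. Replace the data processing servers via the launch script.
#    (Flag spellings below are as mentioned in this thread; check the
#    launch script's actual options before running.)
python launch_script.py -terminate-processing-servers
python launch_script.py -create-manager
```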
I'm going to make a new tag for relevant issues, "Infrastructure" (will tend to be paired with the "ANNOUNCEMENT" tag) for system admin level questions. Maybe that will help some. I also try to pin big items.
It seems we need to find some more ways to make this clear and encourage posts about those tasks to improve the SEO.
> Ahh yes, the launch script 💡. Maybe it would be more straightforward to just bring up a new worker from scratch. Would it be safe to also spin up a new data processing manager? That is, first remove the existing manager and then add one via `python launch_script.py -create-manager`?
Yes. Follow the pattern I described above; if you run into any immediate problems you are welcome to email me directly - username at gmail - and post an issue. Deployment problems get very high priority, and if something is absolutely unfathomable I am allowed some time for direct debugging outside of platform development.
I also JUST merged the new Heartbeat feature into `main` (so make sure to pull); you can see a description of it on this announcement post:
Hi Eli
Thank you for the details; this insight is valuable. With that, it would seem that our deployment process has not been the best. I followed your instructions above, removed the old manager and worker (`terminate-processing-servers`), and replaced them with newer versions.
Things seem to be working as expected 🎉. Thank you again for your efforts.
Closing this issue, thank you for the help.
(Reopening because I need to remember to update the documentation/readme.)
The readme has been updated on a branch; re-closing this issue.
Hi Eli/Team
I hope you have had a good weekend.
We are at a bit of an impasse at the moment: I came across an unexpected error in the processing of the celery push notification task. From the log file (`celery_push_send.log`):

[log excerpt not captured]

Initially I thought this was just down to an old version of the code, so I redeployed the latest version of `main` (without error). Our setup is the "scalable deployment": 1 EB webserver, 1 data processing manager, and 1 data processing server. With that, I restarted rabbit (`service rabbitmq-server restart`) and the data processing (`processing-restart`) for good measure. But over the last few days we have not seen any processing (data or push notifications).

According to `rabbitmqctl status` the service is running:

[output not captured]

And on the data processing server the celery workers are up, per `ps -ef | grep celery`:

[output not captured]

From `tail -50 celery_push_send.log`:

[output not captured]

It would appear that it is connected to Rabbit. `tail celery_processing.log` outputs a similar result. There have been no errors in Sentry since deploying/restarting, and disk space on all the machines has not been exhausted.

Have you perhaps seen something like this before? Are there any additional logs I can check?
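One cheap check worth adding to the list above is confirming that the worker host can actually reach RabbitMQ's AMQP port (5672 by default) on the manager, which rules out security-group or networking problems. This is a generic sketch, not part of the Beiwe tooling; the hostname in the commented example is a placeholder for your manager's address.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False


# Example (hypothetical hostname -- substitute your manager's address):
# print(port_open("manager.internal", 5672))
```

A successful TCP connect only proves reachability, not that the AMQP handshake or credentials work, but it quickly separates "broker unreachable" from "broker reachable but not consuming".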
I have limited Python service hosting experience so perhaps I am missing something trivial in this regard. Any assistance in getting our Beiwe service (and associated studies) back in shape will be much appreciated. Please let me know if I can provide any additional details.
Related doc: https://github.com/onnela-lab/beiwe-backend/wiki/Celery-troubleshooting
Thank you.
Cheers, Russell