philipjyoon opened 8 months ago
It turns out that editing the supervisord.conf.mozart file makes no difference in our system; the Terraform configuration needs to be modified for the change to take effect.
There are two remote-exec provisioners in mozart.tf that together perform the necessary task. The issue is that the file copy must happen before `sds -d update mozart` runs, but the provisioners execute in an unspecified order. We need to create an explicit dependency to guarantee the correct ordering.
The provisioner at https://github.com/nasa/opera-sds-pcm/blob/develop/cluster_provisioning/modules/common/mozart.tf#L442 should always run before the one at https://github.com/nasa/opera-sds-pcm/blob/develop/cluster_provisioning/modules/common/mozart.tf#L544.
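One way to enforce this ordering is to place the two provisioners in separate `null_resource` blocks chained with `depends_on`, since Terraform only guarantees ordering between resources through explicit or implicit dependencies. This is a hedged sketch, not the actual layout of mozart.tf; the resource names, paths, and connection details here are hypothetical:

```hcl
# Hypothetical sketch -- resource names and paths are illustrative,
# not the real ones in mozart.tf.
resource "null_resource" "copy_supervisord_conf" {
  provisioner "file" {
    source      = "supervisord.conf.mozart"
    destination = "~/mozart/etc/supervisord.conf"
  }
}

resource "null_resource" "update_mozart" {
  # Guarantees the file copy above completes before the update runs.
  depends_on = [null_resource.copy_supervisord_conf]

  provisioner "remote-exec" {
    inline = ["sds -d update mozart"]
  }
}
```

An alternative, if both provisioners can live inside the same resource block, is to rely on Terraform's rule that multiple provisioners within a single resource run in declaration order.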
Checked for duplicates
Yes - I've already checked
Alternatives considered
Yes - and alternatives don't suffice
Related problems
No response
Describe the feature request
Mozart runs on a 16-core machine. We currently use 32 as the numprocs value for the
orchestrator_datasets
and orchestrator_jobs
programs in supervisord. We do not need this many Celery workers; all they do is move jobs from one queue to another after performing some checks, such as dedup. When we have a spike in the number of jobs coming into the system (we get a large spike in HLS data around midnight every day), all 32 Celery workers spin up to move jobs between queues. This starves other services on Mozart of resources, mainly RabbitMQ, and leaves RabbitMQ unable to make connections with all the Celery workers, which in turn causes those workers and the supporting jobs to fail.
We can safely reduce the count to 16. This will have virtually no impact on the system's data-processing throughput: it would only slow queue transactions by 1-2x, and queue transaction time is still orders of magnitude shorter than the actual PGE processing time. For the OPERA mission, there is no need for high-frequency queue (and job) management.
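For illustration, the corresponding supervisord entries might look like the following with the reduced count. This is a hedged sketch: the section names come from the issue, but every other option shown (process names, values) is an assumption, and the real config will contain additional settings such as the command lines:

```ini
; Hypothetical sketch -- only the section names and numprocs change
; are from the issue; other options are illustrative.
[program:orchestrator_datasets]
process_name=%(program_name)s-%(process_num)02d
numprocs=16          ; reduced from 32 on a 16-core machine

[program:orchestrator_jobs]
process_name=%(program_name)s-%(process_num)02d
numprocs=16          ; reduced from 32
```

With `numprocs` matching the core count, a midnight spike in HLS jobs can no longer oversubscribe the machine and starve RabbitMQ of resources.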