nasa / opera-sds-pcm

Observational Products for End-Users from Remote Sensing Analysis (OPERA)
Apache License 2.0

Reduce number of celery workers on Mozart #793

Open philipjyoon opened 8 months ago

philipjyoon commented 8 months ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

No response

Describe the feature request

Mozart runs on a 16-core machine. We currently set numprocs to 32 for the orchestrator_datasets and orchestrator_jobs programs in supervisord. We do not need this many celery workers - all they do is move jobs from one queue to another after performing some checks such as dedup.

When the number of jobs coming into the system spikes (we get a large spike in HLS data around midnight every day), all 32 celery workers spin up to move jobs between queues. This starves other services on Mozart of resources, mainly rabbitmq, so rabbitmq cannot establish connections with all of the celery workers. Those workers and their associated jobs then fail.

We can safely reduce the count to 16. This will have virtually no impact on the system's data processing throughput. Queue transactions would only slow down by 1-2x, and queue transaction time would still be orders of magnitude shorter than the actual PGE processing time. For the OPERA mission, there is no need for high-frequency queue (and job) management.

philipjyoon commented 7 months ago

Turns out the supervisord.conf.mozart file doesn't make any difference in our system. Terraform needs to be modified for the change to take effect.

philipjyoon commented 7 months ago

There are two remote-exec provisioners in mozart.tf that together perform the necessary task. The issue is that the file copy must happen before running sds -d update mozart, but the provisioners are not guaranteed to run in that order. So we need to create an explicit dependency to guarantee that it will work.

https://github.com/nasa/opera-sds-pcm/blob/develop/cluster_provisioning/modules/common/mozart.tf#L442 should always happen before https://github.com/nasa/opera-sds-pcm/blob/develop/cluster_provisioning/modules/common/mozart.tf#L544
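
A minimal sketch of what such a dependency could look like is below. It is not taken from mozart.tf: the resource names, variables, and file paths are all hypothetical, and it assumes the two steps live in separate null_resource blocks. The only point it illustrates is using depends_on so the copy step finishes before the update step starts.

```hcl
# Hypothetical sketch only. Resource names, variables, and paths are
# illustrative assumptions and do not come from mozart.tf.

variable "mozart_host" { type = string }
variable "ssh_user" { type = string }
variable "private_key_file" { type = string }

# Step 1: put the supervisord configuration in place on the Mozart host.
resource "null_resource" "copy_supervisord_conf" {
  connection {
    type        = "ssh"
    host        = var.mozart_host
    user        = var.ssh_user
    private_key = file(var.private_key_file)
  }

  provisioner "remote-exec" {
    inline = [
      # Hypothetical source and destination paths.
      "cp /tmp/supervisord.conf.mozart ~/.sds/files/supervisord.conf.mozart",
    ]
  }
}

# Step 2: run the update only after step 1 has completed.
resource "null_resource" "update_mozart" {
  # Explicit ordering: Terraform will not start this resource until
  # copy_supervisord_conf has been created successfully.
  depends_on = [null_resource.copy_supervisord_conf]

  connection {
    type        = "ssh"
    host        = var.mozart_host
    user        = var.ssh_user
    private_key = file(var.private_key_file)
  }

  provisioner "remote-exec" {
    inline = ["sds -d update mozart"]
  }
}
```

Note that Terraform's documentation says multiple provisioners declared inside a single resource run in the order they are defined, so an explicit depends_on mainly matters when the two provisioners belong to different resources. Whether that is the case for the two lines linked above would need to be checked against mozart.tf.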