Closed by philipjyoon 2 weeks ago
The culprit was that we use a single CloudWatch agent config file for all verdi worker types. That config file lists ~40 directories to watch for log files, even though any single verdi worker type only ever writes to 2 of them. The CloudWatch agent appears to be quite inefficient at scanning that many directories for log files. When I 1) stopped the agent, 2) modified the config JSON to remove ~38 of the directories, and 3) restarted it, the overall CPU usage of the agent processes dropped from ~200% to ~15%. Changing the interval from 10s to 60s didn't seem to have any effect, and neither did changing the flush_interval.
So I think the solution is for us to create one unique CloudWatch agent config file per verdi worker type, including only the directories that worker type actually needs to read (see the sketch below). This would probably also require us to create multiple launch templates, because that is what installs these config files.
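For illustration, a slimmed per-worker-type config could look roughly like this. This is a minimal sketch assuming the standard agent config layout; the file paths and log group names are placeholders, not our actual values:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/data/work/*.log",
            "log_group_name": "/verdi/disp-s1/jobs",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/home/ops/verdi/log/*.log",
            "log_group_name": "/verdi/disp-s1/daemon",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```

The point is that collect_list would contain only the 2 entries this worker type actually writes to, instead of all ~40.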
To stop the agent running in a verdi worker:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a stop
To start the agent with a specific config file:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/home/ops/test.json
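To check the agent's status after restarting it (assuming the same standard install path as above):
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status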
Just to close the loop here: after running the agent with the "slimmer" config file, I restarted it using the original "fatter" config file and observed that CPU utilization went back up to ~200%. So the reduction wasn't just an artifact of restarting the agent.
We do use multiple launch templates, one per ASG, but they all share a single static user_data that installs that one agent config file. So it would be fairly easy for us to create multiple user_data templates with unique agent configs and assign them to each ASG dynamically in Terraform, the same way we handle other launch template settings (see the sketch below).
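A rough sketch of what that could look like in Terraform; the resource names, variables, and file paths here are hypothetical, not our actual modules:

```hcl
# One launch template per verdi worker type; each renders its own slimmed
# CloudWatch agent config into the shared user_data template.
resource "aws_launch_template" "verdi" {
  for_each    = toset(var.worker_types)   # e.g. ["disp_s1", "dswx_hls", ...]
  name_prefix = "verdi-${each.key}-"

  user_data = base64encode(templatefile("${path.module}/user_data.sh.tpl", {
    cw_agent_config = file("${path.module}/configs/cw-agent-${each.key}.json")
  }))
}
```

Each ASG would then reference the launch template for its worker type, the same way other per-ASG settings are wired up today.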
Update: That htop screenshot shows each process's threads as if they were separate processes, which is how I added them up to 200% CPU. The CPU utilization of the CloudWatch agent process before this fix peaked at around 120% and averaged about 100%.
Checked for duplicates
Yes - I've already checked
Describe the bug
When running the DISP-S1 PGE with m=6 and k=15, we're seeing up to 200% CPU utilization from several CloudWatch agent processes on the verdi worker. The machine has 16 cores and 128 GB of memory.
CloudWatch shouldn't be using more than 50% of a core.
What did you expect?
n/t
Reproducible steps
Environment