Closed by philipjyoon 2 weeks ago
The culprit was that we use a single CloudWatch agent config file for all verdi worker types. That config file lists ~40 directories to watch for log files, even though any single verdi worker type only ever writes to 2 of them. The CloudWatch agent appears to be quite inefficient at scanning that many directories for log files. When I 1) stopped the agent, 2) modified the config JSON to remove ~38 of the directories, and 3) restarted it, the overall CPU usage of the agent processes dropped from ~200% to ~15%. Changing the interval from 10s to 60s didn't seem to have any effect, and neither did changing the flush_interval.
So I think the solution is for us to create one unique CloudWatch agent config file per verdi worker type, including only the directories that worker type actually needs to read (see the sketch below). This would probably also require us to create multiple launch templates, because that is what installs these config files.
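For illustration, a slimmed per-worker-type config could look roughly like this. This is a minimal sketch assuming the standard agent config layout; the file paths and log group names are placeholders, not our actual values:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/data/work/*.log",
            "log_group_name": "/verdi/disp-s1/jobs",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/home/ops/verdi/log/*.log",
            "log_group_name": "/verdi/disp-s1/daemon",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```

The point is that collect_list would contain only the 2 entries this worker type actually writes to, instead of all ~40.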
To stop the agent running in a verdi worker:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a stop
To start the agent with a specific config file:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/home/ops/test.json
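To check the agent's status after restarting it (assuming the same standard install path as above):
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status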
Just to close the loop here: after running the agent with the "slimmer" config file, I restarted it using the original "fatter" config file and observed that CPU utilization went back up to ~200%. So the reduction wasn't just an artifact of restarting the agent.
We do use multiple launch templates, one per ASG, but they all share a single static user_data that installs that one agent config file. So it would be fairly easy for us to create multiple user_data templates with unique agent configs and assign them to each ASG dynamically in Terraform, the same way we handle other launch template settings (see the sketch below).
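A rough sketch of what that could look like in Terraform; the resource names, variables, and file paths here are hypothetical, not our actual modules:

```hcl
# One launch template per verdi worker type; each renders its own slimmed
# CloudWatch agent config into the shared user_data template.
resource "aws_launch_template" "verdi" {
  for_each    = toset(var.worker_types)   # e.g. ["disp_s1", "dswx_hls", ...]
  name_prefix = "verdi-${each.key}-"

  user_data = base64encode(templatefile("${path.module}/user_data.sh.tpl", {
    cw_agent_config = file("${path.module}/configs/cw-agent-${each.key}.json")
  }))
}
```

Each ASG would then reference the launch template for its worker type, the same way other per-ASG settings are wired up today.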
Update: That htop screenshot shows each process's threads as if they were separate processes, which is how I added them up to 200% CPU. The CPU utilization of the CloudWatch agent process before this fix peaked at around 120% and averaged about 100%.
Checked for duplicates
Yes - I've already checked
Describe the bug
When running the DISP-S1 PGE with m=6 and k=15, we're seeing up to 200% CPU utilization from several CloudWatch agent processes on the verdi worker. The machine has 16 cores and 128 GB of memory.
CloudWatch shouldn't be using more than 50% of a core.
What did you expect?
n/t
Reproducible steps
Environment