High CPU utilization with large collection of past engagements

timbrigham-oc commented 1 week ago

Description

My Ubuntu instance is seeing high CPU utilization from the Python process running Caldera. This becomes much more noticeable when a substantial number of previously run operations (~50 in my testing) are present, to the point where I get agent communication timeouts.

To Reproduce

1. Run multiple operations, at least some of which have 100+ steps.
2. Continue running / rerunning operations.
3. Over time the CPU utilization creeps up, and will eventually peg a single CPU core at 100%.
4. This drastically slows down responses, and can result in communication timeouts.

Testing

Restarting the Caldera server process does not help; CPU utilization returns to the same pattern fairly quickly. I have a small API script that lets me bulk-remove previous operations. Removing old runs decreases CPU utilization.
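The bulk-removal script mentioned above is not shown in the thread. A minimal sketch of what such a cleanup could look like follows. It assumes the Caldera v2 REST API exposes `GET /api/v2/operations` and `DELETE /api/v2/operations/{id}`, that requests authenticate with a `KEY` header, and that each operation object carries `id`, `state`, and an ISO-formatted `start` field. The server address, the API key, and the function names are placeholders, not the reporter's actual script.

```python
import json
import urllib.request

# Assumptions (not from the thread): server address, API key, endpoint paths,
# and the "id" / "state" / "start" fields on each operation object.
BASE_URL = "http://localhost:8888"
API_KEY = "ADMIN123"  # placeholder; substitute your own key


def _call(method, path, opener=urllib.request.urlopen):
    """Send one authenticated request; return the decoded JSON body or None."""
    req = urllib.request.Request(
        BASE_URL + path, method=method, headers={"KEY": API_KEY}
    )
    with opener(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def finished_operation_ids(operations, keep_latest=0):
    """Return IDs of finished operations, oldest first, sparing the newest few."""
    done = [op for op in operations if op.get("state") == "finished"]
    # ISO-8601 timestamps sort chronologically as plain strings.
    done.sort(key=lambda op: op.get("start", ""))
    if keep_latest > 0:
        done = done[:-keep_latest]
    return [op["id"] for op in done]


def purge_finished_operations(keep_latest=0, opener=urllib.request.urlopen):
    """Delete finished operations on the server; return the IDs removed."""
    operations = _call("GET", "/api/v2/operations", opener) or []
    removed = []
    for op_id in finished_operation_ids(operations, keep_latest):
        _call("DELETE", f"/api/v2/operations/{op_id}", opener)
        removed.append(op_id)
    return removed
```

Calling `purge_finished_operations(keep_latest=5)` would remove everything except running operations and the five most recent finished ones. The `opener` parameter exists only so the network call can be swapped out when trying the logic offline.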

Expected behavior

Formerly executed engagements should not have an impact on CPU utilization for ongoing processes. My guess is that, while an operation is being executed, links from former operations are still being evaluated and consuming CPU cycles, or something similar.

Environment

My test instance is based on the 5.0.0 tagged release, and includes a few customizations:
https://github.com/mitre/magma/pull/55
https://github.com/mitre/magma/pull/53
https://github.com/mitre/magma/pull/60

elegantmoose commented 4 days ago

Hmm, I wonder if this is hitting the limits of the simple in-memory "database" Caldera uses. Do you have any profiling stats on the memory usage as well? I'm wondering if it's constantly swapping pages out of RAM.
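One way to gather the memory numbers asked about here, sketched under the assumption of a Linux host (the report mentions Ubuntu), is to read the `VmRSS` and `VmSwap` fields from `/proc/<pid>/status` for the Caldera server process. The function names below are illustrative, and the PID has to be looked up separately.

```python
def parse_proc_status(text):
    """Extract the kB-valued fields (VmRSS, VmSwap, ...) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        # Memory lines look like "VmRSS:     123456 kB"; skip everything else.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def memory_snapshot(pid):
    """Return resident and swapped-out memory (in kB) for one process. Linux only."""
    with open(f"/proc/{pid}/status") as fh:
        fields = parse_proc_status(fh.read())
    return {"rss_kb": fields.get("VmRSS", 0), "swap_kb": fields.get("VmSwap", 0)}
```

Sampling `memory_snapshot(pid)` every few seconds during an active operation would show whether `swap_kb` grows as CPU climbs. A value that stays near zero would argue against swapping as the cause.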

*I'll admit, I don't think we have ever run 50+ operations at 100+ steps.

timbrigham-oc commented 4 days ago

Yeah, I could see that being a limiting factor. It's only (unusably) sluggish when there is an active operation and a bunch of historical data. I'm pretty sure my memory utilization was under 20% when I viewed it in top but no screenshot for proof. :)

It's also definitely something single-threaded in Python that's getting caught up. Only one of the multiple cores in my test instance gets pegged at 100%. That didn't make sense at first, since the two-core machine was only reporting ~55% total in the Azure console.
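The ~55% figure is consistent with one saturated core on a two-core machine, since an aggregate reading averages across cores. A small worked example follows; the 10% load on the second core is an illustrative assumption, not a measurement from the thread.

```python
def aggregate_cpu(per_core_percent):
    """Average per-core utilization, which is what a whole-machine gauge reports."""
    if not per_core_percent:
        raise ValueError("need at least one core reading")
    return sum(per_core_percent) / len(per_core_percent)


def saturated_cores(per_core_percent, threshold=95.0):
    """Return the indexes of cores at or above the threshold."""
    return [i for i, pct in enumerate(per_core_percent) if pct >= threshold]


# One core pegged by a single Python thread, the other mostly idle (assumed 10%).
readings = [100.0, 10.0]
print(aggregate_cpu(readings))    # 55.0
print(saturated_cores(readings))  # [0]
```

Because a single Python thread can occupy at most one core, a per-core view such as the one in `top` (press `1`) is a more reliable signal here than the machine-wide average.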

I'll include more details when I end up back in the same situation. Gotta love iterative development on a process that uses lateral movement. It blows up these counts in a hurry.

mitre / caldera

High CPU utilization with large collection of past engagements #3008