I put together a very basic visual representation of the resource contention in ursa-2, the EKS cluster running Arrow Conbench.
We see that the system has two machines, with two available cores per machine.
Around May 25 the 15-minute load average went from well below 5 into an absolutely intolerable regime. The fact that any of the processes running on these machines got any work done at all can be considered pure luck; the relative timing of events was disturbed badly. So: the load average plot shows that the system has been in a dysfunctional state since May 25, with the OOM failures being just one of many side effects.
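For reference, here is a minimal sketch (not taken from the deployment) of the comparison that statement rests on: the 15-minute load average versus the number of available cores.

```python
# Minimal sketch: compare the 15-minute load average against the CPU core
# count. On the ursa-2 nodes discussed here there are two cores per machine.
import os

load1, load5, load15 = os.getloadavg()  # 1-, 5-, 15-minute load averages
cores = os.cpu_count() or 1

# A 15-minute load average persistently far above the core count means that
# runnable tasks are queuing much faster than the CPUs can serve them.
print(f"load15={load15:.1f} cores={cores} ratio={load15 / cores:.1f}")
```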
Further, looking at CPU utilization of Conbench containers by host, we see two clear transitions:
That's when I deployed a newer Conbench version, including some BMRT cache changes. Average CPU utilization increased a bit, as expected.
This is the point in time shortly before things went into an abysmal state.
The 'pods running' metric is a great proxy for looking at pod crashes / OOM kills:
The transition seems to be really clean/unambiguous. I tried to get a closer look:
Here we see precisely that the disturbance started at 21:05/6/7 UTC.
We merged https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 at 21:03 UTC.
That had a drastic effect on arrow-bci-schedule-and-publish: it transitioned from erroring out after ~1-2 minutes to timing out after ~10 minutes. The transition, zoomed out:
--
The last short build was May 24 21:00 UTC, the first long build was 21:12 UTC:
Conclusion: we identified the main cause of things falling apart: an overload of the system with GET requests to api/compare/runs/id1...id2.
GET requests to api/compare/runs/id1...id2 probably got more expensive recently. Generating a response to such a request can (seemingly) require a lot of memory and CPU (that's also not really unexpected: complex queries are emitted, and a lot of data comes back).
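To make the request shape concrete, here is a hedged sketch of what such a call looks like from a client. The run IDs and the timeout value are placeholders/assumptions, not values taken from the actual CI job.

```python
# Sketch of the kind of request the CI job issues against Conbench; not the
# actual arrow-benchmarks-ci code. Run IDs below are placeholders.
import requests

BASE_URL = "https://conbench.ursa.dev"
baseline_run_id = "<baseline-run-id>"    # placeholder
contender_run_id = "<contender-run-id>"  # placeholder

resp = requests.get(
    f"{BASE_URL}/api/compare/runs/{baseline_run_id}...{contender_run_id}",
    timeout=600,  # assumption: generous timeout; responses took minutes under load
)
resp.raise_for_status()
comparison = resp.json()
```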
Confidence in the conclusion comes mainly from careful timing comparison. Merging https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 was like flipping a switch. How beautiful is that. Implicitly, via merging #114 we basically did bisection work before knowing it would be needed.
With the CI job arrow-bci-schedule-and-publish we are/were swamping the system with such requests. In that state, most handlers are occupied trying to respond to one such request each; they hinder each other, and everything is starving.
Of course one could go much deeper into the analysis here: how exactly these execution units compete for resources, and how backpressure effects propagate through the system. The point is: everything in this state is pathological, nothing is sane. Some things may still appear to work. But we don't need to make things work when the 15-minute load average is 10x above the CPU core count; we need to work on not getting into this state in the first place.
Everything is funny in this state. Example: 420 seconds for login (incl retries):
[230601-00:34:54.801] [2702890] [root] INFO: Time to POST https://conbench.ursa.dev/api/login/: 420.73130 s
Thank you for the fantastic analysis! To me, this suggests many possible solutions, but if one solution is to reduce the frequency of the schedule-and-publish pipeline making requests to /api/compare/runs, then IMO #98 is the most straightforward way to do so. As I mentioned in https://github.com/voltrondata-labs/arrow-benchmarks-ci/issues/118#issuecomment-1570509936, doing so will cut the usage of that endpoint by a factor of 4, and potentially allow for other improvements as well.
> Confidence in the conclusion comes mainly from careful timing comparison. Merging https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 was like flipping a switch. How beautiful is that. Implicitly, via merging https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 we basically did bisection work before knowing it would be needed.
Could you say a little bit more about this? I think I might be misinterpreting what you're intending here.
> but if one solution is to reduce the frequency of the schedule-and-publish pipeline making requests to /api/compare/runs
This is implicit in what you say there, Austin, but I wanted to make this explicit just in case anyone might be misinterpreting (though of course correct me if I'm the one who has misinterpreted this!). What you're suggesting here is reducing the number of times that the compare endpoint is called within one run of the schedule-and-publish pipeline. But you're not suggesting that we need to change the current 10-minute interval of the schedule-and-publish pipeline. This makes a lot of sense because although the schedule-and-publish pipeline runs every 10 minutes, it actually only interacts with the compare endpoint if there are results that are done and need to be compared (so it's not actually running 4 compares every 10 minutes, but rather 4 compares when a run is ready to be compared; to be concrete, this run is one example where no compares were called because no runs were ready for that).
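To make that gating explicit, here is a small illustrative sketch; the names and data structures are mine, not the actual arrow-benchmarks-ci code.

```python
# Hedged sketch of the gating described above; illustrative names only.
from dataclasses import dataclass, field

@dataclass
class Run:
    id: str
    is_finished: bool
    baseline_run_ids: list = field(default_factory=list)  # up to 4 per run in practice

def compare_urls_for(runs):
    """Return the /api/compare/runs/ URLs a single pipeline iteration would hit."""
    urls = []
    for run in runs:
        if not run.is_finished:
            # Nothing ready yet: this iteration touches the compare endpoint zero times.
            continue
        for baseline_id in run.baseline_run_ids:
            urls.append(f"/api/compare/runs/{baseline_id}...{run.id}")
    return urls

# One unfinished run -> no compare calls; one finished run with four baselines
# -> four compare calls in that iteration.
print(compare_urls_for([Run("a", False), Run("b", True, ["b1", "b2", "b3", "b4"])]))
```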
> Could you say a little bit more about this?
Without the tiny bug that https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 fixed, we would have seen the resource contention two days earlier, at the same time as deploying the new Conbench version to conbench.ursa.dev. It would have been difficult to say which of the many changes dominates the resource contention. But with #114 we had enough clear separation to say that it was the tiny change in #114 which was necessary (but not sufficient) to trigger the extreme resource contention. That allowed for concluding that the api/compare endpoint has a major influence on the resource contention.
Ah interesting. If my reading of the code is correct, we might even be able to be more specific: it's my understanding that the bug that #114 fixed prevented 3 of the 4 compare calls, but the first one was still executed (since that bug was in the response handling). This suggests (though certainly doesn't conclusively determine, of course) that processing one of these requests at a time is intensive but not so problematic that the whole system goes down; it's when you add in the subsequent calls that things go bad.
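Schematically, that reading would correspond to something like the following; this is purely illustrative, with the bug simulated, and it is not the real pipeline code.

```python
# Purely schematic illustration of the reading above; not the actual
# arrow-benchmarks-ci code, and the bug is simulated.
requests_made = []

def fetch(url):
    requests_made.append(url)
    return {"some": "response"}

def handle_response_with_bug(response):
    # Stand-in for the pre-#114 bug: handling the *response* of a compare
    # call raised, which aborted the loop before the remaining calls.
    raise KeyError("field missing from response")

def compare_all(baseline_ids, contender_id, handle_response):
    for baseline_id in baseline_ids:
        handle_response(fetch(f"/api/compare/runs/{baseline_id}...{contender_id}"))

try:
    compare_all(["b1", "b2", "b3", "b4"], "contender", handle_response_with_bug)
except KeyError:
    pass

print(len(requests_made))  # 1: only the first expensive request went out pre-#114
```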
> What you're suggesting here is reducing the number of times that the compare endpoint is called within one run of the schedule-and-publish pipeline. But you're not suggesting that we need to change the current 10-minute interval of the schedule-and-publish pipeline.
That's correct. But we can always revisit the frequency of that pipeline once we switch to benchalerts and collect more data on how long it's taking.
The 15-min load average is still periodically spiking into intolerable regimes, every few hours (triggered by CI):
Each such incident still results in rapid memory usage growth, and in OOM events:
Each of these overload phases lasts about 30 minutes.
Finally got hands on detailed OOM context (these details quickly become invisible after a few hours):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 38m (x6 over 43m) default-scheduler 0/2 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
Warning FailedScheduling 36m (x2 over 37m) default-scheduler 0/2 nodes are available: 2 Insufficient memory.
Normal Scheduled 36m default-scheduler Successfully assigned default/conbench-deployment-f7f7b74bc-4j5cg to ip-172-31-43-148.us-east-2.compute.internal
Normal Pulling 36m kubelet Pulling image "415900421099.dkr.ecr.us-east-2.amazonaws.com/conbench:067f1645b061ffb742e0aac7a138072a5b000cbb"
Normal Pulled 36m kubelet Successfully pulled image "415900421099.dkr.ecr.us-east-2.amazonaws.com/conbench:067f1645b061ffb742e0aac7a138072a5b000cbb" in 185.704537ms
Normal Created 36m kubelet Created container conbench
Normal Started 36m kubelet Started container conbench
Warning Evicted 28m kubelet The node was low on resource: memory. Container conbench was using 3975828Ki, which exceeds its request of 2500Mi.
Normal Killing 28m kubelet Stopping container conbench
Warning ExceededGracePeriod 28m kubelet Container runtime did not kill the pod within specified grace period.
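Since these event records age out of the cluster quickly, a small sketch like the following could help snapshot them next time (it uses the official Kubernetes Python client; the pod name and namespace come from the output above, everything else is an assumption about how one might capture them).

```python
# Hedged sketch (not part of the deployment): persist pod events before the
# cluster forgets them. Requires the "kubernetes" Python package.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    namespace="default",
    field_selector="involvedObject.name=conbench-deployment-f7f7b74bc-4j5cg",
)
for ev in events.items:
    print(ev.last_timestamp, ev.type, ev.reason, ev.message)
```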
A container without a memory limit is not touched as long as the node is not low on memory. But once a node becomes low on memory, badaboom:
> If a container exceeds its memory request and the node that it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted.
Summary:
Annotations: kubernetes.io/psp: eks.privileged
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container conbench was using 3975828Ki, which exceeds its request of 2500Mi.
IP: 172.31.47.184
IPs:
IP: 172.31.47.184
Controlled By: ReplicaSet/conbench-deployment-f7f7b74bc
Adding 2 nodes with 16 GB memory to ursa-2. Idea: 8 GB x 2 pods per Conbench x 2 Conbench instances -> 32 GB. We should request something like 6 GB per Conbench container. We cannot bump the requested amount of memory just now, because then things would become unschedulable on vd-2. So we leave the request at 2.5 GB for now (we can bump it when we increase the resource footprint for vd-2, too).
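For clarity, the back-of-the-envelope arithmetic behind that node addition, restating the numbers from this comment in a tiny sketch:

```python
# Back-of-the-envelope sizing, restating the numbers above; not a sizing tool.
pods_per_conbench = 2
conbench_instances = 2
target_memory_per_pod_gb = 8

needed_gb = target_memory_per_pod_gb * pods_per_conbench * conbench_instances  # 32 GB
added_gb = 2 * 16  # two new 16 GB nodes -> 32 GB

print(needed_gb, added_gb)

# The Kubernetes memory request stays at 2.5 GiB for now; bumping it toward the
# ~6 GiB target has to wait until vd-2's resource footprint grows as well.
```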
The deadly 15 min load average spikes are gone for now:
But this picture also sheds more light on the memory leak topic: each time a CPU utilization spike occurs, there is growth in RSS that doesn't quite look reasonable.
A new symptom: https://github.com/voltrondata-labs/arrow-benchmarks-ci/issues/129 (adding to the list in the first comment here).
@jonkeane made progress on https://github.com/voltrondata-labs/arrow-benchmarks-ci/issues/131 which is great.
Other benchmark builds were largely green in the last 24 hours. There were a few red builds due to 'agent lost'.
Good progress here.
Conbench pods are not crashing anymore:
Memory consumption has stabilized:
15-min load average has returned to be in a healthy regime:
After fixing all the issues linked in the original post here, and since all benchmark builds are consistently green again, I think we can close this!
Builds have been failing at a high rate recently: screenshot
This has been going on for a couple of days. Let's consider this resolved when the arrow CI build stability looks more healthy again.
We have so far seen a mixture of Conbench API interaction errors and client tooling weaknesses. Different symptoms: