I put together a very basic visual representation of the resource contention in ursa-2, the EKS cluster running Arrow Conbench.
We see that the system has two machines, with two available cores per machine.
Around May 25 the 15-minute load average went from well below 5 into an absolutely intolerable regime. The fact that any of the processes running on these machines got any work done at all can be considered pure luck; the relative timing of events was disturbed badly. So: the load average plot shows that the system has been in a dysfunctional state since May 25, with the OOM failures being just one of many side effects.
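For reference, here is a minimal sketch (not taken from the deployment) of the comparison that statement rests on: the 15-minute load average versus the number of available cores.

```python
# Minimal sketch: compare the 15-minute load average against the CPU core
# count. On the ursa-2 nodes discussed here there are two cores per machine.
import os

load1, load5, load15 = os.getloadavg()  # 1-, 5-, 15-minute load averages
cores = os.cpu_count() or 1

# A 15-minute load average persistently far above the core count means that
# runnable tasks are queuing much faster than the CPUs can serve them.
print(f"load15={load15:.1f} cores={cores} ratio={load15 / cores:.1f}")
```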
Further, looking at CPU utilization of Conbench containers by host, we see two clear transitions:
That's when I deployed a newer Conbench version, including some BMRT cache changes. Average CPU utilization increased a bit, as expected.
This is the point in time shortly before things went into an abysmal state.
The 'pods running' metric is a great proxy for looking at pod crashes / OOM kills:
The transition seems to be really clean/unambiguous. I tried to get a closer look:
Here we see precisely that the disturbance started at 21:05/6/7 UTC.
We merged https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 at 21:03 UTC.
That had a drastic effect on arrow-bci-schedule-and-publish: it transitioned from erroring out after ~1-2 minutes to timing out after ~10 minutes. The transition, zoomed out:
--
The last short build was May 24 21:00 UTC, the first long build was 21:12 UTC:
Conclusion: we identified the main cause of things falling apart: an overload of the system with GET requests to api/compare/runs/id1...id2.
GET requests to api/compare/runs/id1...id2 probably got more expensive recently. Generating a response to such a request can (seemingly) require a lot of memory and CPU (that's also not really unexpected: complex queries are emitted, and a lot of data comes back).
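To make the request shape concrete, here is a hedged sketch of what such a call looks like from a client. The run IDs and the timeout value are placeholders/assumptions, not values taken from the actual CI job.

```python
# Sketch of the kind of request the CI job issues against Conbench; not the
# actual arrow-benchmarks-ci code. Run IDs below are placeholders.
import requests

BASE_URL = "https://conbench.ursa.dev"
baseline_run_id = "<baseline-run-id>"    # placeholder
contender_run_id = "<contender-run-id>"  # placeholder

resp = requests.get(
    f"{BASE_URL}/api/compare/runs/{baseline_run_id}...{contender_run_id}",
    timeout=600,  # assumption: generous timeout; responses took minutes under load
)
resp.raise_for_status()
comparison = resp.json()
```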
Confidence in the conclusion comes mainly from careful timing comparison. Merging https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 was like flipping a switch. How beautiful is that. Implicitly, via merging #114 we basically did bisection work before knowing it would be needed.
With the CI job arrow-bci-schedule-and-publish we are/were swamping the system with such requests. In that state, most handlers are occupied trying to respond to one such request each; they hinder each other, and everything is starving.
Of course one could go much deeper into the analysis here: how exactly these execution units compete for resources, and how backpressure effects propagate through the system. The point is: everything in this state is pathological, nothing is sane. Some things may still appear to work. But we don't need to make things work when the 15-minute load average is 10x above the CPU core count; we need to work on not getting into this state in the first place.
Everything is funny in this state. Example: 420 seconds for login (incl retries):
[230601-00:34:54.801] [2702890] [root] INFO: Time to POST https://conbench.ursa.dev/api/login/: 420.73130 s
Thank you for the fantastic analysis! To me, this suggests many possible solutions, but if one solution is to reduce the frequency of the schedule-and-publish pipeline making requests to /api/compare/runs, then IMO #98 is the most straightforward way to do so. As I mentioned in https://github.com/voltrondata-labs/arrow-benchmarks-ci/issues/118#issuecomment-1570509936, doing so will cut the usage of that endpoint by a factor of 4, and potentially allow for other improvements as well.
> Confidence in the conclusion comes mainly from careful timing comparison. Merging https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 was like flipping a switch. How beautiful is that. Implicitly, via merging https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 we basically did bisection work before knowing it would be needed.
Could you say a little bit more about this? I think I might be misinterpreting what you're intending here.
> but if one solution is to reduce the frequency of the schedule-and-publish pipeline making requests to /api/compare/runs
This is implicit in what you say there, Austin, but I wanted to make this explicit just in case anyone might be misinterpreting (though of course correct me if I'm the one who has misinterpreted this!). What you're suggesting here is reducing the number of times that the compare endpoint is called within one run of the schedule-and-publish pipeline. But you're not suggesting that we need to change the current 10-minute interval of the schedule-and-publish pipeline. This makes a lot of sense because although the schedule-and-publish pipeline runs every 10 minutes, it actually only interacts with the compare endpoint if there are results that are done and need to be compared (so it's not actually running 4 compares every 10 minutes, but rather 4 compares when a run is ready to be compared; to be concrete, this run is one example where no compares were called because no runs were ready for that).
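To make that gating explicit, here is a small illustrative sketch; the names and data structures are mine, not the actual arrow-benchmarks-ci code.

```python
# Hedged sketch of the gating described above; illustrative names only.
from dataclasses import dataclass, field

@dataclass
class Run:
    id: str
    is_finished: bool
    baseline_run_ids: list = field(default_factory=list)  # up to 4 per run in practice

def compare_urls_for(runs):
    """Return the /api/compare/runs/ URLs a single pipeline iteration would hit."""
    urls = []
    for run in runs:
        if not run.is_finished:
            # Nothing ready yet: this iteration touches the compare endpoint zero times.
            continue
        for baseline_id in run.baseline_run_ids:
            urls.append(f"/api/compare/runs/{baseline_id}...{run.id}")
    return urls

# One unfinished run -> no compare calls; one finished run with four baselines
# -> four compare calls in that iteration.
print(compare_urls_for([Run("a", False), Run("b", True, ["b1", "b2", "b3", "b4"])]))
```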
> Could you say a little bit more about this?
Without the tiny bug that https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/114 fixed, we would have seen the resource contention two days earlier, at the same time as deploying the new Conbench version to conbench.ursa.dev. It would have been difficult to say which of the many changes dominates the resource contention. But with #114 we had enough clear separation to say that it was the tiny change in #114 which was necessary (but not sufficient) to trigger the extreme resource contention. That allowed for concluding that the api/compare endpoint has a major influence on the resource contention.
Ah interesting. If my reading of the code is correct, we might even be able to be more specific: it's my understanding that the bug that #114 fixed prevented 3 of the 4 compare calls, but the first one was still executed (since that bug was in the response handling). This suggests (though certainly doesn't conclusively determine, of course) that processing one of these requests at a time is intensive but not so problematic that the whole system goes down; it's when you add in the subsequent calls that things go bad.
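Schematically, that reading would correspond to something like the following; this is purely illustrative, with the bug simulated, and it is not the real pipeline code.

```python
# Purely schematic illustration of the reading above; not the actual
# arrow-benchmarks-ci code, and the bug is simulated.
requests_made = []

def fetch(url):
    requests_made.append(url)
    return {"some": "response"}

def handle_response_with_bug(response):
    # Stand-in for the pre-#114 bug: handling the *response* of a compare
    # call raised, which aborted the loop before the remaining calls.
    raise KeyError("field missing from response")

def compare_all(baseline_ids, contender_id, handle_response):
    for baseline_id in baseline_ids:
        handle_response(fetch(f"/api/compare/runs/{baseline_id}...{contender_id}"))

try:
    compare_all(["b1", "b2", "b3", "b4"], "contender", handle_response_with_bug)
except KeyError:
    pass

print(len(requests_made))  # 1: only the first expensive request went out pre-#114
```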
> What you're suggesting here is reducing the number of times that the compare endpoint is called within one run of the schedule-and-publish pipeline. But you're not suggesting that we need to change the current 10-minute interval of the schedule-and-publish pipeline.
That's correct. But we can always revisit the frequency of that pipeline once we switch to benchalerts and collect more data on how long it's taking.
The 15-min load average is still periodically spiking into intolerable regimes, every few hours (triggered by CI):
Each such incident still results in rapid memory usage growth, and in OOM events:
Each of these overload phases lasts about 30 minutes.
Finally got hands on detailed OOM context (these details quickly become invisible after a few hours):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 38m (x6 over 43m) default-scheduler 0/2 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
Warning FailedScheduling 36m (x2 over 37m) default-scheduler 0/2 nodes are available: 2 Insufficient memory.
Normal Scheduled 36m default-scheduler Successfully assigned default/conbench-deployment-f7f7b74bc-4j5cg to ip-172-31-43-148.us-east-2.compute.internal
Normal Pulling 36m kubelet Pulling image "415900421099.dkr.ecr.us-east-2.amazonaws.com/conbench:067f1645b061ffb742e0aac7a138072a5b000cbb"
Normal Pulled 36m kubelet Successfully pulled image "415900421099.dkr.ecr.us-east-2.amazonaws.com/conbench:067f1645b061ffb742e0aac7a138072a5b000cbb" in 185.704537ms
Normal Created 36m kubelet Created container conbench
Normal Started 36m kubelet Started container conbench
Warning Evicted 28m kubelet The node was low on resource: memory. Container conbench was using 3975828Ki, which exceeds its request of 2500Mi.
Normal Killing 28m kubelet Stopping container conbench
Warning ExceededGracePeriod 28m kubelet Container runtime did not kill the pod within specified grace period.
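Since these event records age out of the cluster quickly, a small sketch like the following could help snapshot them next time (it uses the official Kubernetes Python client; the pod name and namespace come from the output above, everything else is an assumption about how one might capture them).

```python
# Hedged sketch (not part of the deployment): persist pod events before the
# cluster forgets them. Requires the "kubernetes" Python package.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    namespace="default",
    field_selector="involvedObject.name=conbench-deployment-f7f7b74bc-4j5cg",
)
for ev in events.items:
    print(ev.last_timestamp, ev.type, ev.reason, ev.message)
```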
A container without a memory limit is not touched as long as the node is not low on memory. But once a node becomes low on memory, badaboom:
> If a container exceeds its memory request and the node that it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted.
Summary:
Annotations: kubernetes.io/psp: eks.privileged
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container conbench was using 3975828Ki, which exceeds its request of 2500Mi.
IP: 172.31.47.184
IPs:
IP: 172.31.47.184
Controlled By: ReplicaSet/conbench-deployment-f7f7b74bc
Adding 2 nodes with 16 GB memory to ursa-2. Idea: 8 GB x 2 pods per Conbench x 2 Conbench instances -> 32 GB. We should request something like 6 GB per Conbench container. We cannot bump the requested amount of memory just now, because then things would become unschedulable on vd-2. So we leave the request at 2.5 GB for now (we can bump it when we increase the resource footprint for vd-2, too).
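For clarity, the back-of-the-envelope arithmetic behind that node addition, restating the numbers from this comment in a tiny sketch:

```python
# Back-of-the-envelope sizing, restating the numbers above; not a sizing tool.
pods_per_conbench = 2
conbench_instances = 2
target_memory_per_pod_gb = 8

needed_gb = target_memory_per_pod_gb * pods_per_conbench * conbench_instances  # 32 GB
added_gb = 2 * 16  # two new 16 GB nodes -> 32 GB

print(needed_gb, added_gb)

# The Kubernetes memory request stays at 2.5 GiB for now; bumping it toward the
# ~6 GiB target has to wait until vd-2's resource footprint grows as well.
```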
The deadly 15 min load average spikes are gone for now:
But this picture also sheds more light on the memory leak topic: each time a CPU utilization spike occurs, there is growth in RSS that doesn't quite look reasonable.
A new symptom: https://github.com/voltrondata-labs/arrow-benchmarks-ci/issues/129 (adding to the list in the first comment here).
@jonkeane made progress on https://github.com/voltrondata-labs/arrow-benchmarks-ci/issues/131 which is great.
Other benchmark builds were largely green in the last 24 hours. There were a few red builds due to 'agent lost'.
Good progress here.
Conbench pods are not crashing anymore:
Memory consumption has stabilized:
15-min load average has returned to be in a healthy regime:
After fixing all the issues linked in the original post here, and since all benchmark builds are consistently green again, I think we can close this!
Builds have been failing at a high rate recently: screenshot
This has been going on for a couple of days. Let's consider this resolved when the arrow CI build stability looks more healthy again.
We have so far seen a mixture of Conbench API interaction errors and client tooling weaknesses. Different symptoms: