voltrondata-labs / arrow-benchmarks-ci

Benchmarks CI for Apache Arrow project
MIT License
0 stars 5 forks source link

tpch benchmarks on `ursa-i9-9960x` are sometimes freezing #173

Closed austin3dickey closed 3 months ago

austin3dickey commented 9 months ago

Three times since Saturday we've seen the ursa-i9-9960x benchmark job be canceled after the 6-hour timeout:

image

The Buildkite logs all end with

INFO:buildkite.benchmark.run:start child process: conbench tpch --iterations=3 --all=true --drop-caches=true --run-id=$RUN_ID --run-name="$RUN_NAME" --run-reason="$RUN_REASON"
# Received cancellation signal, interrupting
Terminated
🚨 Error: The command was interrupted by a signal
2024-01-08 13:32:01 DEBUG  Terminating bootstrap after cancellation with terminated

As far as I can tell, the job runs fine before this point.

Here's a recent good build and its Conbench run, vs a bad build and its Conbench run. I can see in the log timestamps that both builds took 2 hours to reach the start child process: conbench tpch line. And both Conbench runs have results for TPC-H queries 1-20 but the bad build is missing results for queries 21 and 22. (This is the same for the other bad builds). So my guess is that something starts to hang in query 21. It's hard for me to tell as we don't get logs and this doesn't happen every time.

For posterity, here's the code for the TPC-H benchmark.

austin3dickey commented 9 months ago

Also, the TPC-H benchmarks are running fine on the other machines.