Open jgehrcke opened 1 year ago
Did
FAILED R tpch
fail because of the TLS failure shown above?
I believe the answer is no for two reasons. First, the Failed to post to
log string only appears in this codebase at the end of this function:
That function looks like it should never raise an exception; it should just log any exceptions that occur. Though of course, due to #134, that doesn't necessarily mean that "not raising an exception" will turn into "benchmarks are marked as successful". So that brings me to my second point.
Second, I think I found the "true" benchmark failures by searching "error":
in the build log. (That is, I was effectively searching through the posted JSON for any benchmark result payloads that contain the "error" key.) For the TPCH-21 query at scale factor 10, both native
and parquet
formats, we see these logs:
[230612-02:07:08.497] [7491] [root] ERROR: {"timestamp": "2023-06-12T02:07:08.496816+00:00", "tags": {"cpu_count": null, "engine": "arrow", "memory_map": false, "query_id": "TPCH-21", "scale_factor": 10, "format": "native", "language": "R", "name": "tpch"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.4.0", "benchmark_language_version": "R version 4.2.3 (2023-03-15)", "arrow_version_r": "12.0.0.9000"}, "context": {"arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always", "benchmark_language": "R"}, "error": "sh: line 1: 27111 Killed /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmp6Kv4nR/file69cb61a1309d' 2>&1", "command": "library(arrowbench); run_one(tpc_h, cpu_count=NULL, format='native', scale_factor=10, engine='arrow', memory_map=FALSE, query_id=21)"}
Traceback (most recent call last):
File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 285, in r_benchmark
result, output = self._get_benchmark_result(command)
File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 320, in _get_benchmark_result
raise Exception(error)
Exception: sh: line 1: 27111 Killed /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmp6Kv4nR/file69cb61a1309d' 2>&1
[230612-02:07:44.689] [7491] [root] ERROR: {"timestamp": "2023-06-12T02:07:44.688751+00:00", "tags": {"cpu_count": null, "engine": "arrow", "memory_map": false, "query_id": "TPCH-21", "scale_factor": 10, "format": "parquet", "language": "R", "name": "tpch"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.4.0", "benchmark_language_version": "R version 4.2.3 (2023-03-15)", "arrow_version_r": "12.0.0.9000"}, "context": {"arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always", "benchmark_language": "R"}, "error": "sh: line 1: 27181 Killed /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmpw3CkLZ/file6a1053b0c973' 2>&1", "command": "library(arrowbench); run_one(tpc_h, cpu_count=NULL, format='parquet', scale_factor=10, engine='arrow', memory_map=FALSE, query_id=21)"}
Traceback (most recent call last):
File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 285, in r_benchmark
result, output = self._get_benchmark_result(command)
File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 320, in _get_benchmark_result
raise Exception(error)
Exception: sh: line 1: 27181 Killed /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmpw3CkLZ/file6a1053b0c973' 2>&1
This suggests to me that these specific machines (the T4 machines, which are the only ones failing with this issue right now) cannot handle that query at scale factor 10. They seem to run out of memory. That's a separate problem from this ticket.
But is that separate problem correlated with this problem? Interestingly, we see from your original post that one part of the logs that were attempted to be posted to arrow-bci.ursa.dev
was 'process_pid': 27111
. That's the same process that was killed! So perhaps the fact that the process was killed interfered with the posting of the logs in some way.
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2827#0188ac67-e5f6-4afc-81a0-86a9293efad5/6-842
Then later in the build log:
post_logs_to_arrow_bci()
called? What's it supposed to do?FAILED R tpch
fail because of the TLS failure shown above?