voltrondata-labs / arrow-benchmarks-ci

Benchmarks CI for Apache Arrow project
MIT License
0 stars 5 forks source link

Failed to post to logs: TLS/SSL connection has been closed (EOF) (_ssl.c:1131)'))) #133

Open jgehrcke opened 1 year ago

jgehrcke commented 1 year ago

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2827#0188ac67-e5f6-4afc-81a0-86a9293efad5/6-842

[230611-23:57:22.275] [30266] [buildkite.benchmark.run] INFO: start child process: conbench tpch --iterations=3 --all=true --drop-caches=true --run-id=$RUN_ID --run-name="$RUN_NAME" --run-reason="$RUN_REASON"
ERROR:root:Failed to post to logs: {'type': 'MemoryUsage', 'id': 'fd556663668f44b08e8af356df10aa9d', 'benchmark_group_execution_id': '08cf72a35485413e813ea76205778665', 'process_pid': 27111, 'parent_process_pid': 30266, 'process_name': 'R', 'process_cmdline': ['/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/exec/R', '--no-save', '-s'], 'mem_rss_bytes': 32255574016, 'mem_percent': 96.50920078786864}
Traceback (most recent call last):

...
Traceback (most recent call last):
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/buildkite/benchmark/run_utils.py", line 53, in post_logs_to_arrow_bci
    requests.post(
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/requests/adapters.py", line 517, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='arrow-bci.ursa.dev', port=443): Max retries exceeded with url: /logs (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1131)')))
[230612-02:12:10.264] [30266] [buildkite.benchmark.run] INFO: child process exited with code 0.

Then later in the build log:

FAILED R tpch 0 bench.ursa.dev/api/benchmark-results/
[230612-02:12:09.979] [7491] [benchclients.http] INFO: POST request to https://conbench.ursa.dev/api/benchmark-results/: took 0.8682 s, response status code: 201
Traceback (most recent call last):
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/buildkite/benchmark/run_benchmark_groups.py", line 13, in <module>
    run_benchmark_groups()
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/buildkite/benchmark/run_benchmark_groups.py", line 10, in run_benchmark_groups
    run.run_all_benchmark_groups()
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/buildkite/benchmark/run.py", line 695, in run_all_benchmark_groups
    raise Exception("Build has failed benchmarks.")
Exception: Build has failed benchmarks.
austin3dickey commented 1 year ago

Did FAILED R tpch fail because of the TLS failure shown above?

I believe the answer is no for two reasons. First, the Failed to post to log string only appears in this codebase at the end of this function:

https://github.com/voltrondata-labs/arrow-benchmarks-ci/blob/03717c4d8da15f3d77bd1e74f0781f8fce80e9eb/buildkite/benchmark/run_utils.py#L48-L63

That function looks like it should never raise an exception; it should just log any exceptions that occur. Though of course, due to #134, that doesn't necessarily mean that "not raising an exception" will turn into "benchmarks are marked as successful". So that brings me to my second point.

Second, I think I found the "true" benchmark failures by searching "error": in the build log. (That is, I was effectively searching through the posted JSON for any benchmark result payloads that contain the "error" key.) For the TPCH-21 query at scale factor 10, both native and parquet formats, we see these logs:

[230612-02:07:08.497] [7491] [root] ERROR: {"timestamp": "2023-06-12T02:07:08.496816+00:00", "tags": {"cpu_count": null, "engine": "arrow", "memory_map": false, "query_id": "TPCH-21", "scale_factor": 10, "format": "native", "language": "R", "name": "tpch"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.4.0", "benchmark_language_version": "R version 4.2.3 (2023-03-15)", "arrow_version_r": "12.0.0.9000"}, "context": {"arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always", "benchmark_language": "R"}, "error": "sh: line 1: 27111 Killed                  /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmp6Kv4nR/file69cb61a1309d' 2>&1", "command": "library(arrowbench); run_one(tpc_h, cpu_count=NULL, format='native', scale_factor=10, engine='arrow', memory_map=FALSE, query_id=21)"}
Traceback (most recent call last):
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 285, in r_benchmark
    result, output = self._get_benchmark_result(command)
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 320, in _get_benchmark_result
    raise Exception(error)
Exception: sh: line 1: 27111 Killed                  /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmp6Kv4nR/file69cb61a1309d' 2>&1
[230612-02:07:44.689] [7491] [root] ERROR: {"timestamp": "2023-06-12T02:07:44.688751+00:00", "tags": {"cpu_count": null, "engine": "arrow", "memory_map": false, "query_id": "TPCH-21", "scale_factor": 10, "format": "parquet", "language": "R", "name": "tpch"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.4.0", "benchmark_language_version": "R version 4.2.3 (2023-03-15)", "arrow_version_r": "12.0.0.9000"}, "context": {"arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always", "benchmark_language": "R"}, "error": "sh: line 1: 27181 Killed                  /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmpw3CkLZ/file6a1053b0c973' 2>&1", "command": "library(arrowbench); run_one(tpc_h, cpu_count=NULL, format='parquet', scale_factor=10, engine='arrow', memory_map=FALSE, query_id=21)"}
Traceback (most recent call last):
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 285, in r_benchmark
    result, output = self._get_benchmark_result(command)
  File "/var/lib/buildkite-agent/builds/aws-arm64-t4g-linux-compute-i-0d98ea5af31709bb4-1/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/benchmarks/benchmarks/_benchmark.py", line 320, in _get_benchmark_result
    raise Exception(error)
Exception: sh: line 1: 27181 Killed                  /var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/R/bin/R --no-save -s < '/tmp/Rtmpw3CkLZ/file6a1053b0c973' 2>&1

This suggests to me that these specific machines (the T4 machines, which are the only ones failing with this issue right now) cannot handle that query at scale factor 10. They seem to run out of memory. That's a separate problem from this ticket.

But is that separate problem correlated with this problem? Interestingly, we see from your original post that one part of the logs that were attempted to be posted to arrow-bci.ursa.dev was 'process_pid': 27111. That's the same process that was killed! So perhaps the fact that the process was killed interfered with the posting of the logs in some way.

ElenaHenderson commented 1 year ago