voltrondata-labs / arrow-benchmarks-ci

Benchmarks CI for Apache Arrow project
MIT License

benchmark CI runs are failing due to 502, 503, or 504 when POSTing results to Conbench #116

Closed austin3dickey closed 1 year ago

austin3dickey commented 1 year ago

The benchmark runs have been unhealthy for about a week:

image

Judging from a sample of logs, these failures are due to 502s, 503s, or 504s when POSTing benchmark results to Conbench. It does not happen on every POST, but it happens often enough to fail most builds in aggregate. Here are some examples.

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2947

[230530-05:10:59.364] [1753420] [benchmarks.dataset_serialize_benchmark] INFO: stdout of ['du', '-sh', '/dev/shm/bench-d9ff104b/10pc-csv-592d24e8-9c88-4fda-962b-8d107015aca3']: 779M
[230530-05:10:59.364] [1753420] [benchmarks.dataset_serialize_benchmark] INFO: removing directory: /dev/shm/bench-d9ff104b/10pc-csv-592d24e8-9c88-4fda-962b-8d107015aca3
[230530-05:10:59.461] [1753420] [benchmarks.dataset_serialize_benchmark] INFO: case ('100pc', 'parquet'): create directory
[230530-05:10:59.461] [1753420] [benchmarks.dataset_serialize_benchmark] INFO: directory created, path: /dev/shm/bench-d9ff104b/100pc-parquet-58b6a013-5ecc-4361-96ad-de53d72965f0
[230530-05:10:59.461] [1753420] [benchmarks.dataset_serialize_benchmark] INFO: read complete dataset nyctaxi_multi_ipc_s3 into memory
[230530-05:11:01.254] [1753420] [benchmarks.dataset_serialize_benchmark] INFO: read source dataset into memory in 1.7933 s
[230530-05:15:10.258] [1753420] [urllib3.util.retry] DEBUG: Incremented Retry for (url='/api/benchmarks/'): Retry(total=4, connect=None, read=None, redirect=None, status=None)
[230530-05:16:10.299] [1753420] [urllib3.util.retry] DEBUG: Incremented Retry for (url='/api/benchmarks/'): Retry(total=3, connect=None, read=None, redirect=None, status=None)
[230530-05:17:18.349] [1753420] [urllib3.util.retry] DEBUG: Incremented Retry for (url='/api/benchmarks/'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
[230530-05:18:34.406] [1753420] [urllib3.util.retry] DEBUG: Incremented Retry for (url='/api/benchmarks/'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
[230530-05:20:01.305] [1753420] [urllib3.util.retry] DEBUG: Incremented Retry for (url='/api/benchmarks/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
[230530-05:21:05.575] [1753420] [root] ERROR: {"timestamp": "2023-05-30T10:21:05.575806+00:00", "tags": {"dataset": "nyctaxi_multi_ipc_s3", "cpu_count": null, "selectivity": "100pc", "format": "parquet", "name": "dataset-serialize"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.3.0", "benchmark_language_version": "Python 3.8.16"}, "context": {"arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /var/lib/buildkite-agent/miniconda3/envs/arrow-commit/include -fdiagnostics-color=always", "benchmark_language": "Python"}, "error": "HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/benchmarks/ (Caused by ResponseError('too many 503 error responses'))"}

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2765

[230530-10:16:33.713] [23776] [benchadapt.adapters] INFO: Results transformation completed
[230530-10:16:33.714] [23776] [benchadapt.adapters] INFO: Initializing conbench client
[230530-10:16:33.714] [23776] [benchclients.logging] DEBUG: POST https://conbench.ursa.dev//api/login/ {"email": "arm64-m6g-linux-compute@arrow-bci.com", "password": "[REDACTED]"}
[230530-10:16:52.808] [23776] [benchadapt.adapters] INFO: Posting results to conbench
[230530-10:16:52.808] [23776] [benchclients.logging] DEBUG: POST https://conbench.ursa.dev//api/benchmarks/ {"run_name": "commit: 431785f3062199b2b9052902b67492b933744833", "run_id": "6aa006223557414599304185193f479b", "batch_id": "62d92f7eca7b4a369ffad709103b1968", "run_reason": "commit", "timestamp": "2023-05-30T10:08:06.728376+00:00", "stats": {"data": [1136662006.0442507], "unit": "B/s", "times": [922476.1200527315], "time_unit": "ns", "iterations": 1}, "tags": {"params": "1048576/0", "name": "FilterFSLInt64FilterNoNulls", "suite": "arrow-compute-vector-selection-benchmark", "source": "cpp-micro"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.3.0"}, "machine_info": {"name": "arm64-m6g-linux-compute", "os_name": "Linux", "os_version": "4.14.248-189.473.amzn2.aarch64-aarch64-with-glibc2.17", "architecture_name": "aarch64", "kernel_name": "4.14.248-189.473.amzn2.aarch64", "memory_bytes": "64424509440", "cpu_model_name": "Neoverse-N1", "cpu_core_count": "16", "cpu_thread_count": "16", "cpu_l1d_cache_bytes": "65536", "cpu_l1i_cache_bytes": "65536", "cpu_l2_cache_bytes": "1048576", "cpu_l3_cache_bytes": "33554432", "cpu_frequency_max_hz": "0", "gpu_count": "0", "gpu_product_names": []}, "context": {"benchmark_language": "C++", "arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always"}, "github": {"repository": "https://github.com/apache/arrow", "pr_number": null, "commit": "431785f3062199b2b9052902b67492b933744833"}}
[230530-10:17:52.813] [23776] [benchclients.logging] ERROR: Failed request: POST https://conbench.ursa.dev//api/benchmarks/ {"run_name": "commit: 431785f3062199b2b9052902b67492b933744833", "run_id": "6aa006223557414599304185193f479b", "batch_id": "62d92f7eca7b4a369ffad709103b1968", "run_reason": "commit", "timestamp": "2023-05-30T10:08:06.728376+00:00", "stats": {"data": [1136662006.0442507], "unit": "B/s", "times": [922476.1200527315], "time_unit": "ns", "iterations": 1}, "tags": {"params": "1048576/0", "name": "FilterFSLInt64FilterNoNulls", "suite": "arrow-compute-vector-selection-benchmark", "source": "cpp-micro"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.3.0"}, "machine_info": {"name": "arm64-m6g-linux-compute", "os_name": "Linux", "os_version": "4.14.248-189.473.amzn2.aarch64-aarch64-with-glibc2.17", "architecture_name": "aarch64", "kernel_name": "4.14.248-189.473.amzn2.aarch64", "memory_bytes": "64424509440", "cpu_model_name": "Neoverse-N1", "cpu_core_count": "16", "cpu_thread_count": "16", "cpu_l1d_cache_bytes": "65536", "cpu_l1i_cache_bytes": "65536", "cpu_l2_cache_bytes": "1048576", "cpu_l3_cache_bytes": "33554432", "cpu_frequency_max_hz": "0", "gpu_count": "0", "gpu_product_names": []}, "context": {"benchmark_language": "C++", "arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always"}, "github": {"repository": "https://github.com/apache/arrow", "pr_number": null, "commit": "431785f3062199b2b9052902b67492b933744833"}}
[230530-10:17:52.814] [23776] [benchclients.logging] ERROR: Response content: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2971

[230529-14:37:55.650] [1566763] [benchclients.logging] DEBUG: POST https://conbench.ursa.dev//api/benchmarks/ {"run_name": "commit: 1951a1ae69590ad58d97f6be929fa14485f81f42", "run_id": "a12fa29ff1a0497ab697dba114d292d9", "batch_id": "998ec7728b43463494eb2b28a694e275", "run_reason": "commit", "timestamp": "2023-05-29T19:34:13.574195+00:00", "stats": {"data": [1747369182.4928334], "unit": "B/s", "times": [300071.70600296854], "time_unit": "ns", "iterations": 1}, "tags": {"params": "<Round, FloatType, RoundMode::DOWN>/size:524288/inverse_null_proportion:0", "name": "RoundArrayBenchmark", "suite": "arrow-compute-scalar-round-benchmark", "source": "cpp-micro"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.3.0"}, "machine_info": {"name": "ursa-thinkcentre-m75q", "os_name": "Linux", "os_version": "5.15.0-71-generic-x86_64-with-glibc2.10", "architecture_name": "x86_64", "kernel_name": "5.15.0-71-generic", "memory_bytes": "16106127360", "cpu_model_name": "AMD Ryzen 5 PRO 4650GE with Radeon Graphics", "cpu_core_count": "6", "cpu_thread_count": "6", "cpu_l1d_cache_bytes": "196608", "cpu_l1i_cache_bytes": "196608", "cpu_l2_cache_bytes": "3145728", "cpu_l3_cache_bytes": "8388608", "cpu_frequency_max_hz": "3300000000", "gpu_count": "0", "gpu_product_names": []}, "context": {"benchmark_language": "C++", "arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /var/lib/buildkite-agent/miniconda3/envs/arrow-commit/include -fdiagnostics-color=always"}, "github": {"repository": "https://github.com/apache/arrow", "pr_number": null, "commit": "1951a1ae69590ad58d97f6be929fa14485f81f42"}}
[230529-14:37:55.876] [1566763] [benchclients.logging] ERROR: Failed request: POST https://conbench.ursa.dev//api/benchmarks/ {"run_name": "commit: 1951a1ae69590ad58d97f6be929fa14485f81f42", "run_id": "a12fa29ff1a0497ab697dba114d292d9", "batch_id": "998ec7728b43463494eb2b28a694e275", "run_reason": "commit", "timestamp": "2023-05-29T19:34:13.574195+00:00", "stats": {"data": [1747369182.4928334], "unit": "B/s", "times": [300071.70600296854], "time_unit": "ns", "iterations": 1}, "tags": {"params": "<Round, FloatType, RoundMode::DOWN>/size:524288/inverse_null_proportion:0", "name": "RoundArrayBenchmark", "suite": "arrow-compute-scalar-round-benchmark", "source": "cpp-micro"}, "info": {"arrow_version": "13.0.0-SNAPSHOT", "arrow_compiler_id": "GNU", "arrow_compiler_version": "11.3.0"}, "machine_info": {"name": "ursa-thinkcentre-m75q", "os_name": "Linux", "os_version": "5.15.0-71-generic-x86_64-with-glibc2.10", "architecture_name": "x86_64", "kernel_name": "5.15.0-71-generic", "memory_bytes": "16106127360", "cpu_model_name": "AMD Ryzen 5 PRO 4650GE with Radeon Graphics", "cpu_core_count": "6", "cpu_thread_count": "6", "cpu_l1d_cache_bytes": "196608", "cpu_l1i_cache_bytes": "196608", "cpu_l2_cache_bytes": "3145728", "cpu_l3_cache_bytes": "8388608", "cpu_frequency_max_hz": "3300000000", "gpu_count": "0", "gpu_product_names": []}, "context": {"benchmark_language": "C++", "arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /var/lib/buildkite-agent/miniconda3/envs/arrow-commit/include -fdiagnostics-color=always"}, "github": {"repository": "https://github.com/apache/arrow", "pr_number": null, "commit": "1951a1ae69590ad58d97f6be929fa14485f81f42"}}
[230529-14:37:55.876] [1566763] [benchclients.logging] ERROR: Response content: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>

jonkeane commented 1 year ago

More circumstantial evidence that this is caused by semi-random 50x responses: the EC2 jobs run only a very small number of benchmarks and haven't seen this. It's not surprising that we see more failures on jobs that POST more times during their run.
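That intuition can be made concrete with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that every POST fails independently with the same probability and that a single unrecovered failure fails the whole job. The function name and the rates in the loop are hypothetical, not measured from these builds.

```python
def job_failure_probability(p_fail: float, n_posts: int) -> float:
    """Probability that at least one of n_posts independent POSTs fails.

    Assumes each POST fails with the same probability p_fail and that
    failures are independent, which is a simplification.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if n_posts < 0:
        raise ValueError("n_posts must be non-negative")
    # P(at least one failure) = 1 - P(all n_posts succeed)
    return 1.0 - (1.0 - p_fail) ** n_posts


if __name__ == "__main__":
    # Hypothetical per-request failure rate and POST counts per job.
    for n in (5, 100, 2000):
        print(n, round(job_failure_probability(0.001, n), 3))
```

Under this model, a job that POSTs thousands of results is far more exposed than one that POSTs a handful, even when the per-request failure rate is the same.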

jgehrcke commented 1 year ago
FAILED C++ cpp-micro 1 line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://conbench.ursa.dev//api/benchmarks/
Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2988#01887053-52f4-4a95-a6bc-73f454342c75/6-9849

PASSED Python dataset-read 0:03:54.720217
PASSED Python dataset-select 0:00:04.690246
FAILED C++ cpp-micro 1 line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway

(https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2758#01886fee-aee6-4808-bd1f-a088bc6f136c/6-7624)

austin3dickey commented 1 year ago

Updated the title to include 502s (see my original post).

austin3dickey commented 1 year ago

For data-collecting purposes, here are all instances I could find from the last 24 hours. There are 21 total.

requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://conbench.ursa.dev//api/login/

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2974#01886dee-0d2b-4812-a38b-de1eaf07d6d8/6-4984
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2975#01886e93-7be1-4d09-b2f3-978694557ce8/6-4984
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2979#0188715c-b97b-46f8-a4cc-6b92c473c170/6-4984
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2985#01886e78-9dae-43e9-b794-bfb6f5f9f5d4/6-9447
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2986#01886f23-5cee-4a4f-9230-590ed1c767d9/6-9677
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2751#01886d44-2abc-48d6-8789-322f0022c1d5/6-7531
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2767#01886d48-f6f5-4425-8687-e389a3b114ac/6-7641

requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://conbench.ursa.dev//api/benchmarks/

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2987#01886fb8-bd27-406d-bf55-e52e6dcdfcd1/6-9810
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2988#01887053-52f4-4a95-a6bc-73f454342c75/6-9851
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2990#01887178-6f4a-4c16-ad89-41281937c550/6-9851
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2758#01886fee-aee6-4808-bd1f-a088bc6f136c/6-7626
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2766#01886d42-da79-4488-91b7-fe7782fa71c7/6-7767
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2770#01886f41-a207-4d47-9d11-8565743befef/6-7756
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2771#01886f41-a95c-4323-b4ae-6d7abc7526c4/6-7746

requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/benchmarks/ (Caused by ResponseError('too many 503 error responses'))

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2311#01886dee-21dc-42bc-a4e5-f15f87d02932/6-5425
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2312#01886df7-be0b-4442-a684-c16ff5038d68/6-4900
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2313#01886e52-8916-41ea-8d6e-625dd1d27dd3/6-7322
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2314#01886e52-8672-43f1-98d5-7f8465d8e222/6-7460
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2753#01886e40-3e30-45aa-8034-6c3c2f120526/6-4961
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2754#01886e52-77f7-48e6-ae2a-ba99ea3108c9/6-4144

requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://conbench.ursa.dev//api/benchmarks/

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2754#01886e52-77f7-48e6-ae2a-ba99ea3108c9/6-4142
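A tally like this can be automated rather than counted by hand. The following is a minimal sketch, assuming the error lines have already been extracted from the build logs into a list of strings. The helper names `classify` and `tally` are hypothetical, and the regular expressions only cover the two message shapes quoted in this thread.

```python
import re
from collections import Counter

# Matches "504 Server Error" (requests HTTPError) or
# "too many 503 error responses" (urllib3 retry exhaustion).
STATUS_RE = re.compile(r"\b(\d{3}) Server Error|too many (\d{3}) error responses")
# Matches the API endpoint name, e.g. "login" or "benchmarks".
ENDPOINT_RE = re.compile(r"/api/([a-z]+)/")


def classify(line):
    """Return (status_code, endpoint) for an error line, or None if unrecognized."""
    status = STATUS_RE.search(line)
    endpoint = ENDPOINT_RE.search(line)
    if status is None or endpoint is None:
        return None
    code = status.group(1) or status.group(2)
    return int(code), endpoint.group(1)


def tally(lines):
    """Count recognized error lines by (status_code, endpoint)."""
    return Counter(c for c in map(classify, lines) if c is not None)


if __name__ == "__main__":
    sample = [
        "requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out "
        "for url: https://conbench.ursa.dev//api/login/",
        "requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', "
        "port=443): Max retries exceeded with url: /api/benchmarks/ "
        "(Caused by ResponseError('too many 503 error responses'))",
    ]
    for key, count in tally(sample).items():
        print(key, count)
```

Grouping by both status code and endpoint keeps the login failures separate from the result-posting failures, which is the same split used in the list above.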

austin3dickey commented 1 year ago

I just went through all the build logs of the last 24 hours again.

There were 0 (zero) instances of these symptoms!

I think that's acceptable enough to close this issue if you all agree.

austin3dickey commented 1 year ago

Since the last time I posted (about 67 hours ago), we saw 14 more jobs fail with 503-related errors.

All Buildkite links are here:
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2794#01887db8-03fa-4b0f-b5c1-3ef5b2f078dd
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2795#01888195-bd82-40f3-ad34-a6cb9317299e
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2798#01888a9e-a42e-4a4f-856d-186c4ce84c78
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2779#01887db8-038b-441d-8ffb-07aed434a950
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2780#01888195-ef5e-4c45-a60d-ae343a7f501f
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2338#01888195-f642-422c-88a2-08ac5e455494
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2339#01888313-49b1-4281-96cc-c71132fbdd5d
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2976#01888193-51a4-4e89-ab11-5971bf1ed1cf
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3012#01888193-7f77-4d01-ac20-1d9252139d0b
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2974#01887fd8-4cdb-435c-b8a5-9ba1781a8ac9
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2975#0188812b-f333-4964-bb65-8e9ecd26d3c6
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2976#0188827e-6cab-4967-9db8-4eebf8299bac
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3001#01887f98-72b4-4ee3-8452-61ea11b0af64
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3002#01888193-592b-4e7c-83dc-93f2ac25491b

I see: 10 counts of

requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/benchmarks/ (Caused by ResponseError('too many 503 error responses'))

10 counts of

requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/login/ (Caused by ResponseError('too many 503 error responses'))

5 counts of

benchclients.http.RetryingHTTPClientDeadlineReached: POST request to https://conbench.ursa.dev/api/login/: giving up after ~1770 s

and 2 jobs (1, 2) that timed out completely after about 6 hours (though they weren't running benchmarks that use RetryingHTTPClient).


By the way, it looks like RetryingHTTPClient was doing its job as best it could for the cpp-micro benchmarks. Here's a snippet from this log:

[230603-02:08:56.622] [358880] [benchclients.http] INFO: cycle 28 failed, wait for 60.0 s, deadline in 8.5 min
[230603-02:09:56.830] [358880] [benchclients.http] INFO: POST request to https://conbench.ursa.dev/api/login/: took 0.1953 s, response status code: 503
[230603-02:09:56.830] [358880] [benchclients.http] INFO: unexpected response. code: 503 (retryable), body bytes: <<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
</body>
</html>
 ...>
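
The behavior visible in that log (retry on a retryable status, wait a fixed interval, give up once a deadline is reached) can be sketched in a few lines. This is not the benchclients implementation, only an illustration of the general pattern. `post_with_deadline`, `DeadlineReached`, and the default values are assumptions chosen to loosely echo the log above, and `send` is a hypothetical callable that performs one request and returns its status code.

```python
import time

# Status codes treated as transient, as seen in this thread.
RETRYABLE = {502, 503, 504}


class DeadlineReached(Exception):
    """Raised when retries are abandoned because the deadline would be exceeded."""


def post_with_deadline(send, deadline_s=1800.0, wait_s=60.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or the deadline nears.

    send:       callable performing one request, returning an int status code
    deadline_s: total time budget in seconds (assumed value)
    wait_s:     fixed pause between attempts in seconds (assumed value)
    clock/sleep are injectable so the loop can be tested without real waiting.
    """
    start = clock()
    cycle = 0
    while True:
        cycle += 1
        status = send()
        if status not in RETRYABLE:
            return status
        elapsed = clock() - start
        # Give up if another wait would push us past the deadline.
        if elapsed + wait_s > deadline_s:
            raise DeadlineReached(
                f"giving up after ~{elapsed:.0f} s ({cycle} cycles)"
            )
        sleep(wait_s)
```

A deadline bounds the total time a job can spend retrying, which a fixed retry count does not do when individual requests are slow to time out.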
austin3dickey commented 1 year ago

I have not seen this symptom for a few days now, thanks to many improvements.