Closed austin3dickey closed 1 year ago
More circumstantial evidence that this is semi-random and 50x-related: the ec2 jobs run only a very small number of benchmarks and haven't seen this either. It's not surprising that we see more failures on jobs that POST more often during their run.
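To put a rough number on that intuition, here is a back-of-the-envelope sketch. It assumes every POST fails independently with the same probability and that nothing is retried; neither assumption was measured in this thread, and the 1% rate below is purely illustrative.

```python
def job_failure_probability(p_post_failure: float, n_posts: int) -> float:
    """Chance that at least one of n_posts POSTs fails.

    Assumes independent failures with a fixed per-POST probability
    (a simplification; real 50x errors likely cluster in time).
    """
    if not 0.0 <= p_post_failure <= 1.0:
        raise ValueError("p_post_failure must be in [0, 1]")
    if n_posts < 0:
        raise ValueError("n_posts must be non-negative")
    # P(job fails) = 1 - P(every single POST succeeds)
    return 1.0 - (1.0 - p_post_failure) ** n_posts


if __name__ == "__main__":
    # Hypothetical per-POST failure rate of 1%.
    for n in (5, 50, 500):
        print(n, round(job_failure_probability(0.01, n), 3))
```

Under these assumptions the job-level failure rate climbs quickly with the number of POSTs, which would be consistent with jobs that post rarely almost never hitting the problem.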
FAILED C++ cpp-micro 1 line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://conbench.ursa.dev//api/benchmarks/
Traceback (most recent call last):
File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
PASSED Python dataset-read 0:03:54.720217
PASSED Python dataset-select 0:00:04.690246
FAILED C++ cpp-micro 1 line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway
Updated the title to include 502s (see my original post).
For data-collecting purposes, here are all instances I could find from the last 24 hours. There are 21 total.
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://conbench.ursa.dev//api/login/
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2974#01886dee-0d2b-4812-a38b-de1eaf07d6d8/6-4984
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2975#01886e93-7be1-4d09-b2f3-978694557ce8/6-4984
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2979#0188715c-b97b-46f8-a4cc-6b92c473c170/6-4984
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2985#01886e78-9dae-43e9-b794-bfb6f5f9f5d4/6-9447
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2986#01886f23-5cee-4a4f-9230-590ed1c767d9/6-9677
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2751#01886d44-2abc-48d6-8789-322f0022c1d5/6-7531
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2767#01886d48-f6f5-4425-8687-e389a3b114ac/6-7641
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://conbench.ursa.dev//api/benchmarks/
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2987#01886fb8-bd27-406d-bf55-e52e6dcdfcd1/6-9810
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2988#01887053-52f4-4a95-a6bc-73f454342c75/6-9851
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2990#01887178-6f4a-4c16-ad89-41281937c550/6-9851
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2758#01886fee-aee6-4808-bd1f-a088bc6f136c/6-7626
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2766#01886d42-da79-4488-91b7-fe7782fa71c7/6-7767
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2770#01886f41-a207-4d47-9d11-8565743befef/6-7756
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2771#01886f41-a95c-4323-b4ae-6d7abc7526c4/6-7746
requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/benchmarks/ (Caused by ResponseError('too many 503 error responses'))
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2311#01886dee-21dc-42bc-a4e5-f15f87d02932/6-5425
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2312#01886df7-be0b-4442-a684-c16ff5038d68/6-4900
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2313#01886e52-8916-41ea-8d6e-625dd1d27dd3/6-7322
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2314#01886e52-8672-43f1-98d5-7f8465d8e222/6-7460
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2753#01886e40-3e30-45aa-8034-6c3c2f120526/6-4961
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2754#01886e52-77f7-48e6-ae2a-ba99ea3108c9/6-4144
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://conbench.ursa.dev//api/benchmarks/
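For anyone repeating this kind of tally, a small script can group log lines by status code and endpoint instead of reading build logs by hand. This is only a sketch: the two regular expressions are written against the error lines quoted in this thread, and `classify` and `tally` are hypothetical helper names, not part of Conbench or benchclients.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Shape 1: "...HTTPError: 504 Server Error: Gateway Time-out for url: https://host//api/login/"
_HTTP_ERROR = re.compile(
    r"HTTPError: (?P<code>\d{3}) Server Error: .*? for url: \S*?(?P<path>/api/\w+/)"
)
# Shape 2: "...RetryError: ... url: /api/benchmarks/ (Caused by ResponseError('too many 503 error responses'))"
_RETRY_ERROR = re.compile(
    r"RetryError: .*? url: (?P<path>/api/\w+/) .*?too many (?P<code>\d{3}) error responses"
)


def classify(line: str) -> Optional[Tuple[int, str]]:
    """Return (status_code, api_path) for a recognized error line, else None."""
    for pattern in (_HTTP_ERROR, _RETRY_ERROR):
        match = pattern.search(line)
        if match:
            return int(match.group("code")), match.group("path")
    return None


def tally(lines: Iterable[str]) -> Counter:
    """Count recognized error lines per (status_code, api_path) signature."""
    return Counter(sig for sig in map(classify, lines) if sig is not None)


if __name__ == "__main__":
    import sys

    # Usage: python tally_errors.py < build.log
    for (code, path), count in sorted(tally(sys.stdin).items()):
        print(f"{code} {path}: {count}")
```

Lines that match neither shape (for example the `RetryingHTTPClientDeadlineReached` message) are ignored, so the patterns would need extending to cover other symptoms.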
I just went through all the build logs of the last 24 hours again.
There were 0 (zero) instances of these symptoms!
I think that's acceptable enough to close this issue if you all agree.
Since the last time I posted (about 67 hours ago), we saw 14 more jobs fail with 503-related errors.
I see: 10 counts of
requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/benchmarks/ (Caused by ResponseError('too many 503 error responses'))
10 counts of
requests.exceptions.RetryError: HTTPSConnectionPool(host='conbench.ursa.dev', port=443): Max retries exceeded with url: /api/login/ (Caused by ResponseError('too many 503 error responses'))
5 counts of
benchclients.http.RetryingHTTPClientDeadlineReached: POST request to https://conbench.ursa.dev/api/login/: giving up after ~1770 s
and 2 jobs (1, 2) completely timed out after about 6 hours (though they weren't using RetryingHTTPClient benchmarks).
By the way, it looks like RetryingHTTPClient was doing its job as best it could for the cpp-micro benchmarks. Here's a snippet from this log:
[230603-02:08:56.622] [358880] [benchclients.http] INFO: cycle 28 failed, wait for 60.0 s, deadline in 8.5 min
[230603-02:09:56.830] [358880] [benchclients.http] INFO: POST request to https://conbench.ursa.dev/api/login/: took 0.1953 s, response status code: 503
[230603-02:09:56.830] [358880] [benchclients.http] INFO: unexpected response. code: 503 (retryable), body bytes: <<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
</body>
</html>
...>
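The behavior visible in that snippet (retry on a retryable status, wait a fixed interval, give up once a deadline is reached) can be sketched roughly as follows. This is not the actual `benchclients.http.RetryingHTTPClient` code: the function name, the set of retryable codes, and the default deadline and wait values are all assumptions made for illustration.

```python
import time
from typing import Callable

# Assumed set of retryable gateway errors, matching the codes seen in this thread.
RETRYABLE_STATUS = frozenset({502, 503, 504})


class DeadlineReached(Exception):
    """Raised when the retry loop gives up (hypothetical name)."""


def post_with_deadline(
    send: Callable[[], int],
    deadline_s: float = 1800.0,  # assumed default, not taken from benchclients
    wait_s: float = 60.0,        # fixed wait between cycles, as in the log above
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or time runs out.

    send() performs one request attempt and returns the HTTP status code.
    """
    start = clock()
    cycle = 0
    while True:
        cycle += 1
        status = send()
        if status not in RETRYABLE_STATUS:
            return status
        elapsed = clock() - start
        # Stop if another full wait would reach or pass the deadline.
        if deadline_s - elapsed <= wait_s:
            raise DeadlineReached(
                f"giving up after ~{elapsed:.0f} s ({cycle} cycles)"
            )
        sleep(wait_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; a caller would pass a `send` that wraps the actual POST.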
I have not seen this symptom for a few days now, thanks to many improvements.
The benchmark runs have been unhealthy for about a week:
From a sample of logs, these failures appear to be due to 502s, 503s, or 504s when POSTing benchmark results to Conbench. It does not happen on every POST, but it happens often enough to fail most builds in aggregate. Here are some examples.
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2947
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2765
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2971