Closed: lonvia closed this 1 year ago
It should be, yes - we publish that last after the actual diff has been published.
AWS S3 object copies are atomic operations up to 5GB with read-after-write consistency.
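To illustrate the publish order being described, here is a minimal sketch using boto3: the diff is uploaded first, and the global state.txt is copied into place last, so a reader that sees the new state.txt can rely on the diff already existing. The bucket name is taken from the logs below, but the file names, staging keys and exact calls are assumptions for the example, not the actual publishing scripts.

```python
# Hedged sketch of the publish order: diff first, global state.txt last.
# Keys and local file names are illustrative.
import boto3

s3 = boto3.client("s3")
BUCKET = "osm-planet-eu-central-1"

# 1. Upload the diff and its per-sequence state file.
s3.upload_file("144.osc.gz", BUCKET,
               "planet/replication/minute/005/836/144.osc.gz")
s3.upload_file("144.state.txt", BUCKET,
               "planet/replication/minute/005/836/144.state.txt")

# 2. Only then update the global state file. A server-side copy below 5 GB
#    is a single atomic operation, so readers see either the old or the new
#    state.txt, never a partially written one.
s3.copy_object(
    CopySource={"Bucket": BUCKET,
                "Key": "planet/replication/minute/005/836/144.state.txt"},
    Bucket=BUCKET,
    Key="planet/replication/minute/state.txt",
)
```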
I am currently unsure where the issue might be.
I will download the S3 event logs and investigate.
The issue was caused by AWS S3 returning a 500 InternalError.
0f474499af5eca52a75b973ff124d6c5de371cb82e51ed089ac52d08b49a14a2 osm-planet-eu-central-1 [16/Nov/2023:10:52:35 +0000] 2605:bc80:3010:700::8cd3:a764 - ZH16QHVG5WA39ZSW REST.GET.OBJECT planet/replication/minute/005/836/144.osc.gz "GET /planet/replication/minute/005/836/144.osc.gz HTTP/1.1" 500 InternalError 282 - 2090 - "-" "Nominatim (pyosmium/3.5.0)" - ftbR6xrHyxr/l7nUEAOJuhB7HC3zyR/TQVOPNupr/SOrSr+kxb5YW1xkrCqbYa3AXfUit796pD8= - ECDHE-RSA-AES128-GCM-SHA256 - osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com TLSv1.2 - -
Over a sample period of a few hours, 0.00182767% of requests to AWS S3 returned 5xx errors. AWS has some documentation: https://repost.aws/knowledge-center/http-5xx-errors-s3
I will likely need to open a support ticket with AWS to investigate the issues.
I seriously doubt you'll get anything useful from support anyway - that all reads very much like "yeah, random noise errors happen occasionally, just retry".
I have opened a support request with AWS.
"It's a best practice to build retry logic into applications that make requests to Amazon S3." - AWS tech support.
Although elevated 500 error rates are an issue, they are right that retries are a best practice.
Do we have metrics on the 5xx errors?
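As an illustration of that advice, here is a minimal retry sketch using the `requests` library with urllib3's `Retry`. The retry count, backoff and URL path are assumptions chosen for the example, not what pyosmium actually implements.

```python
# Minimal sketch of client-side retry logic for transient S3 5xx errors.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                      # retry up to three times
    backoff_factor=1,             # roughly 1s, 2s, 4s between attempts
    status_forcelist=[500, 503],  # the codes seen in the S3 access logs
    allowed_methods=["GET"],
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# Path is illustrative; any replication object would be fetched the same way.
resp = session.get(
    "https://osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/"
    "planet/replication/minute/state.txt",
    timeout=60,
)
resp.raise_for_status()
```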
I've published a new release of pyosmium that fixes the handling of non-200 responses and adds a transparent retry for 500 and 503. That should fix things for me. Feel free to close.
Over a 1-hour averaging period, the highest 5xx error rate was 0.014%. Typically it is 0%, and it goes up to 0.002% fairly often. AWS's SLA for S3 Intelligent-Tiering guarantees a 99.0% success rate over the month, calculated in 5-minute intervals. Our monthly error rate is 0.00017%, which is well within the SLA; a success rate of 99.9998% is as good as we can reasonably expect. This means a typical server fetching one state file and one diff every minute will see 1-2 errors per year.
S3 Standard is 99.9%, but I doubt it matters here, since we're at almost 6 nines.
Note: the query used to check the SLA is avg_over_time(sum((aws_s3_5xx_errors_sum{name=~"arn:aws:s3:::osm-planet-eu-central-1"} / aws_s3_get_requests_sum{name=~"arn:aws:s3:::osm-planet-eu-central-1"})) by (name) [$__range:]), run as an instant query.
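For completeness, a back-of-envelope check of the "1-2 errors per year" figure, using the 0.00017% monthly error rate and the request pattern (one state file plus one diff per minute) quoted above:

```python
# Rough check of the expected number of 5xx responses per year.
error_rate = 0.00017 / 100          # fraction of requests returning 5xx
requests_per_minute = 2             # one state file + one diff
requests_per_year = requests_per_minute * 60 * 24 * 365

expected_errors = error_rate * requests_per_year
print(f"{expected_errors:.1f} expected 5xx responses per year")  # ~1.8
```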
pyosmium 3.7 is now available in Debian testing. Can we backport it to our PPA?
Initial attempt failed as it needs a newer libosmium, so I'm backporting that now.
A bigger problem is that it needs python3-pytest-httpserver, which isn't available in Ubuntu until 22.10, and I don't really want to be trying to backport that...
This is only a test dependency. I don't suppose you would want to disable tests?
I thought it was part of a larger pytest package, but I realised it's actually completely separate, so I'm trying a backport from 22.10 to see if that works.
Backport is now complete.
Since the move to AWS, I've now seen twice that pyosmium reports a minutely diff as broken:
I presume that means the file was only partially downloaded. Restart the process and the download is fine.
pyosmium assumes that the global state.txt file is written only after the actual diffs are published. Is this still guaranteed with the AWS files?
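To make the assumption concrete, here is a rough sketch of the consumer side: the sequence number is read from the global state.txt first, and the diff for that sequence is fetched afterwards, so the diff must already exist by the time state.txt is updated. The URL layout mirrors the replication paths in the logs above; this is not pyosmium's actual code.

```python
# Hedged sketch of the consumer-side ordering assumption.
import requests

BASE = ("https://osm-planet-eu-central-1.s3.dualstack.eu-central-1."
        "amazonaws.com/planet/replication/minute")

def diff_url(seq: int) -> str:
    """Map a sequence number to its .osc.gz path, e.g. 5836144 -> 005/836/144."""
    s = f"{seq:09d}"
    return f"{BASE}/{s[0:3]}/{s[3:6]}/{s[6:9]}.osc.gz"

# 1. Read the current sequence number from the global state file.
state = requests.get(f"{BASE}/state.txt", timeout=60)
state.raise_for_status()
seq = int(next(line.split("=")[1]
               for line in state.text.splitlines()
               if line.startswith("sequenceNumber")))

# 2. Fetch the diff for that sequence. If state.txt were published before
#    the diff, this request could fail or return a partial object.
diff = requests.get(diff_url(seq), timeout=60)
diff.raise_for_status()
```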