openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Minutely diffs occasionally return errors #992

Closed: lonvia closed this issue 1 year ago

lonvia commented 1 year ago

Since the move to AWS, I've now seen twice that pyosmium reports that a minutely diff is broken:

2023-11-16 10:52:29: Using project directory: /srv/nominatim.openstreetmap.org/planet-project
Traceback (most recent call last):
  File "/usr/local/bin/nominatim", line 12, in <module>
    exit(cli.nominatim(module_dir='/usr/local/lib/nominatim/module',
  File "/usr/local/lib/nominatim/lib-python/nominatim/cli.py", line 226, in nominatim
    return get_set_parser().run(**kwargs)
  File "/usr/local/lib/nominatim/lib-python/nominatim/cli.py", line 121, in run
    return args.command.run(args)
  File "/usr/local/lib/nominatim/lib-python/nominatim/clicmd/replication.py", line 199, in run
    self._update(args)
  File "/usr/local/lib/nominatim/lib-python/nominatim/clicmd/replication.py", line 154, in _update
    state = replication.update(dsn, params, socket_timeout=args.socket_timeout)
  File "/usr/local/lib/nominatim/lib-python/nominatim/tools/replication.py", line 128, in update
    endseq = repl.apply_diffs(outhandler, startseq + 1,
  File "/usr/lib/python3/dist-packages/osmium/replication/server.py", line 171, in apply_diffs
    diffs = self.collect_diffs(start_id, max_size)
  File "/usr/lib/python3/dist-packages/osmium/replication/server.py", line 137, in collect_diffs
    left_size -= rd.add_buffer(diffdata, self.diff_type)
RuntimeError: gzip error: inflate failed: incorrect header check

I presume that means the file was only partially downloaded. Restarting the process makes the download succeed.

pyosmium assumes that the global state.txt file is written only after the actual diffs are published. Is this still guaranteed with the AWS files?
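For reference, a minimal sketch of the consumer-side ordering assumption, not pyosmium's actual implementation: the client reads state.txt first to learn the newest sequence number and only then fetches the corresponding .osc.gz, so the diff must already be complete by the time the state file advertises it. The `sequence_to_path` helper and the direct `requests` calls below are illustrative.

```python
# A minimal sketch of the assumption above, not pyosmium's actual code.
# sequence_to_path() is an illustrative helper; the base URL is the public
# minutely replication endpoint.
import gzip
import requests

BASE = "https://planet.openstreetmap.org/replication/minute"

def sequence_to_path(seq: int) -> str:
    """Turn a sequence number like 5836144 into the 005/836/144 layout."""
    s = f"{seq:09d}"
    return f"{s[0:3]}/{s[3:6]}/{s[6:9]}"

def fetch_latest_diff() -> bytes:
    # 1. Read the global state file to learn the newest sequence number.
    state = requests.get(f"{BASE}/state.txt", timeout=60)
    state.raise_for_status()
    seq = int(next(line.split("=", 1)[1]
                   for line in state.text.splitlines()
                   if line.startswith("sequenceNumber")))

    # 2. Fetch the diff for that sequence. If state.txt really is published
    #    only after the diff, this object must already exist and be complete.
    diff = requests.get(f"{BASE}/{sequence_to_path(seq)}.osc.gz", timeout=60)
    diff.raise_for_status()
    return gzip.decompress(diff.content)
```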

tomhughes commented 1 year ago

It should be, yes - we publish the state file last, after the diff itself has been published.

Firefishy commented 1 year ago

AWS S3 object copies are atomic operations up to 5GB with read-after-write consistency.

I am currently unsure where the issue might be.

Firefishy commented 1 year ago

I will download the S3 event logs and investigate.

Firefishy commented 1 year ago

The issue was caused by AWS S3 returning a 500 InternalError.

0f474499af5eca52a75b973ff124d6c5de371cb82e51ed089ac52d08b49a14a2 osm-planet-eu-central-1 [16/Nov/2023:10:52:35 +0000] 2605:bc80:3010:700::8cd3:a764 - ZH16QHVG5WA39ZSW REST.GET.OBJECT planet/replication/minute/005/836/144.osc.gz "GET /planet/replication/minute/005/836/144.osc.gz HTTP/1.1" 500 InternalError 282 - 2090 - "-" "Nominatim (pyosmium/3.5.0)" - ftbR6xrHyxr/l7nUEAOJuhB7HC3zyR/TQVOPNupr/SOrSr+kxb5YW1xkrCqbYa3AXfUit796pD8= - ECDHE-RSA-AES128-GCM-SHA256 - osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com TLSv1.2 - -

0.00182767% of requests got 5xx errors from AWS S3 over a sample period of a few hours. AWS has some documentation: https://repost.aws/knowledge-center/http-5xx-errors-s3

I will likely need to open a support ticket with AWS to investigate the issue.

tomhughes commented 1 year ago

I seriously doubt you'll get anything useful from support anyway - that all reads very much like "yeah, random noise errors happen occasionally, just retry".

Firefishy commented 1 year ago

I have opened a support request with AWS.

Firefishy commented 1 year ago

"It's a best practice to build retry logic into applications that make requests to Amazon S3." - AWS tech support.

pnorman commented 1 year ago

Although elevated 500 error rates are an issue, they are right in that retries are a best practice.

Do we have metrics on the 5xx errors?

Firefishy commented 1 year ago

> Although elevated 500 error rates are an issue, they are right in that retries are a best practice.
>
> Do we have metrics on the 5xx errors?

Yes: https://prometheus.openstreetmap.org/d/c6fba6cd-e8f0-4c3e-8a62-7eafbf92cd71/amazon-s3?orgId=1&refresh=5m&var-name=arn:aws:s3:::osm-planet-eu-central-1&viewPanel=2&from=now-30d&to=now A few a day.

lonvia commented 1 year ago

I've published a new release of pyosmium that fixes the handling of non-200 responses and adds a transparent retry for 500 and 503. That should fix things for me. Feel free to close.
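For anyone curious what that looks like, here is a rough sketch of retry-on-5xx logic, assuming the requests library; this is illustrative, not the actual pyosmium 3.7 code.

```python
# Illustrative retry for transient 500/503 responses, not the pyosmium code.
import time
import requests

RETRY_STATUSES = {500, 503}

def get_with_retry(url: str, retries: int = 3, backoff: float = 1.0) -> requests.Response:
    """GET a URL, retrying transient 500/503 responses with exponential backoff."""
    for attempt in range(retries + 1):
        resp = requests.get(url, timeout=60)
        if resp.status_code not in RETRY_STATUSES or attempt == retries:
            resp.raise_for_status()  # surface any remaining error to the caller
            return resp
        time.sleep(backoff * 2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    raise AssertionError("unreachable")
```

requests users can get similar behaviour out of the box by mounting an HTTPAdapter with urllib3's Retry(status_forcelist=[500, 503]) instead of hand-rolling the loop.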

pnorman commented 1 year ago

On a 1h averaging period, the highest 5xx error rate was 0.014%. Typical is 0%, and it goes up to 0.002% fairly often. AWS's SLA for S3 Intelligent-Tiering is a 99.0% success rate over the month, calculated over 5-minute intervals. We're at 0.00017%, which is well within the SLA. 99.9998% is as good as we can reasonably expect. This means a typical server fetching one state file and one diff every minute will see 1-2 errors per year.

S3 Standard is 99.9%, but I doubt it matters here, since we're at almost 6 nines.

Note: the query for checking the SLA is avg_over_time(sum((aws_s3_5xx_errors_sum{name=~"arn:aws:s3:::osm-planet-eu-central-1"} / aws_s3_get_requests_sum{name=~"arn:aws:s3:::osm-planet-eu-central-1"})) by (name) [$__range:]), run as an instant query.
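As a back-of-the-envelope check of the 1-2 errors per year figure above (illustrative arithmetic only, using the observed 0.00017% error rate):

```python
# Rough check of the "1-2 errors per year" estimate above.
error_rate = 0.00017 / 100                 # observed 5xx fraction of requests
requests_per_minute = 2                    # one state.txt fetch + one diff fetch
requests_per_year = requests_per_minute * 60 * 24 * 365
expected_errors = requests_per_year * error_rate
print(f"{expected_errors:.1f} expected errors per year")  # ~1.8
```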

lonvia commented 1 year ago

pyosmium 3.7 is now available in Debian testing. Can we backport it to our PPA?

tomhughes commented 1 year ago

Initial attempt failed as it needs a newer libosmium so I'm backporting that now.

A bigger problem is that it needs python3-pytest-httpserver which isn't available in Ubuntu until 22.10 and I don't really want to be trying to backport that...

lonvia commented 1 year ago

This is only a test dependency. I don't suppose you would want to disable tests?

tomhughes commented 1 year ago

I thought it was part of a larger pytest package, but I realised it's actually completely separate, so I'm trying a backport from 22.10 to see if that works.

tomhughes commented 1 year ago

Backport is now complete.