openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Changeset upload performance since July 11th #1118

Closed: mmd-osm closed this issue 1 month ago

mmd-osm commented 3 months ago

I'm wondering why changeset uploads have become significantly slower since July 11, while all other request types remained stable or even showed somewhat improved response times.

[screenshot: changeset upload response times since July 11]

Also, the average response size shows an unexpected drop in the same time interval.

[screenshot: average response size over the same period]

To me it seems something is working in a somewhat different way now, though I have no idea where these differences are coming from. Does CF add some additional latency that would show up in the logs now?

tomhughes commented 3 months ago

Well I think you've answered your own question haven't you? It's almost certainly a side effect of the CDN being in front of it.

mmd-osm commented 3 months ago

For client-side end-to-end measurements I would immediately agree with this. If I'm not mistaken, though, Prometheus collects response times from the Apache log files, at which point we should already be behind the CDN.

What I'm really trying to rule out is that CGImap started behaving in unexpected ways due to the added CDN. If the additional times in Prometheus can be fully attributed to CF and not to some oddity in CGImap, then that's all good for me.

Firefishy commented 3 months ago

Some data about the Cloudflare proxy we still have enabled:

pnorman commented 3 months ago

The average response size is what I find odd. Could single API calls be getting broken into chunks at the HTTP level, and could that be changing the response sizes?

Firefishy commented 3 months ago

Cloudflare does support transcoding origin traffic, which may be having an effect too.

Firefishy commented 3 months ago

We don't log the Content-Encoding response header by default, but I've enabled it briefly, and it looks like many of the /api/ responses are using Brotli compression. A tiny number are using gzip. Some are uncompressed.

This might account for the increased response time and decreased response sizes.
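
A quick way to tally those headers while the extra logging is enabled, as a rough sketch: it assumes the Content-Encoding value was appended as the last whitespace-separated field of each access-log line, and the log path is illustrative.

```python
# Tally Content-Encoding values for /api/ requests from an access log that
# temporarily logs the header. The field position and log path are assumptions;
# adjust them to the actual LogFormat in use.
from collections import Counter

counts = Counter()
with open("/var/log/apache2/access.log") as log:  # illustrative path
    for line in log:
        if "/api/" not in line:
            continue
        encoding = line.rsplit(None, 1)[-1].strip('"') or "-"
        counts[encoding] += 1  # e.g. "br", "gzip", or "-" for uncompressed

for encoding, n in counts.most_common():
    print(f"{encoding:10s} {n}")
```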

mmd-osm commented 3 months ago

CGImap writes fairly extensive log information, including the total time spent on a changeset upload. If those numbers haven't changed much (or they don't match the numbers Prometheus is showing), it could also mean that Apache is doing more work now.
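
A rough sketch of the kind of aggregation that would help here, assuming a simplified whitespace-separated log line of the form `<timestamp> <path> <time_ms> <bytes>` (the real CGImap log format may well differ, so field positions and the file path are placeholders):

```python
# Aggregate total/average response time and size for changeset uploads
# from an assumed, simplified log format: "<timestamp> <path> <time_ms> <bytes>".
total_ms = 0.0
total_bytes = 0
requests = 0
with open("cgimap.log") as log:  # illustrative path
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        path = fields[1]
        if "/changeset/" not in path or not path.endswith("/upload"):
            continue
        requests += 1
        total_ms += float(fields[2])
        total_bytes += int(fields[3])

if requests:
    print(f"requests: {requests}")
    print(f"avg response time: {total_ms / requests:.1f} ms")
    print(f"avg response size: {total_bytes / requests:.0f} bytes")
```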

Firefishy commented 3 months ago

Per cgimap logs:

| Date/Time | Total Response Time (ms) | Requests | Average Response Time per Request (ms) | Total Bytes | Average Bytes per Request |
| --- | --- | --- | --- | --- | --- |
| 11 July 2024 00:00 to 07:59 | 38640026 | 575229 | 67.1732927234197 | 117431152457 | 204146 |
| 18 July 2024 00:00 to 07:59 | 39596153 | 664252 | 59.610137417727 | 131793556948 | 198408 |

Apache is able to compress proxy responses.

mmd-osm commented 3 months ago

Thanks! Are these numbers filtered on changeset upload requests only? 60 ms seems a bit low. We had about 100 ms on average earlier on, according to Prometheus (as shown in the very first screenshot I posted above).

600k requests in 8 hours is also a bit high. I would expect at most 30k-60k requests, since we almost never see more than 2 requests/s for uploads (-> https://prometheus.openstreetmap.org/d/5rTT87FMk/web-site?orgId=1&from=1721260800000&to=1721289600000&viewPanel=12)
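
For reference, the raw numbers behind such a panel can also be pulled via the Prometheus HTTP API. A minimal sketch below; the metric name and label matcher are made up for illustration and would need to be replaced with whatever series the dashboard panel actually uses.

```python
# Query the Prometheus HTTP API for an upload request rate over a time window.
# The metric name and label are hypothetical placeholders.
import requests

PROM = "https://prometheus.openstreetmap.org"
query = 'rate(api_call_count_total{call="changeset_upload"}[5m])'  # hypothetical series

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": query,
        "start": "2024-07-18T00:00:00Z",
        "end": "2024-07-18T08:00:00Z",
        "step": "300",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(timestamp, value)
```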

Firefishy commented 3 months ago

> Thanks! Are these numbers filtered on changeset upload requests only? 60 ms seems a bit low. We had about 100 ms on average earlier on, according to Prometheus (as shown in the very first screenshot I posted above).
>
> 600k requests in 8 hours is also a bit high. I would expect at most 30k-60k requests, since we almost never see more than 2 requests/s for uploads (-> https://prometheus.openstreetmap.org/d/5rTT87FMk/web-site?orgId=1&from=1721260800000&to=1721289600000&viewPanel=12)

Sorry, I should have been clearer. It is all cgimap responses from spike-06 during those time windows.

Firefishy commented 3 months ago

Per cgimap log data for changeset uploads from spike-06:

| Date/Time | Total Response Time (ms) | Requests | Average Response Time per Request (ms) | Total Bytes | Average Bytes per Request |
| --- | --- | --- | --- | --- | --- |
| 11 July 2024 00:00 to 07:59 | 2527646 | 5738 | 440.50 | 16965297 | 2956 |
| 18 July 2024 00:00 to 07:59 | 2867255 | 4663 | 614.89 | 17545566 | 3762 |

Apache is able to compress proxy responses.

mmd-osm commented 3 months ago

Thanks a lot, this looks much better. Do you happen to have some stats for July 9 or July 10, maybe? The performance degradation started on July 10th at around 21:00 UTC. It would be very interesting to see some stats from before that point.

Firefishy commented 3 months ago

I enabled the tiered caching around 21:00 UTC on 11th July 2024.

Firefishy commented 3 months ago
[screenshot: 2024-07-18 14:24]
Firefishy commented 3 months ago

> Thanks a lot, this looks much better. Do you happen to have some stats for July 9 or July 10, maybe? The performance degradation started on July 10th at around 21:00 UTC. It would be very interesting to see some stats from before that point.

No. The earliest we currently have is July 11 from 00:00 UTC.

Firefishy commented 3 months ago

10th July 2024 @ 9pm UTC was the maintenance window. I enabled options which add some RAM overhead on the primary database server snap-01. The overhead is expected to be up to around 10%.

The server was also rebooted into a new kernel version which can affect performance.

mmd-osm commented 3 months ago

Good point. This would indicate that we're likely spending more time on database queries, and the slowdown is unrelated to CF.

The CGImap log files should include a breakdown per SQL statement. Maybe there is a small number of statements in there that is responsible for a major part of the runtime. We could then take a closer look at what's going on there.
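
As a sketch of that kind of breakdown, assuming the log contained one line per statement in a form like `<timestamp> SQL <elapsed_ms> <statement text>` (the actual CGImap log layout would need to be checked and the parsing adapted):

```python
# Group elapsed time by SQL statement text from hypothetical per-statement log
# lines of the form "<timestamp> SQL <elapsed_ms> <statement text...>".
from collections import defaultdict

elapsed_by_statement = defaultdict(float)
with open("cgimap.log") as log:  # illustrative path
    for line in log:
        fields = line.split(maxsplit=3)
        if len(fields) < 4 or fields[1] != "SQL":
            continue
        elapsed_by_statement[fields[3].strip()] += float(fields[2])

# Print the statements responsible for most of the runtime.
top = sorted(elapsed_by_statement.items(), key=lambda kv: kv[1], reverse=True)[:10]
for statement, ms in top:
    print(f"{ms:10.1f} ms  {statement[:120]}")
```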

mmd-osm commented 3 months ago

I had a look at the snap-01 PostgreSQL stats. For some reason we're seeing significantly more sequential scans since the reboot (5/s instead of 0.0x/s).

[chart: sequential scans per second on snap-01]

https://prometheus.osm.org/d/77k_bqFMz/postgresql?orgId=1&refresh=1m&var-instance=snap-01&var-datname=All&from=now-30d&to=now&viewPanel=24

The following chart suggests it's oauth_tokens, which is kind of odd since that table is used for OAuth 1.0a.

[chart: sequential scans by table, pointing at oauth_tokens]

https://prometheus.osm.org/d/ST-7bi5Gz/api-database?orgId=1&refresh=1m&var-instance=All&var-master=snap-01&var-slaves=All&from=now-30d&to=now&viewPanel=4

It's not clear whether this is related to our issue here with changeset uploads.
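
One way to confirm which tables are being scanned sequentially is to look at pg_stat_user_tables directly on the database host. A minimal sketch (connection parameters and database name are placeholders):

```python
# List the tables with the highest sequential-scan counts from PostgreSQL's
# statistics views. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="openstreetmap", user="readonly")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname, seq_scan, seq_tup_read, idx_scan
        FROM pg_stat_user_tables
        ORDER BY seq_scan DESC
        LIMIT 10
        """
    )
    for relname, seq_scan, seq_tup_read, idx_scan in cur.fetchall():
        print(f"{relname:30s} seq_scan={seq_scan} seq_tup_read={seq_tup_read} idx_scan={idx_scan}")
```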

mmd-osm commented 3 months ago

One thing to try out is to restart all 3 CGImap instances on spike-0[6-8] and check if things are improving.

On July 10th, the db host was switched twice (https://github.com/openstreetmap/chef/commit/f2e063ccae80b92c17712da609d8d74ffef08ba9 and https://github.com/openstreetmap/chef/commit/b7f67ceb48b5f0fc94e9a9db74ddcf4a98a50fe9), which might have left CGImap in some strange state.

Without seeing the log files, it's nearly impossible to figure out what's going on, so this is only a bit of a wild guess.

mmd-osm commented 2 months ago

After today's snap-01 restart, response times are slightly better again. Are we experiencing some issue with the storage, e.g. reduced write I/O, that is impacting heavy insert activity on the db? If so, I would close this issue, since the root cause would be known.

Firefishy commented 1 month ago

Is this ticket still relevant or can it be closed?

mmd-osm commented 1 month ago

I don't think the issue has been resolved. Average times for the upload are still quite a bit higher than before. It would be helpful to get some stats from the cgimap logfile and see where the time is being spent. Without further details, it's fairly difficult to improve the situation.

mmd-osm commented 1 month ago

snap-01 has just been restarted, upload seems better now:

[chart: changeset upload response times after the snap-01 restart]

@Firefishy : if the improvement is an expected outcome of the server maintenance, we can close this issue right away.

Firefishy commented 1 month ago

I replaced a faulty DIMM. The faulty memory location would have been remapped by the CPU and likely would have had a performance impact... but I am not 100% sure.

mmd-osm commented 4 weeks ago

> upload seems better now

That was probably a bit of an understatement.

Performance improvements are roughly in the 10x range, compared to last week.

Changeset uploads are even twice as fast as the respective pre-July runtimes (there were some performance-related code changes back in August).