openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Changeset upload performance since July 11th #1118

Closed: mmd-osm closed this issue 1 month ago

mmd-osm commented 3 months ago

I'm wondering why changeset uploads have become significantly slower since July 11, while all other request types remained stable or even showed somewhat improved response times.

[screenshot: changeset upload response times since July 11]

Also, the average response size shows an unexpected drop in the same time interval.

[screenshot: average response size over the same period]

To me it seems something is working in a somewhat different way now, though I have no idea where these differences are coming from. Does CF add some additional latency that would show up in the logs now?

tomhughes commented 3 months ago

Well I think you've answered your own question haven't you? It's almost certainly a side effect of the CDN being in front of it.

mmd-osm commented 3 months ago

For client-side end-to-end measurements I would immediately agree with this. If I'm not mistaken, though, Prometheus collects response times from the Apache log files, at which point we should already be behind the CDN.

What I'm really trying to rule out is that CGImap started behaving in unexpected ways due to the added CDN. If the additional times in Prometheus can be fully attributed to CF and not to some oddity in CGImap, then that's all good for me.

Firefishy commented 3 months ago

Some data about the Cloudflare proxy we still have enabled:

pnorman commented 3 months ago

The average response size is what I find odd. Could single API calls be getting broken into chunks at the HTTP level, and could that be changing the response sizes?

Firefishy commented 3 months ago

Cloudflare does support transcoding origin traffic, which may be having an effect too.

Firefishy commented 3 months ago

We don't log the Content-Encoding response header by default, but I've enabled it briefly, and it looks like many of the /api/ responses are using Brotli compression. A tiny number are using gzip. Some are uncompressed.

This might account for the increased response time and decreased response sizes.
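
A quick way to tally those headers while the extra logging is enabled, as a rough sketch: it assumes the Content-Encoding value was appended as the last whitespace-separated field of each access-log line, and the log path is illustrative.

```python
# Tally Content-Encoding values for /api/ requests from an access log that
# temporarily logs the header. The field position and log path are assumptions;
# adjust them to the actual LogFormat in use.
from collections import Counter

counts = Counter()
with open("/var/log/apache2/access.log") as log:  # illustrative path
    for line in log:
        if "/api/" not in line:
            continue
        encoding = line.rsplit(None, 1)[-1].strip('"') or "-"
        counts[encoding] += 1  # e.g. "br", "gzip", or "-" for uncompressed

for encoding, n in counts.most_common():
    print(f"{encoding:10s} {n}")
```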

mmd-osm commented 3 months ago

CGImap writes fairly extensive log information, including the total time spent on a changeset upload. If those numbers haven't changed much (or they don't match the numbers Prometheus is showing), it could also mean that Apache is doing more work now.
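
A rough sketch of the kind of aggregation that would help here, assuming a simplified whitespace-separated log line of the form `<timestamp> <path> <time_ms> <bytes>` (the real CGImap log format may well differ, so field positions and the file path are placeholders):

```python
# Aggregate total/average response time and size for changeset uploads
# from an assumed, simplified log format: "<timestamp> <path> <time_ms> <bytes>".
total_ms = 0.0
total_bytes = 0
requests = 0
with open("cgimap.log") as log:  # illustrative path
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        path = fields[1]
        if "/changeset/" not in path or not path.endswith("/upload"):
            continue
        requests += 1
        total_ms += float(fields[2])
        total_bytes += int(fields[3])

if requests:
    print(f"requests: {requests}")
    print(f"avg response time: {total_ms / requests:.1f} ms")
    print(f"avg response size: {total_bytes / requests:.0f} bytes")
```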

Firefishy commented 3 months ago

Per cgimap logs:

| Date/Time | Total Response Time (ms) | Requests | Average Response Time per Request (ms) | Total Bytes | Average Bytes per Request |
| --- | --- | --- | --- | --- | --- |
| 11 July 2024 00:00 to 07:59 | 38640026 | 575229 | 67.1732927234197 | 117431152457 | 204146 |
| 18 July 2024 00:00 to 07:59 | 39596153 | 664252 | 59.610137417727 | 131793556948 | 198408 |

Apache is able to compress proxy responses.

mmd-osm commented 3 months ago

Thanks! Are these numbers filtered on changeset upload requests only? 60 ms seems a bit low. We had about 100 ms on average earlier on, according to Prometheus (as shown in the very first screenshot I posted above).

600k requests in 8 hours is also a bit high. I would expect at most 30k-60k requests, since we almost never see more than 2 requests/s for uploads (-> https://prometheus.openstreetmap.org/d/5rTT87FMk/web-site?orgId=1&from=1721260800000&to=1721289600000&viewPanel=12)
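
For reference, the raw numbers behind such a panel can also be pulled via the Prometheus HTTP API. A minimal sketch below; the metric name and label matcher are made up for illustration and would need to be replaced with whatever series the dashboard panel actually uses.

```python
# Query the Prometheus HTTP API for an upload request rate over a time window.
# The metric name and label are hypothetical placeholders.
import requests

PROM = "https://prometheus.openstreetmap.org"
query = 'rate(api_call_count_total{call="changeset_upload"}[5m])'  # hypothetical series

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": query,
        "start": "2024-07-18T00:00:00Z",
        "end": "2024-07-18T08:00:00Z",
        "step": "300",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(timestamp, value)
```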

Firefishy commented 3 months ago

> Thanks! Are these numbers filtered on changeset upload requests only? 60 ms seems a bit low. We had about 100 ms on average earlier on, according to Prometheus (as shown in the very first screenshot I posted above).
>
> 600k requests in 8 hours is also a bit high. I would expect at most 30k-60k requests, since we almost never see more than 2 requests/s for uploads (-> https://prometheus.openstreetmap.org/d/5rTT87FMk/web-site?orgId=1&from=1721260800000&to=1721289600000&viewPanel=12)

Sorry, I should have been clearer. It is all cgimap responses from spike-06 during those time windows.

Firefishy commented 3 months ago

Per cgimap log data for changeset uploads from spike-06:

| Date/Time | Total Response Time (ms) | Requests | Average Response Time per Request (ms) | Total Bytes | Average Bytes per Request |
| --- | --- | --- | --- | --- | --- |
| 11 July 2024 00:00 to 07:59 | 2527646 | 5738 | 440.50 | 16965297 | 2956 |
| 18 July 2024 00:00 to 07:59 | 2867255 | 4663 | 614.89 | 17545566 | 3762 |

Apache is able to compress proxy responses.

mmd-osm commented 3 months ago

Thanks a lot, this looks much better. Do you happen to have some stats for July 9 or July 10, maybe? The performance degradation started on July 10th at around 21:00 UTC. It would be very interesting to see some stats from before that point.

Firefishy commented 3 months ago

I enabled the tiered caching around 21:00 UTC on 11th July 2024.

Firefishy commented 3 months ago
[screenshot: 2024-07-18 14:24]
Firefishy commented 3 months ago

> Thanks a lot, this looks much better. Do you happen to have some stats for July 9 or July 10, maybe? The performance degradation started on July 10th at around 21:00 UTC. It would be very interesting to see some stats from before that point.

No. The earliest we currently have is July 11 from 00:00 UTC.

Firefishy commented 3 months ago

10th July 2024 @ 9pm UTC was the maintenance window. I enabled options which add some RAM overhead on the primary database server snap-01. The overhead is expected to be up to around 10%.

The server was also rebooted into a new kernel version which can affect performance.

mmd-osm commented 3 months ago

Good point. This would indicate that we're likely spending more time on database queries, and the slowdown is unrelated to CF.

The CGImap log files should include a breakdown per SQL statement. Maybe there is a small number of statements in there that is responsible for a major part of the runtime. We could then take a closer look at what's going on there.
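
As a sketch of that kind of breakdown, assuming the log contained one line per statement in a form like `<timestamp> SQL <elapsed_ms> <statement text>` (the actual CGImap log layout would need to be checked and the parsing adapted):

```python
# Group elapsed time by SQL statement text from hypothetical per-statement log
# lines of the form "<timestamp> SQL <elapsed_ms> <statement text...>".
from collections import defaultdict

elapsed_by_statement = defaultdict(float)
with open("cgimap.log") as log:  # illustrative path
    for line in log:
        fields = line.split(maxsplit=3)
        if len(fields) < 4 or fields[1] != "SQL":
            continue
        elapsed_by_statement[fields[3].strip()] += float(fields[2])

# Print the statements responsible for most of the runtime.
top = sorted(elapsed_by_statement.items(), key=lambda kv: kv[1], reverse=True)[:10]
for statement, ms in top:
    print(f"{ms:10.1f} ms  {statement[:120]}")
```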

mmd-osm commented 3 months ago

I had a look at the snap-01 PostgreSQL stats. For some reason we're seeing significantly more sequential scans since the reboot (5/s instead of 0.0x/s).

[chart: sequential scans per second on snap-01]

https://prometheus.osm.org/d/77k_bqFMz/postgresql?orgId=1&refresh=1m&var-instance=snap-01&var-datname=All&from=now-30d&to=now&viewPanel=24

The following chart suggests it's oauth_tokens, which is kind of odd since that table is used for OAuth 1.0a.

[chart: sequential scans by table, pointing at oauth_tokens]

https://prometheus.osm.org/d/ST-7bi5Gz/api-database?orgId=1&refresh=1m&var-instance=All&var-master=snap-01&var-slaves=All&from=now-30d&to=now&viewPanel=4

It's not clear whether this is related to our issue here with changeset uploads.
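
One way to confirm which tables are being scanned sequentially is to look at pg_stat_user_tables directly on the database host. A minimal sketch (connection parameters and database name are placeholders):

```python
# List the tables with the highest sequential-scan counts from PostgreSQL's
# statistics views. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="openstreetmap", user="readonly")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname, seq_scan, seq_tup_read, idx_scan
        FROM pg_stat_user_tables
        ORDER BY seq_scan DESC
        LIMIT 10
        """
    )
    for relname, seq_scan, seq_tup_read, idx_scan in cur.fetchall():
        print(f"{relname:30s} seq_scan={seq_scan} seq_tup_read={seq_tup_read} idx_scan={idx_scan}")
```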

mmd-osm commented 3 months ago

One thing to try out is to restart all 3 CGImap instances on spike-0[6-8] and check if things are improving.

On July 10th, the db host was switched twice (https://github.com/openstreetmap/chef/commit/f2e063ccae80b92c17712da609d8d74ffef08ba9 and https://github.com/openstreetmap/chef/commit/b7f67ceb48b5f0fc94e9a9db74ddcf4a98a50fe9), which might have left CGImap in some strange state.

Without seeing the log files, it's nearly impossible to figure out what's going on, so this is only a bit of a wild guess.

mmd-osm commented 2 months ago

After today's snap-01 restart, response times are slightly better again. Are we experiencing some issue with the storage, e.g. reduced write I/O, that is impacting heavy insert activity on the db? If so, I would close this issue, since the root cause would be known.

Firefishy commented 1 month ago

Is this ticket still relevant or can it be closed?

mmd-osm commented 1 month ago

I don't think the issue has been resolved. Average times for the upload are still quite a bit higher than before. It would be helpful to get some stats from the cgimap logfile and see where the time is being spent. Without further details, it's fairly difficult to improve the situation.

mmd-osm commented 1 month ago

snap-01 has just been restarted, upload seems better now:

[chart: changeset upload response times after the snap-01 restart]

@Firefishy : if the improvement is an expected outcome of the server maintenance, we can close this issue right away.

Firefishy commented 1 month ago

I replaced a faulty DIMM. The faulty memory location would have been remapped by the CPU and likely would have had a performance impact... but I am not 100% sure.

mmd-osm commented 4 weeks ago

> upload seems better now

That was probably a bit of an understatement.

Performance improvements are roughly in the 10x range, compared to last week.

Changeset uploads are even twice as fast as the respective pre-July runtimes (there were some performance-related code changes back in August).