Closed mmd-osm closed 1 month ago
Well I think you've answered your own question haven't you? It's almost certainly a side effect of the CDN being in front of it.
For client-side end-to-end measurements, I would immediately agree. If I'm not mistaken, Prometheus collects response times from Apache log files, by which point we should already be behind the CDN.
What I'm really trying to rule out is that CGImap started behaving in unexpected ways due to the added CDN. If the additional times in Prometheus can be fully attributed to CF and not to some oddity in CGImap, then that's all good for me.
Some data about the Cloudflare proxy we still have enabled:
`/api/*` is forced to bypass the cache. The average response size is what I find odd. Could single API calls be broken into chunks at the HTTP level, and could that change the response sizes?
Cloudflare does support transcoding origin traffic, which may be having an effect too.
We don't log the `Content-Encoding` response header by default, but I've enabled it briefly, and it looks like many of the `/api/` responses are using Brotli compression. A tiny number are using gzip. Some are uncompressed.
This might account for the increased response time and decreased response sizes.
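To check how common each encoding actually is, one could tally the logged header values. This is only a sketch: it assumes a hypothetical Apache LogFormat that appends `%{Content-Encoding}o` as a final quoted field, which is not necessarily how the logging was set up here.

```python
# Tally Content-Encoding values for /api/ responses from an Apache access log,
# assuming (hypothetically) the header was logged as the final quoted field.
import re
from collections import Counter

LINE_RE = re.compile(r'"(?P<method>\S+) (?P<path>\S+) [^"]*".*"(?P<encoding>[^"]*)"$')

def tally_encodings(lines):
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or not m.group("path").startswith("/api/"):
            continue
        counts[m.group("encoding") or "identity"] += 1
    return counts

# Made-up sample lines, just to illustrate the assumed format:
sample = [
    '1.2.3.4 - - [18/Jul/2024:00:00:01 +0000] "GET /api/0.6/map HTTP/1.1" 200 1234 "-" "agent" "br"',
    '1.2.3.4 - - [18/Jul/2024:00:00:02 +0000] "POST /api/0.6/changeset/1/upload HTTP/1.1" 200 567 "-" "agent" "gzip"',
    '1.2.3.4 - - [18/Jul/2024:00:00:03 +0000] "GET /static/x HTTP/1.1" 200 89 "-" "agent" "-"',
]
print(tally_encodings(sample))  # Counter({'br': 1, 'gzip': 1})
```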
CGImap writes some extensive log information, including the total time spent for a changeset upload. If these numbers haven't changed much (or they don't match the numbers Prometheus is showing), this could also mean that Apache is doing some more work now.
Per cgimap logs:
Date/Time | Total Response Time (ms) | Requests | Avg Response Time (ms) | Total Bytes | Avg Bytes per Request |
---|---|---|---|---|---|
11 July 2024 00:00 to 07:59 | 38640026 | 575229 | 67.17 | 117431152457 | 204146 |
18 July 2024 00:00 to 07:59 | 39596153 | 664252 | 59.61 | 131793556948 | 198408 |
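The per-request averages in the table follow directly from the totals (assuming the response times are in milliseconds):

```python
# Sanity-check of the per-request averages from the cgimap log totals above
# (times assumed to be in ms, sizes in bytes).
windows = {
    "11 Jul 00:00-07:59": (38_640_026, 575_229, 117_431_152_457),
    "18 Jul 00:00-07:59": (39_596_153, 664_252, 131_793_556_948),
}
for label, (total_ms, requests, total_bytes) in windows.items():
    print(f"{label}: {total_ms / requests:.2f} ms/req, "
          f"{total_bytes / requests:.0f} bytes/req")
```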
Apache is able to compress proxy responses.
Thanks! Are these numbers filtered on changeset upload requests only? 60ms seems a bit low. We had about 100ms on average earlier on, according to Prometheus (as shown in the very first screenshot I posted above).
600k requests in 8 hours is also a bit high. I would expect at most 30k-60k requests, since we almost never see more than 2 requests/s for uploads (see https://prometheus.openstreetmap.org/d/5rTT87FMk/web-site?orgId=1&from=1721260800000&to=1721289600000&viewPanel=12)
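The 30k-60k expectation is just back-of-envelope arithmetic from the quoted 1-2 requests/s upload rate:

```python
# Expected number of changeset uploads in an 8-hour window,
# given the roughly 1-2 requests/s rate from the Prometheus panel.
window_seconds = 8 * 3600
for rate in (1, 2):
    print(f"{rate} req/s over 8 h -> {rate * window_seconds} requests")
```

Both bounds (28,800 and 57,600) are far below the ~600k requests in the table, which is consistent with the numbers covering all cgimap responses rather than uploads only.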
Sorry, I should have been clearer: it is all cgimap responses from spike-06 during the time windows.
Per cgimap log data for changeset uploads from spike-06:
Date/Time | Total Response Time (ms) | Requests | Avg Response Time (ms) | Total Bytes | Avg Bytes per Request |
---|---|---|---|---|---|
11 July 2024 00:00 to 07:59 | 2527646 | 5738 | 440.50 | 16965297 | 2956 |
18 July 2024 00:00 to 07:59 | 2867255 | 4663 | 614.89 | 17545566 | 3762 |
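Putting the two windows side by side quantifies the regression for uploads specifically:

```python
# Relative change between the two windows from the changeset-upload table
# above (average times in ms, average sizes in bytes).
before_ms, after_ms = 440.50, 614.89
before_bytes, after_bytes = 2956, 3762

time_increase = (after_ms / before_ms - 1) * 100
size_increase = (after_bytes / before_bytes - 1) * 100
print(f"avg upload time: +{time_increase:.0f}%")   # roughly +40%
print(f"avg response size: +{size_increase:.0f}%")  # roughly +27%
```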
Apache is able to compress proxy responses.
Thanks a lot, this looks much better. Do you happen to have some stats for July 9 or July 10, maybe? The performance degradation started on July 10th at around 21:00 UTC. It would be very interesting to see some stats from before that point.
I enabled the tiered caching around 21:00 UTC on 11th July 2024.
No. Earliest we currently have is July 11 from 00:00 UTC.
10th July 2024 @ 9pm UTC was the maintenance window. I enabled options which add some RAM overhead on the primary database server snap-01. The overhead is expected to be up to around 10%.
The server was also rebooted into a new kernel version which can affect performance.
Good point. This would indicate that we're likely spending more time on database queries, and the slowdown is unrelated to CF.
The CGImap log files should include a breakdown per SQL statement. Maybe there's a small number of statements in there that are responsible for a major part of the runtime. We could then take a closer look at what's going on there.
I had a look at the snap-01 PostgreSQL stats. For some reason we're seeing significantly more sequential scans since the reboot (5/s instead of 0.0x/s)
The following chart suggests it's oauth_tokens, which is kind of odd, since that table is used for OAuth 1.0a.
It's not clear if this is related to our issue here with cs uploads.
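The seq-scan rate quoted above (5/s instead of 0.0x/s) can be derived from periodic snapshots of PostgreSQL's cumulative `seq_scan` counters. A minimal sketch, with made-up snapshot values; on a live system the dicts would come from `SELECT relname, seq_scan FROM pg_stat_user_tables`:

```python
# Derive per-table sequential-scan rates from two snapshots of the
# cumulative seq_scan counters, taken `interval` seconds apart.
def seq_scan_rates(before, after, interval):
    """Return sequential scans per second for tables whose counter grew."""
    return {
        table: (after[table] - before[table]) / interval
        for table in before
        if after.get(table, 0) > before[table]
    }

# Hypothetical snapshots 360 s apart (table names and counts are invented):
before = {"oauth_tokens": 1_000, "changesets": 500}
after = {"oauth_tokens": 2_800, "changesets": 500}
print(seq_scan_rates(before, after, 360))  # {'oauth_tokens': 5.0}
```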
One thing to try is to restart all 3 CGImap instances on spike-0[6-8] and check if things improve.
On July 10th, the db host has been switched twice (https://github.com/openstreetmap/chef/commit/f2e063ccae80b92c17712da609d8d74ffef08ba9 and https://github.com/openstreetmap/chef/commit/b7f67ceb48b5f0fc94e9a9db74ddcf4a98a50fe9), which might have left CGImap in some strange state.
Without seeing the log files, it's nearly impossible to figure out what's going on. So this is a bit of a wild guess only.
After today's snap-01 restart, response times are slightly better again. Are we experiencing some issue with the storage, e.g. reduced write I/O, that is impacting heavy insert activity on the db? If so, I would close this issue, since the root cause would be known.
Is this ticket still relevant or can it be closed?
I don't think the issue has been resolved. Average times for the upload are still quite a bit higher than before. It would be helpful to get some stats from the cgimap logfile and see where the time is being spent. Without further details, it's fairly difficult to improve the situation.
snap-01 has just been restarted, upload seems better now:
@Firefishy : if the improvement is an expected outcome of the server maintenance, we can close this issue right away.
I replaced a faulty DIMM. The faulty memory location would have been remapped by the CPU and likely had a performance impact... but I am not 100% sure.
upload seems better now
That was probably a bit of an understatement.
Performance improvements are roughly in the 10x range, compared to last week.
Changeset uploads are even twice as fast as the respective pre-July runtimes (there were some performance related code changes back in August).
I'm wondering why changeset uploads became significantly slower from July 11 onward, while all other request types remained stable or even showed improved response times.
Also, the average response size shows an unexpected drop in the same time interval.
To me it seems something is working in a somewhat different way now, though I have no idea where these differences are coming from. Does CF add some additional latency that would show up in the logs now?