openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Wiki slowdown #1040

Closed: matkoniecz closed this issue 4 months ago

matkoniecz commented 4 months ago

https://prometheus.openstreetmap.org/d/HMPwSI2Gz/php-fpm?orgId=1&refresh=1m&viewPanel=10&var-instance=konqi&var-pool=wiki.openstreetmap.org&from=now-60d&to=now

A user reported the wiki as being really slow.

Looking at this graph, the slowdown also seems to be visible in the statistics:

[screenshot]

tomhughes commented 4 months ago

I think that chart is misleading.

I'm not sure what was going on, but there was no sign of increased CPU load or requests over the same time period, so I suspect there were some outliers throwing the average off or something. I've restarted php-fpm now, which seems to have cleared it.
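A quick sketch of why a mean-based chart can mislead here: a handful of stuck or very slow requests drag the average far above the typical request, while the median barely moves. (Numbers below are illustrative, not real wiki data.)

```python
# Illustration: a few outlier requests inflate the mean response time.
from statistics import mean, median

# 98 fast requests at ~0.2 s plus 2 stuck requests at 60 s (made-up figures)
latencies = [0.2] * 98 + [60.0] * 2

print(f"mean:   {mean(latencies):.3f} s")   # inflated by the two outliers
print(f"median: {median(latencies):.3f} s") # still reflects the typical request
```

This is one reason monitoring dashboards often prefer percentiles (p50/p95/p99) over plain averages for latency.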

I don't think it was responsible for any slowness in answering requests though. I think that relates more to the gaps in this chart:

[screenshot]

which represent times when a spike in requests causes us to run out of php-fpm workers.
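For context, the number of concurrent php-fpm workers is capped by the pool configuration; once the cap is hit, further requests queue or fail. A hypothetical pool config to show the relevant knobs (directive names are real php-fpm settings, but the values and file path are illustrative, not the actual OSMF configuration):

```ini
; /etc/php/fpm/pool.d/wiki.conf (illustrative values only)
[wiki.openstreetmap.org]
pm = dynamic
pm.max_children = 50      ; hard cap on concurrent workers; requests queue once exhausted
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15
pm.max_requests = 500     ; recycle workers periodically to contain memory growth
```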

I've blocked a few scraper idiots to see if that helps though I'm not sure if they were really the problem.

matkoniecz commented 4 months ago

> I've blocked a few scraper idiots to see if that helps though I'm not sure if they were really the problem.

Maybe some badly written scraper kept hitting a timing-out/expensive page and inflated the response time on that graph?

Whatever was measured by https://prometheus.openstreetmap.org/d/HMPwSI2Gz/php-fpm?orgId=1&refresh=1m&viewPanel=10&var-instance=konqi&var-pool=wiki.openstreetmap.org&from=now-1d&to=now got better, so I will close this one.

[screenshot]

tomhughes commented 4 months ago

Like I say, I'm not at all sure that graph is helpful here, though I'm not sure what is going on with it.

I don't think I'd call this fixed yet anyway. There was another spike in workers this morning that came close to running out.

I suspect we may need to increase the maximum size of the pool. The spike this morning had at least two people fetching Map_Features at more or less the same time, and a single fetch of that page can lead to 600 or so loads of images and other resources in a short period of time.
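A back-of-the-envelope check using the figures above (the pool size here is a hypothetical example, not the actual OSMF setting):

```python
# Rough sizing: two near-simultaneous Map_Features page views, each
# triggering ~600 image/resource subrequests (figure from the comment
# above), against a hypothetical pool of 50 php-fpm workers.
subrequests_per_view = 600
concurrent_views = 2
pool_size = 50  # hypothetical pm.max_children

burst = subrequests_per_view * concurrent_views
print(burst)              # requests arriving in a short window
print(burst / pool_size)  # requests each worker must serve to drain the burst
```

Even if each subrequest is cheap, a burst this size saturates a modest pool unless workers turn requests around very quickly, which is consistent with the gaps seen in the chart.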

Firefishy commented 4 months ago

The issues seem to be resolved now.