openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/
99 stars 12 forks source link

Wiki pages with many Wikimedia Commons images often return HTTP 504 error #466

Closed 1ec5 closed 2 years ago

1ec5 commented 4 years ago

Over the past couple weeks, I’ve noticed that any wiki page with many images from Wikimedia Commons (a couple dozen or more?) returns a 504 Gateway Timeout error the first time you try to access it but returns the expected response if you retry the request shortly after. I’m not sure if the problem is on our end or Wikimedia’s.

Some examples of affected pages:

https://wiki.openstreetmap.org/wiki/Key:maxweight https://wiki.openstreetmap.org/wiki/Ohio/Map_features https://wiki.openstreetmap.org/wiki/United_States_roads_tagging/Routes https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative (spotted by another mapper on OSMUS Slack)

Some of these pages could probably stand to use fewer images – the flag icons on the maxweight page don’t make the page much more navigable. But even those icons weren’t causing any timeouts in the past.

pnorman commented 4 years ago

I've also been seeing this on pages with many images.

grischard commented 4 years ago

I've been seeing HTTP 429 when browsing Wikimedia Commons pages with lots of images, e.g. https://commons.wikimedia.org/wiki/User:Zinneke/Nei_Biller_iwwer_L%C3%ABtzebuerg/2020_October_11-20

Could this be a similar throttling mechanism?

AntiCompositeNumber commented 4 years ago

Since it's always the first thing I ask for:

GET /wiki/Key:maxweight HTTP/2
Host: wiki.openstreetmap.org
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://github.com/openstreetmap/operations/issues/466
Connection: keep-alive
Cookie: forceHTTPS=true; wikiUserID=104750; wikiUserName=AntiCompositeNumber; wikiToken=***; VEE=wikitext; wikiEditor-0-toolbar-section=advanced; wiki_session=***
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
----
HTTP/2 504 Gateway Timeout
date: Tue, 03 Nov 2020 01:12:19 GMT
server: Apache/2.4.41 (Ubuntu)
strict-transport-security: max-age=31536000; includeSubDomains; preload
expect-ct: max-age=0, report-uri="https://openstreetmap.report-uri.com/r/d/ct/reportOnly"
content-length: 336
content-type: text/html; charset=iso-8859-1
X-Firefox-Spdy: h2
----
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>504 Gateway Timeout</title>
</head><body>
<h1>Gateway Timeout</h1>
<p>The gateway did not receive a timely response
from the upstream server or application.</p>
<hr>
<address>Apache/2.4.41 (Ubuntu) Server at wiki.openstreetmap.org Port 443</address>
</body></html>

I've been seeing HTTP 429 when browsing Wikimedia Commons pages with lots of images, e.g. https://commons.wikimedia.org/wiki/User:Zinneke/Nei_Biller_iwwer_L%C3%ABtzebuerg/2020_October_11-20

That's https://phabricator.wikimedia.org/T266155, which is difficult to fix while maintaining stability but is being worked on. It also applies only to thumbnails that have not been previously generated.

Could this be a similar throttling mechanism?

It's hard to know exactly what's going on without server logs or network traffic, but it doesn't look like it. It's not the same throttle as T266155, since that throttle is imposed by the thumbnailing system. InstantCommons works by querying the imageinfo API on Commons for the thumbnail URL and then hotlinking it. MediaWiki (here or on Commons) doesn't communicate directly with the thumbnailer about a request.

This is also an Apache error page, not a MediaWiki Exception, which I would expect if Commons was returning errors. I also found that the 504s were returned after 30 seconds every time. That makes it look like a timeout imposed by the OSM wiki Apache server itself, probably this Apache timeout.

Wikimedia timeouts are typically set to 1 minute, and the API query over the Internet is the most likely to be causing excess delays. While data is not available specifically for the imageinfo module, the expected timeout of 1 minute holds true for all query modules combined: Distribution of execution time (API action=query). >95% of queries will finish in under 1 second, but that remaining few percent will stretch it all the way to a minute.

The $100,000 question is if this is a problem with API requests about a few specific files being really slow (and bringing everything else down with them) or if this is caused by a lot of files with moderate API latency that combine to exceed the timeout. My hunch is that it's more toward the former, but in reality it's probably a combination of both factors. The best way to figure that out would be through inspecting server logs, but one could use trial and error to see if any specific file is being problematic.

1ec5 commented 4 years ago

Thank you for the detailed analysis!

In case it’s relevant, each of the pages listed in https://github.com/openstreetmap/operations/issues/466#issue-726791145 has a bunch of fairly large thumbnails of SVGs. These pages have always taken a relatively long time to load but previously didn’t time out. On the other hand, https://wiki.openstreetmap.org/wiki/Main_Page has lots of little thumbnails of SVGs (a mix of Commons and local images) and it hasn’t ever returned 504 errors on me.

tomhughes commented 4 years ago

I'm not sure what commons has to do with anything... Isn't this just the usual problem, that we have ridiculous pages that take a long time to render if they've changed?

The "upstream" here is just because apache is proxying to FPM which is where the PHP to render the page is running - it's nothing to do with commons or any external connection.

tomhughes commented 4 years ago

I've increased the proxy timeout now to see if that improves things.

nemobis commented 3 years ago

has a bunch of fairly large thumbnails of SVGs.

Sounds unlikely to be the culprit: the thumbnails are hotlinked from Commons, so if they're slow to generate you might see an empty box but parsing should not be affected. Unless something changed in recent releases that I'm not aware of, the parser only needs to know the image metadata (mostly existence, size).

If you suspect that local SVG rendering is at fault, you may test the theory by replacing all local SVG files with remote ones.

[[Key:maxweight]] has 6 local files and 66 remote files. I consider that a low number.

I've increased the proxy timeout now to see if that improves things.

I managed to load all the pages linked in the description, but some took 30-60 seconds. Example:

<!-- 
NewPP limit report
Cached time: 20210110125238
Cache expiry: 86400
Dynamic content: false
Complications: []
CPU time usage: 5.209 seconds
Real time usage: 43.804 seconds
Preprocessor visited node count: 23460/1000000
Post‐expand include size: 122255/2097152 bytes
Template argument size: 23214/2097152 bytes
Highest expansion depth: 19/40
Expensive parser function count: 1/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 41602/5000000 bytes
Lua time usage: 0.760/15 seconds
Lua virtual size: 9.57 MB/50 MB
Lua estimated memory usage: 0 bytes
Number of Wikibase entities loaded: 0/250
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00% 43599.598      1 -total
  2.46% 1073.747    226 Template:Tag
  2.22%  968.525    350 Template:TagPagename
  1.99%  867.373    367 Template:Langcode
  0.51%  221.585      8 Template:Convert
  0.27%  118.179      5 Template:LangSwitch
  0.23%   98.431      2 Template:Icon
  0.17%   75.044      2 Template:IconWay
  0.08%   36.894     11 Template:Nobr
  0.07%   30.312      3 Template:LL
-->

<!-- Saved in parser cache with key wiki:pcache:idhash:127219-0!userlang=it and timestamp 20210110125154 and revision id 2080710
 -->

Reloading the same page (Ohio) with another language takes much less (although still longer than I'd like), suggesting that whatever was the culprit is now cached:

CPU time usage: 1.629 seconds
Real time usage: 7.744 seconds
[...]

<!-- Saved in parser cache with key wiki:pcache:idhash:127219-0!userlang=la and timestamp 20210110130142 and revision id 2080710
 -->

I don't remember how smart the parser cache is here about reusing stuff across interface languages etc. It's probably useful to give more space to the various caches to reduce misses, starting from the parser cache which certainly helps.

I'm happy to help with further debugging if needed.

1ec5 commented 3 years ago

I'm not sure what commons has to do with anything... Isn't this just the usual problem, that we have ridiculous pages that take a long time to render if they've changed?

Sorry for the delayed response. That may be a problem in general, but I only opened this issue a few weeks after I (and others) noticed a marked degradation in performance that unfortunately hasn’t improved. The pages linked in the original post had been performing perfectly well when their image galleries were originally put together; I don’t think we would’ve kept piling onto those pages had the issue arisen back then.

Sounds unlikely to be the culprit: the thumbnails are hotlinked from Commons, so if they're slow to generate you might see an empty box but parsing should not be affected. Unless something changed in recent releases that I'm not aware of, the parser only needs to know the image metadata (mostly existence, size).

The issue is definitely related to images, not templates. We also have template-heavy pages and pages full of local images that don’t ever time out, but it’s the Commons-heavy pages that are regularly timing out or coming back with missing thumbnails, hence the initial assumption that InstantCommons has something to do with it.

nemobis commented 3 years ago

Il 28/10/21 08:19, Minh Nguyễn ha scritto:

The issue is definitely related to images, not templates.

Yes, I just meant that an error on the upload.wikimedia.org URL should not affect the generation of the HTML. On the other hand, if API requests to get image metadata from Commons get a long time, or time out, parsing could be blocked. So it's probably worth checking whether these can be optimised somehow.

1ec5 commented 3 years ago

Following the changes in https://github.com/openstreetmap/operations/issues/573#issuecomment-953761735, “MUTCD/R” still times out with an HTTP 504 error, but now there’s actually an error page rather than a blank page. (That’s an improvement for a pathologically image-heavy page. America has too many traffic signs!) Reloading produces the page with some missing thumbnails, as in #573. I wouldn’t be surprised if it’s some sort of timeout, though it always affects a randomly rotating set of images nondeterministically. 🤷‍♂️

On the other hand, “United States roads tagging/Routes” still times out with a blank page instead of the error page. I don’t know if that means anything.

HikeAndMap commented 3 years ago

https://github.com/openstreetmap/operations/issues/573

Concerning this issue - this has been resolved with the changes Tom Hughes made. It now consistently loads all images properly.

if not yet cached, takes a few seconds to load the page.. if cached loads within an eyeblink.. but either way it always loads.

With "extremely insane large image gallery" there's still an issue - even on the original wiki!

https://en.wikipedia.org/wiki/Comparison_of_European_road_signs that's on the original wiki and sometimes also fails to load all the images.

BUT

https://en.wikipedia.org/wiki/Comparison_of_European_road_signs?action=purge just add an ?action=purge behind it and do that once or twice and it's all there..

Explained to me in the MediaWiki group on Telegram by Jon Harald Søby - all credits go to this wonderful helpful person.

HikeAndMap commented 2 years ago

Hello Tom,

https://www.mediawiki.org/wiki/Extension:QuickInstantCommons

Might this be something to try to resolve this issue?

matkoniecz commented 2 years ago

https://t.me/wmhack/18639

Extension:QuickInstantCommons was described as currently best available solution - with caveat that it is in beta

happy5214 commented 2 years ago

https://t.me/wmhack/18639

Extension:QuickInstantCommons was described as currently best available solution - with caveat that it is in beta

QuickInstantCommons is now listed as stable, according to https://www.mediawiki.org/wiki/Extension:QuickInstantCommons. Perhaps we should investigate further?

happy5214 commented 2 years ago

https://github.com/openstreetmap/chef/pull/490 was reverted in https://github.com/openstreetmap/chef/pull/491. Please reopen.

Firefishy commented 2 years ago

I think this issue needs to be fixed before we can deploy: https://github.com/wikimedia/mediawiki-extensions-QuickInstantCommons/issues/1

It seems similar to why I had to revert.

tomhughes commented 2 years ago

Oh I thought deploying that was meant to be the fix.

Firefishy commented 2 years ago

I tested a deploy of https://github.com/wikimedia/mediawiki-extensions-QuickInstantCommons last night, it throws an exception. I had to revert adding it.

HikeAndMap commented 2 years ago

I tested a deploy of https://github.com/wikimedia/mediawiki-extensions-QuickInstantCommons last night, it throws an exception. I had to revert adding it.

Thanks for testing - I agree stability above anything else of course. It annoys me beyond acceptance that I have to load and reload and reload to show all the images - but I rather reload 10 times than a single exception of course.

1ec5 commented 2 years ago

I've been seeing HTTP 429 when browsing Wikimedia Commons pages with lots of images

Yes, the performance regression for Commons images – as opposed to the overall slowness in #725 – is partly because of a per-IP throttle that Commons instituted in May 2020 in response to a service reliability incident. This throttle is known to affect Commons itself, but not as severely as the OSM Wiki. I don’t know why it didn’t become apparent to OSM Wiki users until October 2020. (I regularly consult some of the image-heaviest pages on the wiki and definitely wouldn’t have missed this issue for five months.)

Fortunately, QuickInstantCommons has just been deployed in https://github.com/openstreetmap/operations/issues/573#issuecomment-1204630353 https://github.com/openstreetmap/operations/issues/725#issuecomment-1204631598, and so far it seems to have restored the performance of massive pages like the ones in https://github.com/openstreetmap/operations/issues/466#issue-726791145 to what it was prior to May 2020.

/cc @Firefishy

Firefishy commented 2 years ago

I think this is now resolved.

HikeAndMap commented 1 year ago

I can confirm - I hate to prematurely claim something has been fixed - after a few months of testing loading pages with large amounts of pictures.. it all works stable now .

Thanks to the team doing this! Much appreciated to take care of us mappers :)