openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Gray tiles #366

Closed nrenner closed 3 years ago

nrenner commented 4 years ago

[Previous discussion in openstreetmap/chef#264, opening a new issue as it doesn't seem clear if the style release really is the true source of the problem. Please let me know if this should be reported somewhere else.]

People are still reporting slow and patchy tile delivery for the standard style on openstreetmap.org (tile.openstreetmap.org) this week.

Screenshot after one minute of loading, from today by kreuzschnabel (map)

Other reports this week:

tomhughes commented 4 years ago

Yes, demand is in excess of supply, squid is shit and the service is collapsing under the load.

Sadly there is very little I can do to help. I can sometimes fix an individual case, but normally that just moves the problem.

nrenner commented 4 years ago

I can reproduce when zooming in somewhere rural (or to z19, or in the ocean) where tiles have not been (re-)rendered yet and panning around a bit.

Don't know if you are already aware or if this helps:

In Chrome, in the network tab of the developer tools (F12) I added the custom response headers x-cache + x-tilerender (see Chrome DevTools Reference), so it shows for all tile requests what cache and tile servers are used:

Screenshot: Chrome network tab with custom headers (https://www.openstreetmap.org/#map=13/34.4091/-40.3620)
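The same per-request view can be tallied outside the browser. The following is a minimal sketch, assuming the response headers of each tile request have already been collected as plain dictionaries (for example from a HAR export). The function names and the sample header values are hypothetical; only the header names `x-cache` and `x-tilerender` come from the observation above.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple


def cache_and_renderer(headers: Mapping[str, str]) -> Tuple[str, str]:
    """Return the (x-cache, x-tilerender) values of one tile response.

    Header names are matched case-insensitively; a missing header yields "".
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("x-cache", ""), lowered.get("x-tilerender", "")


def tally(responses: Iterable[Mapping[str, str]]) -> Counter:
    """Count how many responses came from each (cache, renderer) pair."""
    return Counter(cache_and_renderer(h) for h in responses)


# Hypothetical sample data; real header values may be formatted differently.
sample = [
    {"X-Cache": "katie", "X-TileRender": "odin"},
    {"x-cache": "katie", "x-tilerender": "odin"},
    {"X-Cache": "angor"},  # e.g. a 404 without a renderer header
]
counts = tally(sample)
```

Here `counts` maps each (cache, renderer) pair to the number of tiles it served, which makes a sudden spread across many caches easy to spot.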

I observe the following pattern when panning around (my IP located in Germany):

  1. for the first few request batches, tiles arrive quickly from a German cache (kalessin, katie, keizer, konqi) and odin
  2. in the next batch(es), some of the requests are slower (a couple of seconds up to one minute), still from German servers
  3. after that, tiles return 404 or take one minute or longer, and come from all kinds of cache and rendering servers

If I directly connect to odin.openstreetmap.org in a custom Leaflet map, everything is fine and most requests are <100ms.

nrenner commented 4 years ago

Would it be an option to remove the flag/timestamp from the carto release that marks all tiles as dirty/outdated, and from now on rerender only when the data changes?

tomhughes commented 4 years ago

Sounds like you're just hitting the deliberate rate limits if it starts fast and then gets slow after you have reached the bucket limit.
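To illustrate why a bucket-style limit produces exactly that "fast, then slow" pattern, here is a generic token bucket sketch. It is not the actual configuration of the tile caches; the class name, capacity and refill rate are hypothetical, and a real proxy would typically delay throttled requests rather than refuse them.

```python
class TokenBucket:
    """Generic token bucket: bursts pass until the bucket is empty."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full, so the first burst is fast
        self.last = 0.0         # timestamp of the previous request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` fits the limit."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical numbers: a burst of 3, refilling 1 token per second.
bucket = TokenBucket(capacity=3, refill_per_second=1)
burst = [bucket.allow(0.0) for _ in range(5)]
later = bucket.allow(2.0)
```

With these numbers the first three requests of the burst pass and the next two are throttled; after a two-second pause the bucket has refilled enough for the next request to pass again.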

Prince-Kassad commented 4 years ago

I just got served a 1 week old tile on z18, and that suggests to me something is really wrong. Tile rendering has never been behind on the order of weeks before the update. Something is going very wrong there.

nrenner commented 4 years ago

I wondered where those patchy gray tiles come from, so I made some more tests this week. I would expect that for dropped metatiles all corresponding tiles within that rectangle would be missing, but instead gray tiles appear randomly between loaded tiles.
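For reference, the expectation about metatiles can be made concrete with a short sketch. It assumes the 8×8 metatile size that mod_tile/renderd uses by default; whether the OSM render servers use that value is an assumption here, and the function names are hypothetical.

```python
from typing import List, Tuple

METATILE = 8  # assumption: default metatile edge length, in tiles


def metatile_origin(x: int, y: int, size: int = METATILE) -> Tuple[int, int]:
    """Return the top-left tile (x, y) of the metatile containing tile (x, y)."""
    return (x - x % size, y - y % size)


def tiles_in_metatile(x: int, y: int, z: int,
                      size: int = METATILE) -> List[Tuple[int, int]]:
    """List every tile that shares a metatile with tile (x, y) at zoom z."""
    span = min(size, 2 ** z)  # low zoom levels have fewer tiles than `size`
    ox, oy = metatile_origin(x, y, span)
    return [(ox + dx, oy + dy) for dy in range(span) for dx in range(span)]


origin = metatile_origin(13, 22)
siblings = tiles_in_metatile(13, 22, 19)
```

If a whole metatile were dropped, all tiles in `siblings` that are on screen should be missing together, rather than gray tiles appearing in between loaded ones.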

My test case is a single map call with 12 tiles at zoom 19 somewhere in the ocean. I called it at various times this week, each time within a different single metatile, to ensure a new rendering. To avoid throttling, I sometimes changed my IP, avoided other map uses and did no panning or zooming, just copied the URL hash from another map.

My observations:

An extreme example from yesterday at 16:28 (UTC) is this screenshot with four 404s, ten caches and successful requests to five (!) different render servers:

Screenshot 2020-02-27 17-33-26 CET

So for a single map call with 12 tiles, the same metatile was rendered five times instead of just once (see status request to odin, ysera, pyrene, scorch, bowser).
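A check like this can be automated once the render server of each tile response is known. The sketch below groups observations by metatile and reports those served by more than one render server. It assumes an 8×8 metatile and a hand-collected list of `(z, x, y, renderer)` tuples; the function names and the tile coordinates in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Observation = Tuple[int, int, int, str]  # (zoom, x, y, render server)
MetatileKey = Tuple[int, int, int]       # (zoom, metatile column, metatile row)


def renderers_per_metatile(observations: Iterable[Observation],
                           size: int = 8) -> Dict[MetatileKey, Set[str]]:
    """Collect the set of render servers seen for each metatile."""
    seen: Dict[MetatileKey, Set[str]] = defaultdict(set)
    for z, x, y, renderer in observations:
        seen[(z, x // size, y // size)].add(renderer)
    return dict(seen)


def multiply_rendered(observations: Iterable[Observation],
                      size: int = 8) -> Dict[MetatileKey, List[str]]:
    """Return only the metatiles that more than one server rendered."""
    grouped = renderers_per_metatile(observations, size)
    return {key: sorted(names) for key, names in grouped.items()
            if len(names) > 1}


# Hypothetical coordinates; server names taken from the report above.
observed = [
    (19, 100, 200, "odin"),
    (19, 101, 200, "ysera"),
    (19, 108, 200, "odin"),
]
duplicates = multiply_rendered(observed)
```

In this example the first two tiles fall into the same metatile but name different render servers, so that metatile is reported; the third tile lies in a neighbouring metatile and is not.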

The 404 caches are all forwarding to rhaegal (according to Squid relay stats and nslookup: angor, sarkany, drogon, viserion).

Now I wonder: Why are individual requests within a single map call distributed to random caches? Why is there a failover/load balancing to other caches at all, as I haven't seen any 404s from odin?

fracgiu commented 4 years ago

I think I have the same problem, and I have some more information. In my case I access the Internet via a proxy that requires authentication.

Without going into details: when I open the HTML page (before the map is loaded), sometimes the page asks me for the proxy credentials and sometimes not (I don't know the logic, but it seems to be related to caching). When the page asks for proxy credentials and I enter them, everything works fine; otherwise I get those errors (the Open Layer Engine seems to be "offline", and pieces of the map are sometimes shown, depending on what is in the cache, I guess).

Also, even when it is "offline" (and not working), sometimes (but rarely) after doing "Zoom In" and "Zoom Out" the browser asks me for proxy credentials (not always), and the Open Layer Engine comes "back online" and starts to work again.

Now, as a workaround, maybe we could make a "request on the Internet" before loading everything, or "skip the cache" in some way (I tried with the meta tag, but it had no effect). Any ideas?

I hope this helps as well.

pnorman commented 3 years ago

We have different render servers and are moving to a commercial CDN, so any capacity issues at the time this issue was opened are likely to be different now.