osm2pgsql-dev / osm2pgsql

OpenStreetMap data to PostgreSQL converter
https://osm2pgsql.org
GNU General Public License v2.0

osm2pgsql using too much memory for expire lists #2190

Open joto opened 3 months ago

joto commented 3 months ago

There is some talk about osm2pgsql using too much memory for expire lists. This is something we should look into. Also mentioned in #2185. Independent of any decisions or configuration options that change how expire works, we must always ensure system stability; a process that uses an "unbounded" amount of memory is obviously not okay.

Does anybody have more information about this?

tomhughes commented 3 months ago

All I really know is that, faced with the big change lists from the recent vandalism, we were seeing it OOM trying to process the diffs. When it did manage to output the lists, they could be 60 GB in size.

tomhughes commented 3 months ago

This is from one of the render servers:

[Graph: osm2pgsql memory usage on a render server]

You can see the memory repeatedly ramping up until osm2pgsql gets OOM-killed; eventually renderd gets killed as well, and that leaves enough headroom for osm2pgsql to succeed.

simonpoole commented 3 months ago

My, very old, osm2pgsql instance came to a standstill during the same event. This is a 128 GB machine and osm2pgsql was hitting 110 GB+. Processing the relevant diffs with tile expiration turned off allowed things to succeed.

pnorman commented 3 months ago

If we stored tile locations in a fixed-size array it would take 5.3 GB for z19 expiry. That's a bad structure for small lists of tiles, but it indicates that the maximum RAM usage should be in the low tens of GB, not hundreds of GB.

simonpoole commented 3 months ago

I would note that the server only generates and expires tiles up to z18, so the memory required should be even less.

joto commented 3 months ago

The current code basically uses a std::unordered_set<int64_t> for storing the expired tiles. Entries in an unordered set (internally a hash table) have a large memory overhead; we probably need something like 32 bytes for each entry. Looking at the spikes in the graph above, it seems we needed something like 64 GB in those cases, which means we had about 2 billion tiles in our expire list.
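
For illustration, a minimal sketch of that kind of storage and the back-of-envelope arithmetic; the key encoding and all names here are made up, not the actual osm2pgsql code:

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical key encoding: pack tile x and y into one 64-bit value.
// The real osm2pgsql encoding may differ; this only shows the shape
// of the data structure.
int64_t tile_key(uint32_t x, uint32_t y)
{
    return (static_cast<int64_t>(x) << 32) | y;
}

int main()
{
    std::unordered_set<int64_t> expired;
    expired.insert(tile_key(137408, 87111)); // some z18 tile

    // Back-of-envelope: the payload is 8 bytes, but hash node and
    // bucket overhead bring each entry to roughly 32 bytes, so
    //   2'000'000'000 entries * 32 bytes ~= 64 GB
    // which matches the spikes in the graph above.
}
```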

Written out, as Tom said, we had 60 GB of expire list; the file contains one line per tile in a zoom/x/y format. On zoom 18 the largest tile coordinate needs 6 digits each for x and y, plus two digits for the zoom level, two / characters and a newline, so that's a maximum of 6+6+2+2+1 = 17 characters per entry. Depending on the configuration the file will also have entries for lower zoom levels; the upper limit for that would be double the size, so 34 characters per entry, which fits with the calculation above.

2 billion tiles (on z18) is much more than I would ever have expected to see in an expire list. The maximum you can have at z18 is 4^18, which is about 69 billion tiles.

tomhughes commented 3 months ago

We actually expire at z13-19 in production.

joto commented 3 months ago

We should find a better data structure for keeping the tiles. The current solution is okay for small lists, but if the list gets too large, we could switch to a bitmap, which gives us a solid upper bound on memory use. The problem is that that's still too large, especially if we have configured several expire outputs: for z19 we need 4^19 / 8 bytes, that's 32 GB. We could use a Bloom filter to reduce memory use, but then we get a certain false positive rate.
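
A minimal sketch of such a bitmap, assuming one instance per expire output at a fixed zoom level (illustrative only, not the osm2pgsql implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per tile at a fixed zoom level: memory use is bounded at
// 4^zoom bits no matter how many tiles are marked. For z19 that is
// 2^38 bits = 32 GB, as noted above; for z16 it is only 512 MB.
class tile_bitmap
{
public:
    explicit tile_bitmap(unsigned zoom)
        : m_zoom(zoom), m_bits(std::size_t{1} << (2 * zoom), false)
    {}

    void set(uint32_t x, uint32_t y)
    {
        m_bits[(static_cast<std::size_t>(y) << m_zoom) | x] = true;
    }

    bool get(uint32_t x, uint32_t y) const
    {
        return m_bits[(static_cast<std::size_t>(y) << m_zoom) | x];
    }

private:
    unsigned m_zoom;
    std::vector<bool> m_bits; // bit-packed by the standard library
};
```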

A better approach might be to reduce the zoom level dynamically if there are many tiles. For each zoom level we drop, we need 4x less memory, so at z17 we are already at a maximum of 2 GB, which should be okay. The advantage of that approach is that expired tiles are probably somewhat clustered anyway, so we don't get as many false positives as with the Bloom filter. And because we are expiring tiles up through the zoom levels anyway, the middle zoom levels (which are the expensive ones to render) are unaffected.
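
A hedged sketch of that fallback, reusing the packed-key idea from the sketch above; the names and threshold handling are illustrative:

```cpp
#include <cstdint>
#include <unordered_set>
#include <utility>

// When the set at the configured zoom grows past some threshold,
// rebuild it one zoom level lower. Halving x and y merges each 2x2
// block of tiles into one parent tile, so each step cuts the maximum
// possible memory use by 4x.
struct expire_set
{
    unsigned zoom;
    std::unordered_set<int64_t> tiles;

    static int64_t key(uint32_t x, uint32_t y)
    {
        return (static_cast<int64_t>(x) << 32) | y;
    }

    void reduce_zoom()
    {
        std::unordered_set<int64_t> parents;
        for (int64_t const k : tiles) {
            auto const x = static_cast<uint32_t>(k >> 32);
            auto const y = static_cast<uint32_t>(k);
            parents.insert(key(x >> 1, y >> 1)); // parent at zoom - 1
        }
        tiles = std::move(parents);
        --zoom;
    }
};
```

A caller would invoke reduce_zoom() whenever tiles.size() crosses whatever threshold is configured.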

Of course the disadvantage of that approach is that, while we are protecting osm2pgsql from needing too much memory, we put a somewhat heavier burden on the rendering system.

We can come up with other ways to store the tiles, of course, but they probably all have some "failure case", some tile pattern where they need a lot of memory. Something like a checkerboard pattern will probably overstretch any data layout.

We might also want to look at the way we are reporting the expire list. Currently it is a text file in a CSV-like format, which has, of course, a huge overhead. But any other format would need to be implemented on the consumer side, too. Another option, already available for the flex output, is storing the expire list in the database.

pnorman commented 3 months ago

> ...if the list gets full, we could switch to a bitmap which gives us a solid upper bound for the memory use. Problem is that that's still too large, especially if we have configured several expire outputs

Multiple expire outputs is a good point. If it were a single output I don't think memory would be an issue because any server that can deal with a giant expiry list will have a decent amount of memory.

nrenner commented 3 months ago

As I understand it, the osm2pgsql expire list tracks individual tiles, while the OSMF servers render, store and expire metatiles (8x8 tiles)?

Metatiles for z19 should correspond to z16 tiles, for which we would not have a memory issue?

So, could there be an option to track metatiles in the expire list instead?

joto commented 2 months ago

@nrenner Good point, I didn't think of that. z16 should be no problem at all. If you are using metatiles, you just have to instruct osm2pgsql to expire z16 tiles and transform the tile numbers when updating. I don't think it is the job of osm2pgsql to do that transformation.
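
To illustrate the transformation with hypothetical helpers (not part of osm2pgsql or mod_tile):

```cpp
#include <cstdint>

// An 8x8 metatile at z19 covers exactly one z16 tile (8 = 2^3), so
// the mapping is a shift by 3 in both directions.
struct tile_xy { uint32_t x, y; };

// z19 tile -> the z16 tile osm2pgsql would put in the expire list
tile_xy z19_to_z16(tile_xy t) { return {t.x >> 3, t.y >> 3}; }

// z16 expire-list entry -> corner of the corresponding z19 metatile
tile_xy z16_to_z19_metatile(tile_xy t) { return {t.x << 3, t.y << 3}; }
```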

@tomhughes how are you handling this on OSMF servers?

tomhughes commented 2 months ago

We're just feeding the file to the render_expired tool from mod_tile/renderd.

joto commented 2 months ago

One option would be to check the size of the expire list and exit osm2pgsql with an error message if it becomes too large. This is better than just using more and more memory until we are killed by OOM. In addition it would, in some sense, protect later processing steps in the pipeline from overload. Even if we could handle a huge expire list, other parts of the system will not, so it might be better to die early.

This is, of course, not a real solution, because it still means that sysadmin intervention is necessary. But it is something we can implement very easily.

So the idea would be to keep a count of the number of entries in the expire list (or lists) and die if it becomes too large. We'd need to set a maximum number of entries. This should ideally be large enough that it is not reached in normal day-to-day operation. We might want to make this configurable; in that case we still need a good default and some guidance for the sysadmin on how to set it.

We can base the limit either on the absolute number of entries in the expire list, on a percentage of all tiles at the configured zoom level, or on the memory used (with some estimate of how much memory an entry in the expire list needs). I am not sure which value would be easiest for admins to understand.
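
As a sketch of the count-and-abort idea, with an invented limit check (no such option exists yet):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_set>

// Count entries as they are added and abort with a clear error message
// instead of growing until the OOM killer strikes. The limit and the
// error text are placeholders.
class bounded_expire_set
{
public:
    explicit bounded_expire_set(std::size_t max_entries)
        : m_max_entries(max_entries)
    {}

    void add(int64_t key)
    {
        m_tiles.insert(key);
        if (m_tiles.size() > m_max_entries) {
            throw std::runtime_error{
                "expire list has more than the configured maximum "
                "number of entries, aborting"};
        }
    }

private:
    std::size_t m_max_entries;
    std::unordered_set<int64_t> m_tiles;
};
```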

joto commented 2 months ago

https://github.com/openstreetmap/chef/commit/be274e1