osmcode / osmium-tool

Command line tool for working with OpenStreetMap data based on the Osmium library.
https://osmcode.org/osmium-tool/
GNU General Public License v3.0

Memory usage splitting planet into multiple outputs #109

Closed iandees closed 6 years ago

iandees commented 6 years ago

I'm reviving Mapzen's metro-extracts and am looking at using osmium extract to do the initial split of the OSM planet file (code is over at https://github.com/nextzen/metro-extracts/pull/2).

The config file I'm generating takes the GeoJSON of desired extracts and generates an osmium config file with two copies of every extract polygon. One polygon outputs to .osm.bz2, the other outputs to .osm.pbf. There are currently 200 cities, so I end up with a config file that has 400 polygons in it.
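For illustration, here is a sketch of what one city's entries in such a generated config could look like (the directory, output names, and polygon file are made up; the actual PR generates the polygons from the GeoJSON, so the real config differs):

```json
{
    "directory": "/data/extracts",
    "extracts": [
        {
            "output": "portland.osm.pbf",
            "polygon": { "file_name": "portland.geojson", "file_type": "geojson" }
        },
        {
            "output": "portland.osm.bz2",
            "polygon": { "file_name": "portland.geojson", "file_type": "geojson" }
        }
    ]
}
```

With ~200 cities and two output formats per city, the "extracts" array ends up with ~400 entries.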

When I try to run this config file on a ~64GB RAM machine the osmium process uses one CPU core for several minutes and then gets killed by oom_killer.

Can you talk about the expected memory usage when using osmium extract? Should I expect memory usage to scale with polygon count? Should I run each extract with a separate osmium extract process using parallel? Or use smaller groupings of extract areas?

joto commented 6 years ago

See the MEMORY USAGE section in the man page.

First you should probably get rid of the .osm.bz2 output and do that in a second step using osmium cat for each file (you can run (some of) those in parallel). Split up your 200 polygons into several config files and run them sequentially. You have to experiment a bit to see how many extracts you can do in each run.
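To make that concrete, a rough sketch of the suggested workflow (config and file names are placeholders):

```sh
# Step 1: run the extracts in smaller batches, writing PBF output only
osmium extract -c cities-part1.json planet.osm.pbf
osmium extract -c cities-part2.json planet.osm.pbf

# Step 2: convert each PBF to .osm.bz2 afterwards; the conversions are
# independent, so (some of) them can run in parallel, e.g. with GNU parallel
ls *.osm.pbf | parallel osmium cat {} -o {.}.bz2
```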

Running several osmium extract processes in parallel is wasteful.

It might or might not help if you do the extract in steps, for instance cutting out a rough bounding box of Europe in the first step, then all cities in Europe from that.
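For example (the bounding box values below are rough placeholders, not an exact definition of Europe):

```sh
# Step 1: rough pre-cut with a bounding box (-b LEFT,BOTTOM,RIGHT,TOP)
osmium extract -b -12,35,35,72 planet.osm.pbf -o europe.osm.pbf

# Step 2: cut the individual city polygons from the much smaller file
osmium extract -c europe-cities.json europe.osm.pbf
```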

iandees commented 6 years ago

Thanks for the information, @joto. I'll close this as I was going to offer to write some docs but you already did 😄.

iandees commented 6 years ago

MEMORY USAGE section of the man page

frafra commented 1 year ago

I have used valgrind. There is no memory leak, but osmium keeps the IDs of all matching nodes in memory, even when using --strategy simple, since it needs them to avoid writing ways or relations whose members are not in the extracted file.

https://github.com/osmcode/osmium-tool/blob/214cc1ea4016bee5deba5949dad7545655c58826/src/extract/strategy_simple.cpp#L63
https://github.com/osmcode/osmium-tool/blob/214cc1ea4016bee5deba5949dad7545655c58826/src/extract/strategy_simple.cpp#L73

Possible solutions:

Postponing the check has the disadvantage of producing quite big files, since ways and relations could be written whose members are not actually in the extract; a Bloom/cuckoo filter could be used to drastically reduce the number of members that are mistakenly written to disk. I also found a paper published in 2022 on building a perfect cuckoo filter (so no false positives or negatives), which could help reduce the memory requirements without changing the overall approach: https://pontarelli.di.uniroma1.it/publication/conext21/Conext21.pdf
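To illustrate the trade-off, here is a minimal, self-contained C++ sketch (not osmium-tool code and not the branch linked below) contrasting the exact node-ID set that the simple strategy conceptually uses with a small Bloom filter for the "is this member node in the extract?" check:

```cpp
// Illustrative sketch only: exact ID set vs. Bloom filter for member checks.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Exact membership: every matching node ID is remembered, so memory grows
// with the number of nodes in the extract.
struct ExactIdSet {
    std::unordered_set<std::uint64_t> ids;
    void add(std::uint64_t id) { ids.insert(id); }
    bool maybe_contains(std::uint64_t id) const { return ids.count(id) > 0; }
};

// Probabilistic membership: a Bloom filter of fixed size. It can report
// false positives (a member kept although its node is missing) but never
// false negatives, so no member that really is in the extract gets dropped.
struct BloomIdSet {
    std::vector<bool> bits;
    std::size_t num_hashes;

    explicit BloomIdSet(std::size_t num_bits, std::size_t k = 4)
        : bits(num_bits, false), num_hashes(k) {}

    std::size_t position(std::uint64_t id, std::size_t i) const {
        // Derive k different bit positions by mixing a per-hash constant into the ID.
        return std::hash<std::uint64_t>{}(id ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % bits.size();
    }

    void add(std::uint64_t id) {
        for (std::size_t i = 0; i < num_hashes; ++i) {
            bits[position(id, i)] = true;
        }
    }

    bool maybe_contains(std::uint64_t id) const {
        for (std::size_t i = 0; i < num_hashes; ++i) {
            if (!bits[position(id, i)]) {
                return false; // definitely not in the extract
            }
        }
        return true; // probably in the extract (may be a false positive)
    }
};
```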

frafra commented 1 year ago

I added a simple Bloom filter to the simple strategy. I extracted ~50000 nodes, ~6000 ways and a few relations out of a ~2 GB extract. Plain osmium uses ~700 MB just for keeping the node IDs; the Bloom filter version uses only ~70 MB, and it has no performance penalty.

This is just an experiment, but it shows a viable path for using osmium to do extracts of large areas.

https://github.com/osmcode/osmium-tool/compare/master...frafra:osmium-tool:bloom-experiment

joto commented 1 year ago

@frafra Please stop commenting on closed issues. It will just be ignored. If you have ideas on how to contribute, open a new issue for that. Note that possible solutions have to work with any size input data up to the whole planet and work with any size of extract from a tiny area to, again, the whole planet.

frafra commented 1 year ago

@joto Sorry, usually maintainers prefer not to have multiple issues for the same problem. It seems, then, that issues should only be created once a solution is provided :) I will follow your suggestion and experiment further so that osmium-tool/libosmium can extract large areas as well without consuming too much memory.