See the MEMORY USAGE section in the man page.
First, you should probably get rid of the .osm.bz2 output and do that in a second step using osmium cat for each file (you can run some of those conversions in parallel). Split your 200 polygons across several config files and run them sequentially. You will have to experiment a bit to see how many extracts you can do in each run; see the sketch below.
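A minimal sketch of that two-step workflow (the config and file names are made up, not from this thread):

```sh
# Step 1: run the extract once, writing only .osm.pbf outputs.
osmium extract --config extracts-pbf.json planet.osm.pbf

# Step 2: convert each PBF to .osm.bz2 with osmium cat; the
# conversions are independent, so some can run in parallel.
for f in extracts/*.osm.pbf; do
    osmium cat "$f" -o "${f%.osm.pbf}.osm.bz2" &
done
wait
```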
Running several osmium extract processes in parallel is wasteful.
It might or might not help to do the extract in steps, for instance cutting out a rough bounding box of Europe in a first step, then all the cities in Europe from that.
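For example (the bounding box values below are only illustrative, not an exact fit for Europe):

```sh
# Step 1: cut a rough Europe bounding box (LEFT,BOTTOM,RIGHT,TOP).
osmium extract --bbox -12,35,35,72 planet.osm.pbf -o europe.osm.pbf

# Step 2: extract the individual city polygons from the much smaller file.
osmium extract --config european-cities.json europe.osm.pbf
```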
Thanks for the information, @joto. I'll close this as I was going to offer to write some docs but you already did 😄.
I have used valgrind. There is no memory leak, but osmium keeps all the matching node IDs in memory, even when using --strategy simple, since it needs them to avoid writing members of ways or relations that are not in the extracted file.
https://github.com/osmcode/osmium-tool/blob/214cc1ea4016bee5deba5949dad7545655c58826/src/extract/strategy_simple.cpp#L63
https://github.com/osmcode/osmium-tool/blob/214cc1ea4016bee5deba5949dad7545655c58826/src/extract/strategy_simple.cpp#L73
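The pattern those lines implement is roughly the following (a hedged sketch, not the actual osmium-tool code; the names are invented):

```cpp
// Sketch of the memory-hungry part of the simple strategy: every node
// ID inside the polygon is remembered so that way/relation members can
// be filtered later. This ID set is what grows with the extract size.
#include <cstdint>
#include <unordered_set>

std::unordered_set<std::uint64_t> written_node_ids;

// First pass: a node falls inside the extract polygon, so record its ID.
void remember_node(std::uint64_t node_id) {
    written_node_ids.insert(node_id);
}

// Later pass: keep only way/relation member references that were written.
bool node_was_written(std::uint64_t node_id) {
    return written_node_ids.count(node_id) > 0;
}
```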
Possible solutions:

- Postpone the check. This has the disadvantage of producing quite big files, since ways and relations could be written with members that are not actually in the extract; a bloom/cuckoo filter could be used to drastically reduce the number of members that are mistakenly written to disk (a minimal sketch follows this list).
- I also found a paper (CoNEXT '21) on making a perfect cuckoo filter (thus no false positives or negatives), which could help reduce the memory requirements without changing the overall approach: https://pontarelli.di.uniroma1.it/publication/conext21/Conext21.pdf
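A minimal bloom-filter sketch for this idea, assuming 64-bit node IDs and two hash functions derived from one mixed hash (everything here is hypothetical, not code from osmium-tool):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Fixed-size bloom filter over OSM node IDs. It answers "definitely not
// added" or "maybe added": false positives are possible, false negatives
// are not, which is why a few extra members may still be written and
// would need to be dropped in a later cleanup pass.
class NodeIdBloomFilter {
    std::vector<bool> m_bits;

    std::size_t hash1(std::uint64_t id) const {
        return std::hash<std::uint64_t>{}(id) % m_bits.size();
    }
    std::size_t hash2(std::uint64_t id) const {
        // Multiply by a 64-bit odd constant to get a second, roughly
        // independent hash from the same input.
        return std::hash<std::uint64_t>{}(id * 0x9e3779b97f4a7c15ULL) % m_bits.size();
    }

public:
    explicit NodeIdBloomFilter(std::size_t bits) : m_bits(bits, false) {}

    void add(std::uint64_t id) {
        m_bits[hash1(id)] = true;
        m_bits[hash2(id)] = true;
    }

    bool maybe_contains(std::uint64_t id) const {
        return m_bits[hash1(id)] && m_bits[hash2(id)];
    }
};
```

The trade-off is the usual bloom-filter one: more bits per expected ID means fewer false positives, i.e. fewer out-of-extract members written that later need to be dropped.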
I added a simple bloom filter to the simple strategy. I extracted ~50,000 nodes, ~6,000 ways, and a few relations out of a ~2 GB extract. Stock osmium uses ~700 MB just for keeping the node IDs; the bloom version uses basically no additional memory (~70 MB), and it has no performance penalty.
This is just an experiment, but it shows a viable path to using osmium for extracts of large areas.
https://github.com/osmcode/osmium-tool/compare/master...frafra:osmium-tool:bloom-experiment
@frafra Please stop commenting on closed issues. It will just be ignored. If you have ideas on how to contribute, open a new issue for that. Note that possible solutions have to work with any size input data up to the whole planet and work with any size of extract from a tiny area to, again, the whole planet.
@joto Sorry, usually maintainers prefer not to have multiple issues for the same problem. It seems, then, that issues should only be created when a solution comes with them :) I will follow your suggestion and keep experimenting so that osmium-tool/libosmium can extract from large areas too without consuming too much memory.
I'm reviving Mapzen's metro-extracts and am looking at using osmium extract to do the initial split of the OSM planet file (code is over at https://github.com/nextzen/metro-extracts/pull/2). The script I'm using takes the GeoJSON of the desired extracts and generates an osmium config file with two copies of every extract polygon: one polygon outputs to .osm.bz2, the other outputs to .osm.pbf. There are currently 200 cities, so I end up with a config file that has 400 polygons in it.
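The per-city entries look roughly like this (a sketch of the osmium extract config format; the city name and file paths are made up):

```json
{
    "directory": "extracts",
    "extracts": [
        {
            "output": "portland.osm.pbf",
            "polygon": { "file_name": "portland.geojson", "file_type": "geojson" }
        },
        {
            "output": "portland.osm.bz2",
            "polygon": { "file_name": "portland.geojson", "file_type": "geojson" }
        }
    ]
}
```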
When I try to run this config file on a ~64 GB RAM machine, the osmium process uses one CPU core for several minutes and then gets killed by oom_killer.
Can you talk about the expected memory usage when using osmium extract? Should I expect memory usage to scale with polygon count? Should I run each extract with a separate osmium extract process using parallel? Or use smaller groupings of extract areas?