osmcode / osmium-tool

Command line tool for working with OpenStreetMap data based on the Osmium library.
https://osmcode.org/osmium-tool/
GNU General Public License v3.0

extract and check-refs use too much RAM with numerically high node IDs #234

Open RyanDeRose-TomTom opened 2 years ago

RyanDeRose-TomTom commented 2 years ago

What version of osmium-tool are you using?

osmium version 1.13.2 (v1.13.2-4-gf0657f8)
libosmium version 2.17.1
Supported PBF compression types: none zlib lz4

What operating system version are you using?

Ubuntu 18.04.6 LTS

Tell us something about your system

8 CPU cores, 32 GB RAM

What did you do exactly?

I have a custom-made PBF (it contains nodes and ways, but no relations) covering the area of Luxembourg, and I attempted to extract a region corresponding to one of the two level-8 mercator tiles that contain Luxembourg: 5.625,49.83798245308484,7.03125,50.73645513701065

With my file (15 MB), this fills up my system's RAM (32 GB) on the first pass and is killed. I tried check-refs to validate my file, and the same thing happened.

Eventually I tried renumbering my file, and afterwards extract works quickly, correctly, and with very little RAM. I am using large node IDs (which I can't really avoid for reasons), with values up to ~2e16, so I set out to recreate the issue on an official extract:

wget https://download.geofabrik.de/europe/luxembourg-latest.osm.pbf
osmium renumber luxembourg-latest.osm.pbf -o renum-1e16.osm.pbf -s 10000000000000000
osmium renumber luxembourg-latest.osm.pbf -o renum-2e16.osm.pbf -s 20000000000000000

etc., with the following extract:

osmium extract -b 5.625,49.83798245308484,7.03125,50.73645513701065 renum-3e16.osm.pbf -o renum-extract.osm.pbf -O -v

These are all still reasonable numbers, since they are far below the upper limit of 2^63-1 ≈ 9.22e18. I found that the 1e16, 2e16, and 3e16 files used a maximum of 9, 18, and 28 GB of RAM respectively (a linear increase with the node ID offset despite a constant node count), with anything larger being killed.
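A toy model may illustrate why RAM can scale with the largest ID rather than with the number of nodes. This is an assumption about the general class of data structure involved, not a description of osmium's actual code: a dense, ID-indexed set typically keeps a directory with one slot per possible page of IDs, so the directory alone grows linearly with the maximum ID, even if only a handful of IDs are stored. All names and the page size below are hypothetical.

```python
# Toy dense ID set (hypothetical; NOT osmium's implementation).
# The directory has one slot per 2**PAGE_BITS IDs up to the largest
# ID seen, so its length is proportional to max(ID), not to count.

PAGE_BITS = 23                      # assumed page size: 2**23 IDs per page
PAGE_SIZE = 1 << PAGE_BITS

class DenseIdSet:
    def __init__(self):
        self.directory = []         # one slot per possible page

    def add(self, osm_id):
        page_no = osm_id >> PAGE_BITS
        if page_no >= len(self.directory):
            # grow the directory up to the page containing osm_id
            self.directory.extend([None] * (page_no + 1 - len(self.directory)))
        if self.directory[page_no] is None:
            self.directory[page_no] = bytearray(PAGE_SIZE // 8)  # bitmap page
        offset = osm_id & (PAGE_SIZE - 1)
        self.directory[page_no][offset >> 3] |= 1 << (offset & 7)

s = DenseIdSet()
for i in (1, 100, 9_999_999):       # three IDs, two pages touched
    s.add(i)
print(len(s.directory))             # -> 2

# For a single ID near 1e16, the directory alone would need this many
# slots (we only compute the number; actually allocating it is the bug):
print((10**16 >> PAGE_BITS) + 1)    # -> 1192092896 (~1.2e9 slots)
```

With 8-byte slots, ~1.2e9 directory entries is already on the order of 10 GB for a single high ID, which roughly matches the observed growth of RAM with the renumbering offset (again, under the assumed page size).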

I can work around this with renumbering, but this prevents work from being done in parallel on different machines which don't have access to the same renumbering index.

joto commented 2 years ago

This is a known limitation of the current implementation. It isn't a problem with OSM data, which doesn't have such large IDs, so it is unlikely to be fixed. If this is needed for your non-OSM use case and you are willing to put some money into it, I do contract development. Please contact me directly.
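For sparse, numerically large IDs like the reporter's, one alternative (a sketch of the general technique, not something osmium is confirmed to offer for this path; `SparseIdSet` is a hypothetical name) is an ID set whose memory scales with the number of IDs stored rather than with their magnitude, e.g. a sorted vector with binary search:

```python
# Sparse ID set sketch: memory is proportional to the count of stored
# IDs, independent of how large the ID values are.
import bisect

class SparseIdSet:
    def __init__(self):
        self._ids = []              # sorted list of stored IDs

    def add(self, osm_id):
        i = bisect.bisect_left(self._ids, osm_id)
        if i == len(self._ids) or self._ids[i] != osm_id:
            self._ids.insert(i, osm_id)

    def __contains__(self, osm_id):
        i = bisect.bisect_left(self._ids, osm_id)
        return i < len(self._ids) and self._ids[i] == osm_id

s = SparseIdSet()
s.add(30_000_000_000_000_123)       # a 3e16-range ID costs one entry
s.add(7)
print(30_000_000_000_000_123 in s)  # -> True
print(8 in s)                       # -> False
```

The trade-off is higher per-element cost (and O(n) insertion for a plain sorted vector), which is why dense structures win for real OSM data, where IDs are numerous and densely packed.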