RyanDeRose-TomTom opened 2 years ago
This is a known limitation of the current implementation. This isn't a problem when used with OSM data, because it doesn't have those large IDs, so it is unlikely to get fixed. If this is needed for your non-OSM use case and you are willing to put some money into it, I do contract development. Please contact me directly.
What version of osmium-tool are you using?
osmium version 1.13.2 (v1.13.2-4-gf0657f8)
libosmium version 2.17.1
Supported PBF compression types: none zlib lz4
What operating system version are you using?
Ubuntu 18.04.6 LTS
Tell us something about your system
8 CPU cores, 32 GB RAM
What did you do exactly?
I have a custom-made PBF file (it contains nodes and ways, but no relations) covering the area of Luxembourg, and I attempted to extract the region corresponding to one of the two zoom-level-8 Mercator tiles that contain Luxembourg: 5.625,49.83798245308484,7.03125,50.73645513701065
With my file (15 MB), this fills up my system's 32 GB of RAM on the first pass and is killed. I tried check-refs to validate the file, and the same thing happened.
Eventually I tried renumbering my file, after which extract runs quickly and correctly with very little RAM. I am using large node IDs, up to ~2e16 (which I can't really avoid for reasons), so I set out to reproduce the issue on an official extract:
```
wget https://download.geofabrik.de/europe/luxembourg-latest.osm.pbf
osmium renumber luxembourg-latest.osm.pbf -o renum-1e16.osm.pbf -s 10000000000000000
osmium renumber luxembourg-latest.osm.pbf -o renum-2e16.osm.pbf -s 20000000000000000
```

etc., with the following extract:

```
osmium extract -b 5.625,49.83798245308484,7.03125,50.73645513701065 renum-3e16.osm.pbf -o renum-extract.osm.pbf -O -v
```
These are all still reasonable numbers, since they are far below the upper limit of 2^63-1 ≈ 9.22e18. I found that the 1e16, 2e16, and 3e16 files used at most 9, 18, and 28 GB of RAM respectively (a linear increase in the starting node ID despite a constant node count), with anything larger being killed.
I can work around this by renumbering, but that prevents work from being done in parallel on different machines that don't have access to the same renumbering index.
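Not part of the original report, but one hypothetical way to parallelize without a shared renumbering index is to pre-assign each machine a disjoint --start-id block (the shard index and block size below are made-up parameters), which keeps all renumbered IDs small enough to stay clear of this limitation, assuming each shard holds fewer than BLOCK objects:

```shell
# Hypothetical sketch: disjoint --start-id blocks per machine.
# BLOCK and SHARD are illustrative values; SHARD would come from
# whatever assigns work to each machine.
BLOCK=1000000000        # 1e9 IDs reserved per machine
SHARD=3                 # this machine's index
START=$((SHARD * BLOCK + 1))
# The command is only printed here; running it requires osmium-tool
# and a per-shard input file.
echo "osmium renumber shard-$SHARD.osm.pbf -o shard-$SHARD-renum.osm.pbf -s $START"
```

Whether disjoint ranges are workable depends on knowing an upper bound on objects per shard, which may not hold for every pipeline.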