pszufe / OpenStreetMapX.jl

OpenStreetMap (*.osm) support for Julia 1.0 and up
MIT License
118 stars 24 forks

Importing large maps #43

Open blegat opened 2 years ago

blegat commented 2 years ago

Just started to import Belgium: belgium-latest.osm.bz2 is 750 MB, and the unpacked belgium-latest.osm takes 8.6 GB. I tried get_map_data("belgium-latest.osm"); it ran for some time until my computer ran out of its 16 GB of RAM and 6 GB of swap, and then the Julia process was killed. I'm wondering if it would be possible to load such a map given enough time, e.g. by storing things on disk; I suspect that's what graphhopper does with its _gh directory. Another solution that might help for medium-sized osm files would be to support .pbf. Is that feature planned or in the scope of OpenStreetMapX?

pszufe commented 2 years ago

One possible way to go would be to use osmfilter to obtain a reduced dataset (several pieces of information can be dropped without affecting the actual map content). Still, the performance bottleneck is the XML parser used by the library, and perhaps a different one could be tried. Thanks for the PR - I will review it and will be glad to help with that.
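[For readers finding this later: a sketch of such an osmfilter invocation, assuming a stock osmfilter install - check the osmfilter documentation for the flags supported by your version. Dropping authorship metadata alone already shrinks the file considerably:

osmfilter belgium-latest.osm --drop-author --drop-version -o=belgium-reduced.osm

Routing only needs the node/way/relation geometry and tags, so dropping author and version metadata should not affect the resulting MapData.]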

blegat commented 2 years ago

One possible way to go would be to use osmfilter to obtain a reduced dataset (several pieces of information can be dropped without affecting the actual map content).

Does MapData contain all the information that was in the .osm file? And does it contain information that is not useful for computing shortest paths?

Still, the performance bottleneck is the XML parser used by the library, and perhaps a different one could be tried.

Indeed, but I would have expected the memory complexity to be lower. Since we use the callback API, it should not have to represent the full XML document in memory at any time. So what is taking all the memory? With Andorra, I get:

julia> d = @time get_map_data("/home/blegat/Downloads/andorra-latest.osm", use_cache=false);
  1.666097 seconds (10.15 M allocations: 805.471 MiB, 19.43% gc time)

julia> d = @time get_map_data("/home/blegat/Downloads/andorra-latest.osm", use_cache=false);
  1.444814 seconds (10.15 M allocations: 805.486 MiB, 8.79% gc time)

julia> d = @time get_map_data("/home/blegat/Downloads/andorra-latest.osm", use_cache=false);
  1.638368 seconds (10.15 M allocations: 805.471 MiB, 18.30% gc time)

julia> d = @time get_map_data("/home/blegat/Downloads/andorra-latest.osm", use_cache=false);
  1.446085 seconds (10.15 M allocations: 805.477 MiB, 10.57% gc time)

julia> Base.summarysize(d)
5150378

julia> d = @time get_map_data("/home/blegat/Downloads/andorra-latest.osm");
[ Info: Read map data from cache /home/blegat/Downloads/andorra-latest.osm.cache
  0.058867 seconds (280.02 k allocations: 15.695 MiB)

andorra-latest.osm.bz2 is 3.4 MB, andorra-latest.osm is 37.3 MB, andorra-latest.osm.cache is 2.1 MB and andorra-latest.osm.pbf is 1.8 MB. The size used by d seems to be 5.15 MB, so for Belgium we could expect MapData to be around 1.2 GB (8.6 GB / 37.3 MB * 5.15 MB). That should fit in my RAM. Do you know what else was using so much memory that I ran out of RAM?

pszufe commented 2 years ago

There are two versions of the map parser: routing-oriented and raw.

Routing-oriented (does additional processing):

julia> sample_file = joinpath(dirname(pathof(OpenStreetMapX)),"..","test","data","reno_east3.osm");

julia> @btime get_map_data($sample_file;use_cache=false);
  93.730 ms (667494 allocations: 51.23 MiB)

Raw version (25% lighter):

julia> @btime OpenStreetMapX.parseOSM($sample_file);
  74.493 ms (576298 allocations: 42.94 MiB)

The code for collecting elements can be found at the beginning of the parseMap.jl file. You can see that only a subset of the nodes is parsed.

I actually ran the profiler:

ProfileView.@profview OpenStreetMapX.parseOSM(sample_file);

If you try running it you can see that around 20% of the time is spent in OSMX while the rest is in LibExpat. So perhaps one option would be to try a faster XML parser. Looking at the number of allocations, it seems that LibExpat.jl operates on Strings (rather than the much faster Symbols) and is inefficient for large files.

pszufe commented 2 years ago

One more test:

julia> dat = String(read(sample_file));

julia> @btime xp_parse($dat);
  53.914 ms (739840 allocations: 66.95 MiB)

Hence the XML parser is currently the major source of problems. At the time we started repairing https://github.com/tedsteiner/OpenStreetMap.jl, LibExpat.jl was the best we could get - there were not many great stream-based Julia XML parsers back then. Perhaps EzXML.jl could be a good new choice?

blegat commented 2 years ago

Yes, I think moving to EzXML might help.
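[For reference, a minimal sketch of what a streaming pass with EzXML's StreamReader could look like - hypothetical code that only counts elements; a real parser would build the node/way tables instead:

julia> using EzXML

julia> begin
           # Stream over the .osm file without materializing the whole XML tree.
           reader = open(EzXML.StreamReader, "andorra-latest.osm")
           counts = Dict{String,Int}()
           for typ in reader
               if typ == EzXML.READER_ELEMENT
                   name = EzXML.nodename(reader)
                   counts[name] = get(counts, name, 0) + 1
               end
           end
           close(reader)
           counts
       end

Because the reader visits one element at a time, peak memory should stay roughly constant regardless of file size, which is the property that matters for a country-sized extract.]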

pszufe commented 2 years ago

Hi, thanks for the .pbf support! I have also updated the tests since they were relying on the RNG, which changed in Julia 1.6. Now all tests pass locally on the current Julia version. I can also see that Travis migrated their servers from travis-ci.org to travis-ci.com. Somehow I am not able to change the unit testing mechanism from *.org to *.com: travis-ci.com seems not to be aware of OpenStreetMapX (I just do not see the project in the Travis list) - I still need to sort that out.

pszufe commented 2 years ago

I managed to reconfigure Travis and get everything to work, so now we have a new OpenStreetMapX release with .pbf support! Should you need other functionality for your project (perhaps with some support on my side), please let me know. Thank you.
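[For readers arriving via search: if the release follows the PR discussed above, the compressed format should be usable through the same entry point, something along the lines of (check the release notes for the exact API):

julia> d = get_map_data("andorra-latest.osm.pbf", use_cache=false);

The .pbf file is roughly 20x smaller than the unpacked .osm for the Andorra figures quoted earlier in this thread, which also skips the XML-parsing bottleneck entirely.]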

blegat commented 2 years ago

Thanks! I made a few changes that are mixed up in a branch of my fork: https://github.com/blegat/OpenStreetMapX.jl/tree/mixed_changes as well as in my fork of ProtoBuf.jl: https://github.com/blegat/ProtoBuf.jl/tree/mixed_changes. I have started making separate PRs for each repo, where each change is precisely motivated, to make it easier to review and to make sure each change is an improvement and doesn't break anything for existing users.