Load OSM into MapDB during graph build

abyrd commented 10 years ago

Already exists as a prototype at org.opentripplanner.osm.OSMMain. Potentially we don't even need to load the ways into the MapDB, we can stream through them when loading.

abyrd commented 10 years ago

I was thinking we might not even need a random-access store (in-memory Map or disk-backed MapDB), that we could load OSM in a streaming fashion by iterating over the input file more than once and using bitset-style ID trackers. We already do multiple passes to handle relations etc. The ID tracker is already implemented at org.opentripplanner.osm.NodeTracker though the name should be changed to reflect the fact that it will track node, way, and relation IDs.

Here's the idea: scan through one entity type at a time in the order relations-ways-nodes (the direction of inter-element dependencies) tracking those dependencies. Then scan through in the reverse order nodes-ways-relations converting all necessary entities directly into OTP graph objects (vertices and edges) without intermediate OSM model objects.

When scanning though ways, we'd track only those with tags we care about (usually highway=x), also tracking which nodes they reference and which nodes are intersections (using two cascading node trackers). This does avoid loading all nodes that are not in roads, which is important where most of OSM by volume is buildings (France, Netherlands). However, that filtering can easily be done when making the PBF input using external tools. Unfortunately we still need to load all intersection and non-intersection OSM nodes in highway=x ways, because the non-intersection nodes have coordinates used in a random-access fashion to make geometries.

So in the end, my conclusion is that we still need some random-access OSM store, and a MapDB keyed on OSM entity ID is probably the best way to do it. So I'm going to continue down that path.

@buma @bmander, tagging you on this since I discussed this with both of you and am interested in any comments you might have.

abyrd commented 9 years ago

Challenges in converting the OSM builder to use the new MapDB-based OSM loader:

The new OSM model does not keep the IDs in the entity objects themselves (for compactness). This generally just means passing an ID alongside the object, or storing/passing only IDs and fetching the object as needed (which should be fast enough because of MapDB's cache).
New OSM entity objects should be considered immutable because they are being pulled from a MapDB. Existing code changes them, communicating between different phases by adding tags. All of this tag modification should be replaced with local Maps. In certain places (intersecting areas with ways) nodes are inserted into the middle of ways. In those cases the ways must be written back out to the MapDB after being modified.

Advantages that still require a lot of rewriting:

The horrible reverse-order three-pass loading with callbacks after each phase becomes unnecessary.
Full review and refactor of all OSM loading code.

I have come to the conclusion that the random access approach is much easier to understand and maintain that streaming loading. If we reversed the order of entity typed in the incoming OSM streams it might be possible to load them cleanly with no random-access storage, but it's not feasible to change that ordering on all input files.

abyrd commented 9 years ago

Storing tags all concatenated in a string was clearly too slow. Updated to a List<P2<String>> implementation in 0302427359f8a3f66e5c97d22020ebd20134612c. This is comparable in speed to a Map<String, String> implementation (checked).

abyrd commented 9 years ago

Just for future reference, the biggest changes to adapt to the new OSM data model are:

We cannot use identity equality anywhere because the objects are being pulled out of a key-value store that may or may not return the same object instance each time you call get() on a given key. we have to refer to all entities by their long integer IDs.
The new OSM objects do not contain fields for their own IDs. This is to prevent serializing the IDs twice, once as the key and once in the object itself. This means frequently passing the ID and the object separately. I am considering adding a field for the ID and just making it transient, but then you have to do some magic to fill the ID back in when it's deserialized.
A lot of the tag-checking functions ("is this way a roundabout? does it forbid bicycling in the reverse direction?") are moved out of the data model objects into a tag library. This is because we want this OSM model + MapDB loader to eventually be a general purpose library, and much of the tag checking functionality we define is specific to routing. It wouldn't necessarily be bad to include it in the library, but it should not be cluttering up the basic Way and Node classes.

The place where these changes are the most horrendous to apply is in the area routing (visibility graph) code. One solution would be to fetch all the Nodes and Ways involved in areas and keep those in memory, then operate on the in-memory objects rather than the ones in the MapDB.

One of the main advantages of using MapDB is that memory consumption for OSM representation during graph build is essentially constant relative to PBF input size. Keeping areas in memory would cause graph building memory consumption to increase as input size increases, but probably not very much. So most of the street edge construction would be done by iterating over disk-backed maps, and only area edge construction would occur in memory.

Another solution is to greatly simplify area routing, e.g. considering only the outer rings of each area such that it's easy to rewrite but it seems like a shame to lose this detailed pedestrian routing.

barbeau commented 9 years ago

+1 for keeping detailed pedestrian routing. We're relying on this for smaller campus-based OTP deployments that include a lot of pedestrian areas/ways.

abyrd commented 9 years ago

This conversation on the MapDB mailing list deals with objects that contain their own keys as fields: https://groups.google.com/forum/#!topic/mapdb/Yr0r_ThHnNk

The conclusion is that you're better off A) keeping it normalized with the keys separate from the values, or B) making a Set instead of a Map and making the whole object you're storing the key. B) would involve some really odd redefinitions of equals and hashcode, and some odd query methods so I guess we'll just have to deal with keys and values being separated.

abyrd commented 9 years ago

Using custom serializers in MapDB yields a huge savings in disk space. This is because custom BTree key serializers apply delta-coding and variable width integer coding, and custom value serializers can use variable-width integer coding and delta-code node references as in PBF, though the latter requires adapting a chunk of the protobuf Java library. I'll see if Jan Kotek is interested in adding this functionality to MapDB itself.

abyrd commented 9 years ago

Note that the OSM data model and MapDB wrapper are being pulled out into a separate library so it can be shared with other projects: https://github.com/conveyal/osm-lib

abyrd commented 4 years ago

Closing since osm-lib has been a separate project for 5 years, now merged into Conveyal R5, but not used by OTP.

opentripplanner / OpenTripPlanner

Load OSM into MapDB during graph build #1512