abyrd closed 4 years ago
I was thinking we might not even need a random-access store (in-memory Map or disk-backed MapDB), that we could load OSM in a streaming fashion by iterating over the input file more than once and using bitset-style ID trackers. We already do multiple passes to handle relations etc. The ID tracker is already implemented at org.opentripplanner.osm.NodeTracker though the name should be changed to reflect the fact that it will track node, way, and relation IDs.
Here's the idea: scan through one entity type at a time in the order relations-ways-nodes (the direction of inter-element dependencies) tracking those dependencies. Then scan through in the reverse order nodes-ways-relations converting all necessary entities directly into OTP graph objects (vertices and edges) without intermediate OSM model objects.
When scanning through ways, we'd track only those with tags we care about (usually highway=x), also tracking which nodes they reference and which nodes are intersections (using two cascading node trackers). This does avoid loading all nodes that are not in roads, which is important where most of OSM by volume is buildings (France, Netherlands). However, that filtering can easily be done when making the PBF input using external tools. Unfortunately we still need to load all intersection and non-intersection OSM nodes in highway=x ways, because the non-intersection nodes have coordinates used in a random-access fashion to make geometries.
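To make the "two cascading node trackers" idea concrete, here is a minimal sketch. It is not the existing org.opentripplanner.osm.NodeTracker; the class names `IdTracker` and `WayNodeScanner` and the chunk size are assumptions made for illustration. A node seen for the first time goes into the first tracker; a node seen again cascades into the second, which then holds the intersections.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse bitset for 64-bit OSM IDs, split into fixed-size chunks. */
class IdTracker {
    private static final int CHUNK_BITS = 20; // 2^20 IDs per chunk (arbitrary choice)
    private static final long CHUNK_MASK = (1L << CHUNK_BITS) - 1;
    private final Map<Long, BitSet> chunks = new HashMap<>();

    /** Marks an ID and returns true if it had already been marked. */
    boolean add(long id) {
        BitSet chunk = chunks.computeIfAbsent(id >> CHUNK_BITS, k -> new BitSet());
        int bit = (int) (id & CHUNK_MASK);
        boolean seen = chunk.get(bit);
        chunk.set(bit);
        return seen;
    }

    boolean contains(long id) {
        BitSet chunk = chunks.get(id >> CHUNK_BITS);
        return chunk != null && chunk.get((int) (id & CHUNK_MASK));
    }
}

/** Hypothetical scanner holding the two cascading trackers. */
class WayNodeScanner {
    final IdTracker referenced = new IdTracker();    // nodes used by at least one kept way
    final IdTracker intersections = new IdTracker(); // nodes referenced more than once

    /** Call once per way whose tags we care about (e.g. highway=x). */
    void scanWay(long[] nodeRefs) {
        for (long ref : nodeRefs) {
            // A second sighting cascades the node into the intersection tracker.
            // Note: a node repeated within one closed way is also counted here.
            if (referenced.add(ref)) {
                intersections.add(ref);
            }
        }
    }
}
```

On the later node pass, `referenced.contains(id)` would decide whether a node is needed at all, and `intersections.contains(id)` whether it becomes a vertex. The trackers only answer membership questions, which is why the coordinates of non-intersection nodes still need some random-access store.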
So in the end, my conclusion is that we still need some random-access OSM store, and a MapDB keyed on OSM entity ID is probably the best way to do it. So I'm going to continue down that path.
@buma @bmander, tagging you on this since I discussed this with both of you and am interested in any comments you might have.
Challenges in converting the OSM builder to use the new MapDB-based OSM loader:
Advantages that still require a lot of rewriting:
I have come to the conclusion that the random-access approach is much easier to understand and maintain than streaming loading. If we reversed the order of entity types in the incoming OSM streams it might be possible to load them cleanly with no random-access storage, but it's not feasible to change that ordering on all input files.
Storing all tags concatenated in a single string was clearly too slow. Updated to a List<P2<String>> implementation in 0302427359f8a3f66e5c97d22020ebd20134612c. This is comparable in speed to a Map<String, String> implementation (checked).
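For readers unfamiliar with the list-of-pairs layout, a rough sketch follows. It does not reproduce the code in that commit: `Tag` and `TagList` are hypothetical stand-ins (OTP's own `P2` class is not used here), and the linear lookup is only one plausible way to query such a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical key-value pair, standing in for P2<String>. */
class Tag {
    final String key;
    final String value;

    Tag(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

/** Hypothetical tag container backed by a plain list of pairs. */
class TagList {
    private final List<Tag> tags = new ArrayList<>();

    void add(String key, String value) {
        tags.add(new Tag(key, value));
    }

    /** Linear scan over the pairs; returns null when the key is absent. */
    String get(String key) {
        for (Tag t : tags) {
            if (t.key.equals(key)) {
                return t.value;
            }
        }
        return null;
    }

    boolean hasTag(String key, String value) {
        return value.equals(get(key));
    }
}
```

A check such as `tags.hasTag("highway", "residential")` then needs no string parsing, and a linear scan stays cheap as long as an entity carries only a handful of tags.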
Just for future reference, the biggest changes to adapt to the new OSM data model are:
The place where these changes are the most horrendous to apply is in the area routing (visibility graph) code. One solution would be to fetch all the Nodes and Ways involved in areas and keep those in memory, then operate on the in-memory objects rather than the ones in the MapDB.
One of the main advantages of using MapDB is that memory consumption for OSM representation during graph build is essentially constant relative to PBF input size. Keeping areas in memory would cause graph building memory consumption to increase as input size increases, but probably not very much. So most of the street edge construction would be done by iterating over disk-backed maps, and only area edge construction would occur in memory.
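A sketch of what "fetch the area entities and keep them in memory" might look like is below. Everything in it is an assumption for illustration: the stores are modelled as plain `java.util.Map` instances (MapDB collections implement that interface), ways are reduced to arrays of node references, nodes to `{lat, lon}` arrays, and `AreaCache` and `isAreaWay` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical in-memory copy of only those ways and nodes that belong to areas. */
class AreaCache {
    final Map<Long, long[]> ways = new HashMap<>();    // way ID -> node references
    final Map<Long, double[]> nodes = new HashMap<>(); // node ID -> {lat, lon}

    /**
     * Copies area ways and their nodes out of the (possibly disk-backed) stores.
     * isAreaWay is a stand-in for whatever tag test identifies an area.
     */
    static AreaCache load(Map<Long, long[]> wayStore,
                          Map<Long, double[]> nodeStore,
                          Predicate<Long> isAreaWay) {
        AreaCache cache = new AreaCache();
        for (Map.Entry<Long, long[]> entry : wayStore.entrySet()) {
            if (!isAreaWay.test(entry.getKey())) {
                continue; // ordinary street ways stay on disk
            }
            cache.ways.put(entry.getKey(), entry.getValue());
            for (long ref : entry.getValue()) {
                double[] coord = nodeStore.get(ref);
                if (coord != null) {
                    cache.nodes.put(ref, coord);
                }
            }
        }
        return cache;
    }
}
```

The visibility graph code could then work against `cache.ways` and `cache.nodes` as ordinary in-memory maps, while the memory cost grows only with the number of area entities rather than with the whole input.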
Another solution is to greatly simplify area routing, e.g. considering only the outer rings of each area, so that it's easy to rewrite, but it seems like a shame to lose this detailed pedestrian routing.
+1 for keeping detailed pedestrian routing. We're relying on this for smaller campus-based OTP deployments that include a lot of pedestrian areas/ways.
This conversation on the MapDB mailing list deals with objects that contain their own keys as fields: https://groups.google.com/forum/#!topic/mapdb/Yr0r_ThHnNk
The conclusion is that you're better off A) keeping it normalized with the keys separate from the values, or B) making a Set instead of a Map and making the whole object you're storing the key. B) would involve some really odd redefinitions of equals and hashcode, and some odd query methods so I guess we'll just have to deal with keys and values being separated.
Using custom serializers in MapDB yields a huge savings in disk space. This is because custom BTree key serializers apply delta-coding and variable width integer coding, and custom value serializers can use variable-width integer coding and delta-code node references as in PBF, though the latter requires adapting a chunk of the protobuf Java library. I'll see if Jan Kotek is interested in adding this functionality to MapDB itself.
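To illustrate why delta-coding plus variable-width integers saves space for node references, here is a standalone sketch. It does not use the MapDB serializer API or the protobuf library; `DeltaVarint` is a hypothetical class that hand-rolls the same zigzag varint scheme PBF uses for signed deltas.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical codec: delta-code a list of node refs, then write zigzag varints. */
class DeltaVarint {

    static byte[] encode(long[] refs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long ref : refs) {
            long delta = ref - prev;
            prev = ref;
            // Zigzag maps small negative deltas to small unsigned values.
            long zz = (delta << 1) ^ (delta >> 63);
            // Emit 7 bits per byte, high bit set while more bytes follow.
            while ((zz & ~0x7FL) != 0) {
                out.write((int) ((zz & 0x7F) | 0x80));
                zz >>>= 7;
            }
            out.write((int) zz);
        }
        return out.toByteArray();
    }

    static long[] decode(byte[] bytes, int count) {
        long[] refs = new long[count];
        int pos = 0;
        long prev = 0;
        for (int i = 0; i < count; i++) {
            long zz = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                zz |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            long delta = (zz >>> 1) ^ -(zz & 1); // undo zigzag
            prev += delta;
            refs[i] = prev;
        }
        return refs;
    }
}
```

For the four references 1000000000, 1000000001, 1000000005, 999999990 this produces 8 bytes instead of the 32 that fixed-width longs would take, because consecutive node IDs in a way tend to be close together and only the first delta is large.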
Note that the OSM data model and MapDB wrapper are being pulled out into a separate library so it can be shared with other projects: https://github.com/conveyal/osm-lib
Closing since osm-lib has been a separate project for 5 years, now merged into Conveyal R5, but not used by OTP.
Already exists as a prototype at org.opentripplanner.osm.OSMMain. Potentially we don't even need to load the ways into the MapDB; we can stream through them when loading.