create a preprocessing script

carocad commented 7 years ago

Currently the server needs to download, parse and process the OSM file on boot.

The problem with this approach is that the parsing and processing steps are repeated every time that a server boots even if the result of that parsing is the same.

A better approach would be to preprocess the files once and then use the resulting file as a base.

An example approach is here: https://github.com/Project-OSRM/osrm-backend#using-docker

carocad commented 7 years ago

Here are the results of some experiments with writing the network to a file, to avoid preprocessing it on every boot:

JSON format: 42.1 MB, compressed: 8.7 MB
Smile format: 25 MB, compressed: 10.4 MB

write time: ~1.2 seconds
read time (until last element): ~1.2 seconds

Here is the code used in dev.clj:

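;; Write the in-memory network to disk as JSON and time it.
(time (with-open [out (clojure.java.io/writer "resources/saarland.json")]
        (cheshire/generate-stream @(:network (:grid system)) out)))

;; Read it back; `last` forces the whole stream to be parsed before the reader closes.
(time (with-open [in (clojure.java.io/reader "resources/saarland.json")]
        (last (cheshire/parse-stream in))))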
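
For anyone who wants to repeat this kind of measurement outside the REPL, here is a minimal Python sketch of the same idea (Python only because it is the language suggested for the script further down). It is not the code that produced the numbers above: `measure_json` and `toy_network` are hypothetical names, and the toy data is only a stand-in for the real network.

```python
import gzip
import json
import time


def measure_json(network):
    """Serialize `network` to JSON and report raw/gzip sizes and timings."""
    start = time.perf_counter()
    raw = json.dumps(network).encode("utf-8")
    write_seconds = time.perf_counter() - start

    # Compress the serialized bytes in memory to compare sizes.
    compressed = gzip.compress(raw)

    start = time.perf_counter()
    restored = json.loads(raw.decode("utf-8"))
    read_seconds = time.perf_counter() - start

    return {
        "raw_bytes": len(raw),
        "gzip_bytes": len(compressed),
        "write_seconds": write_seconds,
        "read_seconds": read_seconds,
        "round_trip_ok": restored == network,
    }


# Toy stand-in for the real network (assumption): node id -> coordinates.
toy_network = {
    str(i): {"lat": 49.0 + i * 1e-4, "lon": 7.0 + i * 1e-4} for i in range(1000)
}
report = measure_json(toy_network)
print(report)
```

The absolute numbers depend entirely on the data, so they are only useful for comparing formats against each other on the same input.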

carocad commented 7 years ago

On a similar matter: Java has native support for Gzip and Zip files. Node.js, on the other hand, requires a library for processing them.

My point is that it would be nice to use a format that works in both environments, so that we could create the file in one environment (such as a JS lambda function) and then read it in another (the JVM).

The downside is that the files are a bit larger.

Furthermore, this would reduce the dependencies of the project, which is never bad ;)

carocad commented 7 years ago

@mehdisadeghi could you take care of this as well :) ?

After several experiments and some research I came to the conclusion that the best way to remain inter-operable, re-use most of our current code and still gain performance is to keep the file in the OSM format.

Problem description:

Starting a server takes a long time due to the preprocessing step.
The preprocessing step takes a long time due to:
- the large size of OSM files
- the way the XML elements are ordered in OSM files: nodes, then ways, then relations, i.e. from granular to abstract
- having to read the file twice: once to get the ways, and then again to get only the nodes referenced by the ways that we are interested in, instead of all of them. Here is a quick overview of the filtering results.
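
To make the two-pass problem concrete, here is a minimal Python sketch of such a read. It is an illustration, not the project's actual Clojure reader: `read_two_pass`, `open_osm` and `keep_way` are hypothetical names, and filtering on the `highway` tag is only an assumed example of "ways that we are interested in".

```python
import io
import xml.etree.ElementTree as ET


def read_two_pass(open_osm, keep_way):
    """Pass 1 collects the wanted ways; pass 2 collects only the nodes they use."""
    ways = {}
    wanted_nodes = set()
    # Pass 1: nodes come first in the file, but we cannot filter them yet,
    # so we skip them and only look at ways.
    with open_osm() as source:
        for _, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "way":
                tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
                if keep_way(tags):
                    refs = [nd.get("ref") for nd in elem.findall("nd")]
                    ways[elem.get("id")] = refs
                    wanted_nodes.update(refs)
                elem.clear()  # free the children of the finished way
    # Pass 2: re-read the file from the start, now knowing which nodes matter.
    nodes = {}
    with open_osm() as source:
        for _, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "node" and elem.get("id") in wanted_nodes:
                nodes[elem.get("id")] = (
                    float(elem.get("lat")),
                    float(elem.get("lon")),
                )
    return ways, nodes


# Tiny hand-written sample in the usual OSM order: nodes first, then ways.
SAMPLE = b"""<osm>
  <node id="1" lat="49.0" lon="7.0"/>
  <node id="2" lat="49.1" lon="7.1"/>
  <node id="3" lat="49.2" lon="7.2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
  <way id="11"><nd ref="3"/><tag k="building" v="yes"/></way>
</osm>"""

ways, nodes = read_two_pass(
    lambda: io.BytesIO(SAMPLE),           # reopen the "file" for each pass
    lambda tags: "highway" in tags,       # assumed filter, for illustration only
)
print(ways, nodes)
```

The second pass exists only because the nodes appear before the ways that decide whether they are needed.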

The idea behind this issue is to tackle all of those problems simultaneously with a simple script (Python?):

reduce the starting time by using smaller OSM files
reduce the OSM file size by removing in advance the information that we are not interested in
reorder the content of OSM files such that ways come before nodes, thus allowing a single-pass read
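
The three goals above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not an agreed design: `preprocess`, `read_single_pass` and `keep_way` are hypothetical names, the `highway` filter is only an example, and the whole document is loaded into memory, which would need a streaming variant for large extracts.

```python
import xml.etree.ElementTree as ET


def preprocess(osm_xml, keep_way):
    """Return OSM XML holding only the wanted ways, followed by the nodes they reference."""
    root = ET.fromstring(osm_xml)
    ways = []
    for way in root.findall("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if keep_way(tags):
            ways.append(way)
    wanted = {nd.get("ref") for way in ways for nd in way.findall("nd")}
    nodes = [n for n in root.findall("node") if n.get("id") in wanted]

    out = ET.Element("osm", dict(root.attrib))
    out.extend(ways)   # ways first ...
    out.extend(nodes)  # ... then only the nodes those ways need
    return ET.tostring(out, encoding="unicode")


def read_single_pass(preprocessed_xml):
    """Consume the reordered file with one loop over its elements, in document order."""
    ways, nodes, wanted = {}, {}, set()
    for elem in ET.fromstring(preprocessed_xml):
        if elem.tag == "way":
            refs = [nd.get("ref") for nd in elem.findall("nd")]
            ways[elem.get("id")] = refs
            wanted.update(refs)
        elif elem.tag == "node" and elem.get("id") in wanted:
            nodes[elem.get("id")] = (float(elem.get("lat")), float(elem.get("lon")))
    return ways, nodes


# Tiny hand-written sample in the usual OSM order: nodes first, then ways.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="49.0" lon="7.0"/>
  <node id="2" lat="49.1" lon="7.1"/>
  <node id="3" lat="49.2" lon="7.2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
  <way id="11"><nd ref="3"/><tag k="building" v="yes"/></way>
</osm>"""

slim = preprocess(SAMPLE, lambda tags: "highway" in tags)
order = [elem.tag for elem in ET.fromstring(slim)]
ways, nodes = read_single_pass(slim)
print(order, ways, nodes)
```

Because the output is still OSM XML, it stays readable by other OSM-aware code, although tools that expect nodes before ways may need the original ordering.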

I tried a small sketch of this here, but it proved quite difficult in Clojure since it requires in-place mutation, which is troublesome there. I used a setup variable to configure which attributes and which elements should stay in the file. It doesn't need to be like that initially; I was simply trying to make it flexible :)
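
As an illustration of the "setup variable" idea, here is a minimal Python sketch. It is not the Clojure sketch mentioned above: `SETUP` and `strip` are hypothetical names and the chosen elements and attributes are only an assumed example. It builds a new, smaller tree instead of mutating the original in place.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: element name -> attributes that should stay.
# Elements that are not listed here are dropped entirely.
SETUP = {
    "osm": {"version"},
    "node": {"id", "lat", "lon"},
    "way": {"id"},
    "nd": {"ref"},
    "tag": {"k", "v"},
}


def strip(elem, setup):
    """Return a copy of `elem` keeping only the configured children and attributes."""
    allowed = setup.get(elem.tag, ())
    kept = ET.Element(
        elem.tag, {k: v for k, v in elem.attrib.items() if k in allowed}
    )
    for child in elem:
        if child.tag in setup:
            kept.append(strip(child, setup))
    return kept


RAW = """<osm version="0.6" generator="example">
  <node id="1" lat="49.0" lon="7.0" user="alice" timestamp="2017-01-01T00:00:00Z"/>
  <relation id="5"/>
</osm>"""

slim = strip(ET.fromstring(RAW), SETUP)
print(ET.tostring(slim, encoding="unicode"))
```

Changing what stays in the file then only requires editing `SETUP`, not the traversal code.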

mehdisadeghi commented 7 years ago

@carocad I have to give it some thought. Of course we can use Python to do this, but I wonder whether we should try to come up with a more abstract, high-level design to handle preprocessing and feeding data into the routing application. I'll start with OSM2GTFS and will follow up with an initial design for this one too.

carocad commented 7 years ago

@mehdisadeghi here are my 2 cents to this discussion :)

After trying several preprocessing options like JSON, Smile and EDN, I realized that I was reinventing the wheel.

Certainly it is possible to come up with a design that is more tailored to our specific needs and that is much smaller than the original one. However, in order to do that it would be necessary to have a specification for the shape of the file: which fields are required and which are optional, what to do in case a field is not present, etc.

I looked a bit into the approaches of GraphHopper, OSRM and TripPlanner, and most of them use their own representation for OSM files, which, although valid, leads them to create their own set of tools for tackling the problems that arise whenever a custom format is created.

I would prefer that we avoid duplicating the work of others and also try to maximise the reusability of our data (for example, if someone wants to plot the input file on a map or perform analytics on it). On that topic I actually found this tool, which I think solves all of our problems :)

boring but efficient solution :D

Let me know your thoughts once you finish the GTFS conversion and are more used to the OSM format.

hiposfer / kamal

create a preprocessing script #52