pelias / openstreetmap

Import pipeline for OSM into Pelias
MIT License

Import OSM Rels #81

Closed riordan closed 5 years ago

riordan commented 8 years ago

OSM has 3 data types: nodes, ways, and relations (rels).

We do not import rels at all. Rels are very important and useful. They have data applied to aggregate complex geometries, or networks, and have their own schema definitions to define the magic powers that each kind of rel can confer.

Do we really need this?

Autocomplete: [screenshot: pelias_geocoder_leaflet_plugin]

Search: [screenshot: pelias_geocoder_leaflet_plugin]

[screenshot: openstreetmap relation "10 downing street" (1879842)]

Yes.

missinglink commented 8 years ago

related: https://github.com/conveyal/vanilla-extract/wiki/Vex-Format

riordan commented 8 years ago

Related: https://github.com/whosonfirst/osm-tools

tuukka commented 7 years ago

Any ideas how we could move this forward?

tuukka commented 7 years ago

Should the relations be processed by pbf2json and is the approach started in a WIP pull request there still the way forward? https://github.com/pelias/pbf2json/pull/24

tuukka commented 7 years ago

@dianashk Could you help us here?

dianashk commented 7 years ago

Hey @tuukka, sorry I missed this earlier. We haven't done any planning of this functionality recently, so we can't provide an estimate on implementation. However, you've found the right experimental branch that @missinglink began working on a while back, which could get you close to this. If you have some cycles, have a look at the code and see if you can get it running. We're here to help if you get stuck.

missinglink commented 7 years ago

hey @tuukka, so there are two parts to this.

Firstly, there is storing the data required to denormalize the relations. As with the ways in pbf2json, we could store the dependent node/way/relation records in leveldb or another data store. The only issues with this are: 1. multiple passes over the PBF file (probably a minimum of 4 passes in order to recurse the dependency tree and load the dependents), and 2. the size of the data on disk (probably well upwards of 100GB).
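To make the multi-pass cost concrete, here is a minimal sketch (not pbf2json code). It assumes a toy in-memory list standing in for the PBF file and a plain dict standing in for leveldb; the record layout, the IDs and the `collect_dependencies` helper are all invented for illustration.

```python
# Toy stand-in for a PBF file: each "pass" is one full scan of this list.
# Record layout and IDs are hypothetical. The sub-relation (101) is stored
# before its parent (100), which costs an extra pass.
ELEMENTS = [
    {"type": "node", "id": 1},
    {"type": "node", "id": 2},
    {"type": "node", "id": 3},
    {"type": "way", "id": 10, "refs": [("node", 1), ("node", 2)]},
    {"type": "way", "id": 11, "refs": [("node", 2), ("node", 3)]},
    {"type": "relation", "id": 101, "refs": [("way", 11)]},
    {"type": "relation", "id": 100, "refs": [("way", 10), ("relation", 101)]},
]


def collect_dependencies(elements, root):
    """Scan `elements` repeatedly until every dependency of `root` is stored."""
    wanted = {root}  # keys we still have to find
    store = {}       # stand-in for leveldb: (type, id) -> record
    passes = 0
    while wanted:
        passes += 1
        found_any = False
        for el in elements:  # one sequential pass over the "file"
            key = (el["type"], el["id"])
            if key not in wanted:
                continue
            store[key] = el
            wanted.discard(key)
            found_any = True
            # Queue members that have not been stored yet.
            for ref in el.get("refs", []):
                if ref not in store:
                    wanted.add(ref)
        if not found_any:
            break  # the remaining refs are missing from the file
    return store, passes


store, passes = collect_dependencies(ELEMENTS, ("relation", 100))
print(passes, len(store))
```

With this toy ordering the loop needs four scans: one for the relation, one for its way and sub-relation, one for the nodes and the sub-relation's way, and one more for the last node.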

The second issue is assembling the parts of the relation (i.e. the nodes, ways and sub-relations) into a single geometric entity (in most cases a multipolygon). This is actually not trivial to do because of how relations are modeled in OSM (most notably, how to assemble and join relation members with role 'outer' and 'inner' in the correct winding order). Basically it's complex, error-prone and generally sucky.
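The simplest part of that assembly, joining member ways end-to-end into closed rings, could be sketched roughly as follows. This is an illustration only: ways are assumed to be plain lists of node IDs, `assemble_rings` is a hypothetical helper, and it deliberately ignores roles, winding order, and touching or self-intersecting rings, which is where most of the real pain lives.

```python
def assemble_rings(ways):
    """Join ways (lists of node ids) end-to-end into closed rings.

    Returns (rings, leftovers); leftovers are fragments that never closed.
    Chains are only extended at their tail, which is enough for simple rings.
    """
    remaining = [list(w) for w in ways if len(w) >= 2]
    rings, leftovers = [], []
    while remaining:
        chain = remaining.pop()
        extended = True
        while chain[0] != chain[-1] and extended:
            extended = False
            for i, way in enumerate(remaining):
                if way[0] == chain[-1]:     # way continues the chain as-is
                    chain += way[1:]
                elif way[-1] == chain[-1]:  # way is stored in reverse
                    chain += way[-2::-1]
                else:
                    continue
                del remaining[i]
                extended = True
                break
        (rings if chain[0] == chain[-1] else leftovers).append(chain)
    return rings, leftovers


# Three ways that share end nodes; the middle one runs "backwards".
rings, leftovers = assemble_rings([[1, 2, 3], [5, 4, 3], [5, 6, 1]])
print(rings, leftovers)
```

A fuller version would then classify rings as outer or inner and normalise their winding order, for example by checking the sign of the ring's area.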

I have been working on a replacement for pbf2json on-and-off for a while now and it's actually in a state where I could publish it; it just needs some tests. The main differences are: 1. pbf2json was my first Golang project and so it's messy, 2. the new system uses a faster parser (it can parse the planet file in ~20 mins), 3. it is much more flexible in how it operates.

Most recently I have been experimenting with indexing the PBF file and then providing a random-access interface to the file, so you would just say "give me node 100" and it will seek the file and extract the entities you want, with no need to load any data into another data store like leveldb.

I think this would be the ideal approach for dealing with relations, as it's trivial to recurse down its members and load all its dependencies from the file into RAM (it's still a little slow, and can take a second or so for large country-sized relations).
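The index-then-seek idea can be sketched on a toy file. This is not the unpublished library: it assumes newline-delimited JSON records instead of real PBF blobs (a real index would more likely map ID ranges to blob offsets), and `build_index` and `fetch` are hypothetical names.

```python
import io
import json


def build_index(fh):
    """One sequential scan: map (type, id) -> byte offset of the record."""
    index = {}
    fh.seek(0)
    while True:
        offset = fh.tell()
        line = fh.readline()
        if not line:
            break
        rec = json.loads(line)
        index[(rec["type"], rec["id"])] = offset
    return index


def fetch(fh, index, key):
    """Seek straight to one record instead of rescanning the whole file."""
    fh.seek(index[key])
    return json.loads(fh.readline())


# In-memory stand-in for a file on disk; the records are made up.
records = [
    {"type": "node", "id": 100, "lat": 60.1674, "lon": 24.9426},
    {"type": "way", "id": 7, "refs": [100]},
]
fh = io.BytesIO("".join(json.dumps(r) + "\n" for r in records).encode())
index = build_index(fh)

# "give me node 100"
node = fetch(fh, index, ("node", 100))
print(node)
```

Recursing over a relation then becomes a series of `fetch` calls for its members, with only the index held in memory.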

The issue of assembling is not really well solved. For the osm-boundaries repo I used a node library called osmtogeojson, which works well but is a memory hog, and the code is almost unreadable.

There are other programs which can assemble geojson or WKT geometries from OSM elements, such as osm2pgsql and possibly spatialite-tools or maybe ogr2ogr.

I tried writing it myself but got so frustrated with all the edge cases and quirks that I gave up. There is a wiki outlining the assembly algorithm if you're interested. Beware of role=subarea.

so.. the tl;dr is that there has been some research going on behind the scenes about how to do this in a clean/nice way but it hasn't been considered as a super high priority.

I would be happy to publish the library I mentioned and help out if you'd like to take a crack at it yourself :)

tuukka commented 7 years ago

@missinglink Thanks for the update and the good news!

I wasn't considering all the issues caused by the size of the planet file as our use case is Finland. I got your branch running then hit the problem of multipolygon assembly. Until now, my plan was to fake it by calculating the centroid of the multipolygon as the centroid of the outer with the largest area.
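A minimal sketch of that shortcut, assuming the outer rings are already assembled as closed lists of (lon, lat) pairs and treating the coordinates as planar, which is only an approximation. The function names are made up for illustration.

```python
def signed_area(ring):
    """Shoelace formula; positive for counter-clockwise rings."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )


def ring_centroid(ring):
    """Area-weighted centroid of a closed ring (first point == last point).

    Raises ZeroDivisionError for degenerate rings with zero area.
    """
    a = signed_area(ring)
    cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        cross = x1 * y2 - x2 * y1
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (6 * a), cy / (6 * a)


def approximate_centroid(outer_rings):
    """Centroid of the outer ring with the largest absolute area."""
    largest = max(outer_rings, key=lambda r: abs(signed_area(r)))
    return ring_centroid(largest)


big = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
small = [(10, 10), (11, 10), (11, 11), (10, 11), (10, 10)]
print(approximate_centroid([small, big]))  # (2.0, 2.0)
```

Inner rings and the smaller outers are ignored entirely, so the point can land in a hole or outside a concave shape.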

Are you saying that the new library already does the PBF indexing or that it might be the next step?

I was familiar with the assembly algorithm but not role=subarea (of type=boundary, right?). Why is it problematic - is it because it can cause a deep hierarchy of relations?

missinglink commented 7 years ago

It's a nasty gotcha: role=subarea refers to its parent, so recursing down the sub-relations and assuming this reference is a child will result in a cyclic dependency graph :boom:
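One way to guard against this when recursing, sketched under the assumption that relations are available as a simple mapping of ID to (member type, member ID, role) tuples. The table below and the `collect_ways` helper are invented for illustration.

```python
# Hypothetical relation table: id -> [(member_type, member_id, role), ...].
# Relations 1 and 2 point at each other via role=subarea, and 1 and 3 form
# a second cycle that does not involve subarea at all.
RELATIONS = {
    1: [("way", 10, "outer"), ("relation", 2, "subarea"), ("relation", 3, "")],
    2: [("way", 20, "outer"), ("relation", 1, "subarea")],
    3: [("way", 30, "outer"), ("relation", 1, "")],
}


def collect_ways(relations, rel_id, skip_roles=("subarea",), seen=None):
    """Return the way ids reachable from rel_id without looping forever."""
    seen = set() if seen is None else seen
    if rel_id in seen:
        return []  # already visited: break the cycle
    seen.add(rel_id)
    ways = []
    for member_type, member_id, role in relations.get(rel_id, []):
        if role in skip_roles:
            continue  # never follow subarea references
        if member_type == "way":
            ways.append(member_id)
        elif member_type == "relation":
            ways.extend(collect_ways(relations, member_id, skip_roles, seen))
    return ways


print(collect_ways(RELATIONS, 1))
```

Skipping the role handles the known gotcha, while the `seen` set keeps the recursion finite for any other cycle in the data.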

If you're just interested in Finland you can pull them out of my https://github.com/missinglink/osm-boundaries repo and use a centroid algo to find a center of mass, see: https://github.com/pelias/pbf2json/issues/22

Let me know if you get stuck or need some help.

missinglink commented 7 years ago

I included the label centroids in the geojson files, eg: https://raw.githubusercontent.com/missinglink/osm-boundaries/master/data/000/034/914/000034914.geojson

"labels": [
  {
    "id": "node/1372477580",
    "role": "admin_centre",
    "coordinates": [
      24.942566,
      60.167408
    ]
  }
]

tuukka commented 7 years ago

@missinglink I'm sorry I was being unclear above: I meant that in my case, it's enough to process the smaller finland.osm.pbf instead of the big planet.osm.pbf. After pbf2json, we load the data into Pelias (a Finland-wide instance forming a part of our fully open platform at https://digitransit.fi/ ). The centroid is currently enough for us, which means I could skip constructing the multipolygon geometry from the participating ways.

Then my question is what would be good enough for you to accept as a pull request: Do you need the multipolygon geometry? Do you need the real centroid? Do you need it to work on full planet.osm.pbf? In any case, I would still be interested in your new pbf2json replacement.

missinglink commented 7 years ago

hey @tuukka maybe we should do a call so I can understand.

we use pbf2json to process the whole planet, so any code that only worked on smaller extracts would need to be behind some sort of option flag with documentation in the readme about how it works and the caveats.

the part I'm still not understanding is specifically which relations are important to you?

I exported all the Finland relations to json http://missinglink.geo.s3.amazonaws.com/fin.relations.json from an old country extract I had from Aug 2016.

$ wc -l fin.relations.json 
23360 fin.relations.json

$ grep -v multipolygon fin.relations.json | wc -l
9250

multipolygons make up most of the data (23360 - 9250 = 14110 records). The rest is things like train routes and bus routes, which are complex multiline strings that would also need to be assembled in order to compute their centroids.

could you take me through your thought process for the proposed PR? I'm definitely interested, just trying to understand the process of how it would work and what sort of entities you would be looking to extract.

missinglink commented 7 years ago

oh BTW, last night I figured out a more elegant way of assembling the complex relations, using sqlite and a recent version of spatialite.

in the docs http://www.gaia-gis.it/gaia-sins/spatialite-sql-4.4.0.html there is a function called BuildArea which works great at taking an arbitrary collection of lines and assembling them into a polygon/multipolygon:

SELECT AsText( BuildArea(
  GeomFromText('MULTILINESTRING(
    (10 10, 20 20, 30 30),
    (30 30, 40 40, 20 40, 10 10)
  )')
));

POLYGON((30 30, 20 20, 10 10, 20 40, 40 40, 30 30))

this can then be exported from sqlite3 using SELECT AsGeoJson()

I tested it on a Berlin borough and it worked perfectly :)