mapzen / metro-extracts

DEPRECATED. See readme for alternative ways to get "city-sized chunks" of OpenStreetMap data
ISC License
25 stars 27 forks source link

I saw something: broken boundary relations #273

Open missinglink opened 7 years ago

missinglink commented 7 years ago

heya,

I've been trying to extract boundary:administrative data from our PBF metro extracts and found that either all, or some of the members are missing from the relation.

An example would be http://www.openstreetmap.org/relation/1638991 which contains 43 outer members.

In the old metro extracts (wellington_new-zealand.osm.pbf) only 25 of the members were referenced, so unfortunately the polygon was invalid due to it not being able to close properly using the data available in the PBF file.

In the new metro extracts, all the members have been removed except for the admin_centre node.

The wiki for boundary says that a relation marked as boundary:* should have at least one member with the role of outer.

I understand that including all the member data in each extract would frequently include a bunch of data outside the bbox and could potentially recurse to include the whole planet file!!

Is it possible to do one of the following?

migurski commented 7 years ago

Good question; I’m a bit removed from the underlying OSM mechanics of this. I believe @heffergm created the underlying extract service, and would be able to answer any questions about possible differences between these new extracts and the older, weekly ones.

heffergm commented 7 years ago

All the osmconvert code amongst the two systems is identical in function. All the ODES code was heavily borrowed (read largely cut/paste) from the old metroextractor system.

... except the hash-memory fix that snuck into the old system didn't get ported over, so I'll add that now. It may have an impact, although I'm somewhat doubtful given the size of the extracts in question. But if you want to re-cut an extract @missinglink, let me know if it looks any different.

We've tried all kinds permutations of messing with including/excluding nodes. There are fundamental, systemic reasons we can't allow recursion (OOM) and at the end of the day every time we seem to choose one way, we get complaints about having preferred the other. I'm happy to change it to anything (provided it still functions) but at the rate it's happened in the past, I expect someone to ask me to switch it back (or to something else) within a few months.

missinglink commented 7 years ago

here some hard data:

for sake of simplicity, in my examples I used way elements which are part of a relation in the file.

the problem effects any element with member elements, so both way and relation element types.

old extracts

$ wget https://s3.amazonaws.com/metro-extracts.mapzen.com/new-york_new-york.osm.pbf

$ ls -lah new-york_new-york.osm.pbf 
-rw-rw-r-- 1 peter peter 79M Aug 13 14:53 new-york_new-york.osm.pbf

$ shasum new-york_new-york.osm.pbf 
428d95b58dad852bee12c299d8c7e0eaa9ab50b3  new-york_new-york.osm.pbf

I found the following ways using a script I wrote (it's only looking at ways which are a member of a relation with a single node ref):

way 24315304 only has 1 ref
way 242967138 only has 1 ref
way 37788892 only has 1 ref
way 242967138 only has 1 ref
way 37788892 only has 1 ref
way 37561688 only has 1 ref
way 37788892 only has 1 ref
way 37788892 only has 1 ref
way 242967138 only has 1 ref
way 37788892 only has 1 ref
way 242967138 only has 1 ref
way 37788892 only has 1 ref
way 24315304 only has 1 ref

which can be confirmed with osmconvert:

$ osmconvert --out-osm new-york_new-york.osm.pbf | grep -A6 24315304
    <way id="24315304" version="12" timestamp="2013-10-05T18:00:30Z" changeset="18199919" uid="113450" user="nfgusedautoparts">
        <nd ref="263599185"/>
        <tag k="source" v="county_import_v0.1_20080508235456"/>
        <tag k="boundary" v="administrative"/>
        <tag k="admin_level" v="6"/>
        <tag k="border_type" v="county"/>
    </way>

compared to the source http://www.openstreetmap.org/way/24315304 you can see that this way is having members truncated (1 node ref vs. 4 refs)

this is a similar issue to what I reported about the relations but not as severe, for road segments it's not a huge deal but for polygons it converts them to linestrings or cuts holes in them which is completely corrupting the data.

ways with a single node should definitely be removed from the extract as they are essentially just a single point, i'm not sure how GIS systems will handle this sort of thing, in this case it's actually supposed to be a county polygon.

new extracts

I created a custom extract 8ecf227e2bae by dragging the bbox to visually match the bbox of the 'popular extract' and it contained the same amount of errors as the old extract above.

geofabrik

I also tried running my script against a small extract from geofabrik and found that they are also including relation members in their country extracts but do not include the corresponding way entries in the file:

$ wget http://download.geofabrik.de/europe/andorra-latest.osm.pbf

$ shasum andorra-latest.osm.pbf 
56c4d3cc54d7e9fcda2c8706c00127691487d132  andorra-latest.osm.pbf
way 45356002 not found in extract
way 45356023 not found in extract
way 45356009 not found in extract
... many more
$ osmconvert --out-osm andorra-latest.osm.pbf | grep 45356002
        <member type="way" ref="45356002" role="outer"/>
        <member type="way" ref="45356002" role="outer"/>
        <member type="way" ref="45356002" role="outer"/>

... 3x references in a relation but no actual way data in the file
missinglink commented 7 years ago

I hate to be the bearer of bad news but it seems that both the mapzen and geofabrik extracts contain truncated or corrupted entities.

There is also a bunch of data that could be removed to thin them down, depending on the use-case.

I got frustrated, so I wrote a linter https://github.com/missinglink/pbflint which you can use to validate the extracts, it's not feature complete but it's good enough for now to find some of the invalid entities.

re: your comment @heffergm:

every time we seem to choose one way, we get complaints about having preferred the other

I know what you mean, it's probably down to the way that people use the data, for the purposes of display the truncated road segments are probably fine, it's possible that the road finishes slightly before the extract border but that's not a huge deal.

The thing for me is; for analytical and data science purposes the extracts are not valid because they mutate the source data, this is a big problem and there is no indication in the file which entities have been mutated and which have not.

I see a 4 different options when dealing with entity relationships:

  1. recurse and include all the things.
  2. remove nodes/ways which do not lie inside the bbox, do not update the way refs/relation members
  3. remove nodes/ways which do not lie inside the bbox, update the way refs/relation members
  4. remove nodes/ways which do not lie inside the bbox, remove any elements which reference them

no 1. would be my preference but due to the OOM errors this is probably not going to work at scale. no 2. is what geofabrik does, this bloats the file with a bunch of invalid elements and makes it hard to parse. no 3. is what mapzen does, this mutates and invalidates records, deletes edges from polygons and does not reproduce the data verbatim. no 4. would be a better solution as the entities inside would be valid and copied verbatim, the downside is that roads/ways/relations which exited the bbox would not be included at all.

thoughts?

heffergm commented 7 years ago

It sounds like you're the expert, so you tell me ;)

What I can point to is what we're doing now, which in the case of PBF extracts is this (I think we're talking about pbf, yes?): https://github.com/mapzen/chef-mapzen_odes/blob/master/templates/default/extracts.sh.erb#L5

Note --drop-broken-refs.

The short of it is, if osmconvert supports it, I can have it to whatever you'd prefer.

missinglink commented 7 years ago

ha! yea I'm becoming quite the expert :)

so you feed the exporter data/planet.o5m and then it cuts the extracts from that? is it possible that we cut from a postgres database export instead?

i'll have a look in to the settings for osmconvert and run the linter against the output, worst case scenario we have to write something custom.

heffergm commented 7 years ago

The short answer is no, not without completely re-architecting the entire thing...

We do currently use postgres, but only to produce shp files (the extracts we cut from the planet file are loaded into postgres then dumped as shapes).

missinglink commented 7 years ago

right, so I slept on this and I have figured out a way producing valid extracts using option 1. without having memory issues and which can cut small city extracts in under a minute.

the idea is to use option 1, which recurses and includes ways/relations that are only partially inside the bbox (with the exception of super relations) and then; since they are valid for all child bboxes, we can just use any larger extract to produce the smaller ones. Eg. use new_zealand to cut wellington.

@migurski if you want to take this further let me know and we can set up a video call