mapbox / nepomuk

A public transit router for GTFS feeds (currently only static) written in modern c++
MIT License

Extractor: clean up and normalize station names #52

Open daniel-j-h opened 7 years ago

daniel-j-h commented 7 years ago

I just had a look at the Berlin GTFS feeds for this year.

https://daten.berlin.de/datensaetze/vbb-fahrplandaten-januar-2017-bis-dezember-2017

stops.txt, CC-BY 3.0 licensed: http://www.vbb.de/de/datei/GTFS_VBB_Jan_Dez2017.zip

I can see three concrete issues:

1/ U-Bahn stations are named like U Alexanderplatz (Berlin), while other kinds of stops, e.g. for bus lines, follow a different naming scheme. We probably don't want to show or store the (Berlin) suffix (what about the U prefix?), and we want to associate a type with these stops. (Side note: other delimiters in this dataset seem to be / and extra information in brackets, [x]!)

2/ There are multiple stops with almost the same name, some differing only in the number of spaces in the stop name. We should probably trim and collapse multiple spaces within stop names.

3/ There are multiple stops for each stop name in the data. We can probably deduplicate based on their location (e.g. a haversine distance < 500 m is probably the same stop). How should we handle cases where the name is the same but the location is different?
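
To make 3/ concrete, here is a minimal Python sketch of location-based deduplication. It is only an illustration, not what the extractor does: the function names, the `(name, lat, lon)` tuple layout, and the greedy "first stop seen becomes the cluster representative" strategy are all assumptions. The 500 m threshold is the one floated above. Stops with the same name that are farther apart end up in separate clusters, which is one possible answer to the open question, not a settled one.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def deduplicate(stops, max_dist_m=500.0):
    """Greedily cluster (name, lat, lon) tuples.

    A stop joins the first existing cluster that has the same name and whose
    representative (the first stop seen) is closer than max_dist_m.
    Otherwise it starts a new cluster.
    """
    clusters = []
    for name, lat, lon in stops:
        for cluster in clusters:
            same_name = cluster["name"] == name
            if same_name and haversine_m(lat, lon, cluster["lat"], cluster["lon"]) < max_dist_m:
                cluster["members"].append((lat, lon))
                break
        else:
            # No matching cluster: same name but far away, or a new name.
            clusters.append({"name": name, "lat": lat, "lon": lon, "members": [(lat, lon)]})
    return clusters
```

Because the comparison is against the first stop seen rather than a centroid, the result depends on input order. That is acceptable for a prototype but worth revisiting for a proper implementation.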

The issues above are not specific to the Berlin GTFS feeds; there are probably more out there.
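
For 1/ and 2/, a rough Python sketch of what the name cleanup could look like. Everything here is an assumption made for illustration: the function name, the returned fields, and the `"U" -> "subway"` mapping (only the `U` prefix and the `(Berlin)` suffix appear in the example above; other prefixes would have to be added after looking at the data). It keeps the prefix as a separate type field instead of silently dropping it, since whether to strip it is still open. The `/` delimiter is not handled.

```python
import re

# Hypothetical prefix -> type mapping; only "U" is confirmed by the example name.
PREFIX_TYPES = {"U": "subway"}

_PREFIX = re.compile(r"^(?P<prefix>%s)\s+" % "|".join(re.escape(p) for p in PREFIX_TYPES))
_SUFFIX = re.compile(r"\s*\((?P<city>[^()]*)\)\s*$")   # trailing "(Berlin)"
_BRACKET = re.compile(r"\s*\[(?P<extra>[^\[\]]*)\]")   # extra info like "[x]"


def parse_stop_name(raw):
    """Split a raw stop name into a cleaned name, type, city suffix, and extras."""
    # 2/ trim and collapse runs of whitespace
    name = " ".join(raw.split())

    # pull out bracketed extras anywhere in the name
    extras = _BRACKET.findall(name)
    name = _BRACKET.sub("", name)

    # 1/ strip a trailing "(City)" suffix, remembering the city
    city = None
    match = _SUFFIX.search(name)
    if match:
        city = match.group("city")
        name = name[:match.start()]

    # 1/ turn a known prefix into a type
    stop_type = None
    match = _PREFIX.match(name)
    if match:
        stop_type = PREFIX_TYPES[match.group("prefix")]
        name = name[match.end():]

    return {"name": name.strip(), "type": stop_type, "city": city, "extras": extras}
```

The prefix only matches when it is followed by whitespace, so a name that merely starts with the letter U is left alone.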

Related:

MoKob commented 7 years ago

@daniel-j-h did you check the stop->station relation for 3/? Multiple stops (platforms) may represent the same station and will be named the same. Your look-up should probably only include stations, not stops.
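
A small Python sketch of that stop->station resolution, assuming the feed fills in the `parent_station` column that the GTFS spec defines for stops.txt (whether the VBB feed populates it has not been checked here). The function names are made up for illustration. Stops without a parent are kept as their own entry.

```python
import csv
import io


def load_stops(text):
    """Parse the contents of a GTFS stops.txt file into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def station_names(rows):
    """Map station stop_id -> stop_name, folding platforms into their parent.

    A row with a non-empty parent_station that exists in the feed is
    replaced by its parent; every other row stands for itself.
    """
    by_id = {row["stop_id"]: row for row in rows}
    names = {}
    for row in rows:
        parent_id = row.get("parent_station") or ""
        station = by_id.get(parent_id, row)
        names[station["stop_id"]] = station["stop_name"]
    return names
```

With this, several platforms that share a parent collapse into a single look-up entry, so duplicates by name only remain for stops that have no station relation at all.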

daniel-j-h commented 7 years ago

Good point. No, I did not; I simply worked with the stops.txt file for the prototype to get something going quickly. I just wrote a ten-line Python script to do this, nothing fancy. We should definitely do it properly here.

daniel-j-h commented 7 years ago

What just came to my mind: we should check if we can use the locations in the GTFS feeds and query OSM for those locations in order to extract station information from OSM.
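
As a sketch of what such a query could look like, the Python helper below builds an Overpass QL string for station nodes around a GTFS stop location. It only builds the query text and sends nothing. The function name, the 100 m default radius, and the choice of the `railway=station` tag are assumptions; bus stops and other stop types would need different tags.

```python
def overpass_station_query(lat, lon, radius_m=100):
    """Build an Overpass QL query for station nodes near a coordinate.

    Assumption: stations are tagged railway=station in OSM. The radius
    is in meters and is a guess, not a tuned value.
    """
    return (
        "[out:json];"
        f'node["railway"="station"](around:{radius_m},{lat},{lon});'
        "out;"
    )
```

The resulting string would be sent to an Overpass API endpoint, and the `name` tag of the returned nodes could then be compared against the GTFS stop name.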

1ec5 commented 7 years ago

Some prior art for cleaning station names in OSM and Wikidata, respectively:

https://github.com/mapbox/mapbox-streets-source/blob/7675b6a8369a8a84e6354b89050be1a826fb6729/pgsql/lib.sql#L280-L284

mapbox/mapbox-streets-source#748

/cc @ajashton