Open daniel-j-h opened 7 years ago
@daniel-j-h did you check the stop->station relation for 3/? Multiple stops (platforms) may represent the same station and will be named the same. Your look-up should probably only include stations, not stops.
Good point. No, I did not; I simply worked with the `stops.txt` file for the prototype to get something going quickly. I just wrote a 10-line Python script to do this, nothing fancy. We should definitely do it properly here.
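A rough sketch of what "only stations, not stops" could look like when reading `stops.txt`. It relies on the GTFS fields `location_type` (1 marks a station) and `parent_station`; both are optional, so a feed may not populate them. The function name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def stations_with_platforms(stops_file):
    """Map each station's stop_id to (station row, list of child stop rows)."""
    rows = list(csv.DictReader(stops_file))
    # location_type == 1 marks a station; empty or 0 marks a stop/platform.
    stations = {
        r["stop_id"]: r
        for r in rows
        if (r.get("location_type") or "").strip() == "1"
    }
    children = defaultdict(list)
    for r in rows:
        parent = (r.get("parent_station") or "").strip()
        if parent in stations:
            children[parent].append(r)
    return {sid: (row, children[sid]) for sid, row in stations.items()}


# Hypothetical sample data, not taken from the real Berlin feed.
SAMPLE = """stop_id,stop_name,location_type,parent_station
S1,U Alexanderplatz (Berlin),1,
P1,U Alexanderplatz (Berlin),0,S1
P2,U Alexanderplatz (Berlin),,S1
"""

result = stations_with_platforms(io.StringIO(SAMPLE))
```

If a feed leaves these columns empty, this returns nothing and we would have to fall back to name and location based grouping.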
What just came to my mind: we should check if we can use the locations in the GTFS feeds and query OSM for those locations in order to extract station information from OSM.
Some prior art for cleaning station names in OSM and Wikidata, respectively:
https://github.com/mapbox/mapbox-streets-source/blob/7675b6a8369a8a84e6354b89050be1a826fb6729/pgsql/lib.sql#L280-L284
mapbox/mapbox-streets-source#748
/cc @ajashton
I just had a look at the Berlin GTFS feeds for this year.
I can see three concrete issues:
1/ In there, stations for the U-Bahn are named `U Alexanderplatz (Berlin)`, and other kinds of stations, e.g. for bus lines, have a different naming scheme. We probably don't want to show and store the `(Berlin)` suffix (what about the `U` prefix?) and want to associate a type with these stops. (Sidenote: other delimiters seem to be `/` and extra information in brackets `[x]` in this dataset!)
2/ There are multiple stops with almost the same name in there, with some differing only in the number of spaces in the stop name. We probably should trim and collapse multiple spaces within stop names.
3/ There are multiple stops for each stop name in the data. We probably can deduplicate based on their location (e.g. haversine < 500m is probably the same stop). How should we handle cases where the name is the same but the location is different?
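For 3/, a minimal sketch of a greedy deduplication using the haversine distance. The 500 m threshold is the guess from above, the function names are hypothetical, and the coordinates are rough illustrative values. Same-named stops farther apart than the threshold are simply kept as separate entries here, which is only one possible answer to the open question.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.0.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def dedupe_by_name_and_distance(stops, max_dist_m=500.0):
    """Keep the first stop per name; drop later same-named stops within max_dist_m.

    `stops` is an iterable of (name, lat, lon) tuples. This is O(n^2) and
    order-dependent, which is fine for a prototype but not for large feeds.
    """
    kept = []
    for name, lat, lon in stops:
        is_duplicate = any(
            kept_name == name
            and haversine_m(lat, lon, kept_lat, kept_lon) < max_dist_m
            for kept_name, kept_lat, kept_lon in kept
        )
        if not is_duplicate:
            kept.append((name, lat, lon))
    return kept


# Illustrative coordinates only.
stops = [
    ("U Alexanderplatz", 52.5219, 13.4132),
    ("U Alexanderplatz", 52.5225, 13.4140),  # under 100 m away: dropped
    ("U Alexanderplatz", 52.6000, 13.4100),  # several km away: kept
]
kept = dedupe_by_name_and_distance(stops)
```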
The issues above are not specific to the Berlin GTFS feeds; there are probably more out there.
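For 1/ and 2/, something along these lines might be a starting point. It collapses whitespace and splits off a trailing parenthesized suffix such as `(Berlin)`. It deliberately leaves the `U` prefix, the `/` delimiter and the `[x]` brackets alone, since how to treat those is still open. Function names are hypothetical.

```python
import re

_WHITESPACE = re.compile(r"\s+")
# A trailing "(...)" group at the very end of the name, e.g. " (Berlin)".
_TRAILING_PARENS = re.compile(r"\s*\((?P<suffix>[^()]*)\)$")


def collapse_whitespace(name):
    """Trim the name and collapse runs of whitespace into single spaces."""
    return _WHITESPACE.sub(" ", name).strip()


def split_trailing_suffix(name):
    """Return (clean_name, suffix); suffix is None when there is no trailing '(...)'."""
    name = collapse_whitespace(name)
    match = _TRAILING_PARENS.search(name)
    if match is None:
        return name, None
    return name[:match.start()], match.group("suffix")


cleaned = split_trailing_suffix("U  Alexanderplatz   (Berlin) ")
```

Returning the suffix instead of discarding it keeps the option of using it later, e.g. to disambiguate stops with the same name in different cities.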
Related: