uchicago-capp-30320 / RouteRangers

Transit planning tool designed to retrieve end user input and also facilitate local transit authorities' decision making.
MIT License

[ingest bottleneck 1] Fix Pandas joins in handroll_geometrize_routes() or find other way to unify needed information #62

Closed mbjackson-capp closed 1 week ago

mbjackson-capp commented 1 week ago

This is taking place on branch actually_geometrize_and_ingest_to_postgres_transit_stations. The relevant function is handroll_geometrize_routes() in scripts/extract_scheduled_gtfs.py.

In order to ingest the TransitRoute data into the backend properly, we need to have all this information for each route* on a single row of a single GeoDataFrame:

Per the GTFS specification's table relationships, the way to get this information together is to join shapes with trips on shape_id, and then to join trips with routes (on route_id) to bring in the route columns. However, every attempt I have made to use pd.merge() to combine the geometrized shapes with the rest of the info has failed: in the joined table, every column from one of the source tables contains nothing but NaN. This defeats the purpose of a join altogether.
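One way to narrow down a merge that comes back all-NaN is to ask pandas which side each row came from. The snippet below is a minimal sketch on invented toy data, not the project's real tables: the frame contents are assumptions, and only the column name shape_id comes from the description above. It uses the `indicator=True` option of `merge()` and prints the `repr()` of the keys so that invisible characters become visible.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values here are invented.
shapes = pd.DataFrame(
    {"shape_id": ["S1", "S2"], "geometry": ["LINESTRING A", "LINESTRING B"]}
)
routes_trips = pd.DataFrame(
    {"shape_id": [" S1", " S2"], "route_id": ["Red", "Blue"]}
)

# indicator=True adds a "_merge" column saying where each row was found:
# "both", "left_only", or "right_only".
merged = shapes.merge(routes_trips, on="shape_id", how="left", indicator=True)

# If no key matched, every row is "left_only" and the right-hand columns are NaN.
print(merged["_merge"].value_counts())
print(merged["route_id"].isna().all())

# repr() exposes characters that a printed DataFrame hides, such as padding.
print([repr(key) for key in shapes["shape_id"]])
print([repr(key) for key in routes_trips["shape_id"]])
```

If every row is `left_only` (or `right_only`), the keys never compared equal, so the next thing to check is the key values themselves: dtype, case, and stray whitespace.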

If we can figure out how to make this merge work properly, the main remaining bottleneck to having route data for all three cities ingested into our backend will be cleared. Alternatively, we could create a GeoDataFrame with the shapes and a regular DataFrame with the other info, guaranteed to have the same routes in the same order, and iterate through both row by row at the same time to create TransitRoute objects on ingest. Any help much appreciated!
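The row-by-row fallback could look roughly like the sketch below. It is only an illustration under stated assumptions: `TransitRouteStub` is a hypothetical stand-in for the real TransitRoute model, the frames hold invented values, and plain strings take the place of real geometries. The key point is to sort both frames by the shared key and verify that the keys line up before pairing rows by position.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class TransitRouteStub:
    """Hypothetical stand-in for the real TransitRoute model."""

    route_id: str
    geometry: str


# Toy data, deliberately in different row orders.
shapes = pd.DataFrame(
    {"shape_id": ["S2", "S1"], "geometry": ["LINESTRING B", "LINESTRING A"]}
)
info = pd.DataFrame({"shape_id": ["S1", "S2"], "route_id": ["Red", "Blue"]})

# Sort by the key and reset the index so position i means the same shape in both.
shapes_sorted = shapes.sort_values("shape_id").reset_index(drop=True)
info_sorted = info.sort_values("shape_id").reset_index(drop=True)

# Fail loudly rather than silently pairing the wrong rows.
if not shapes_sorted["shape_id"].equals(info_sorted["shape_id"]):
    raise ValueError("shape_id columns do not line up between the two frames")

# Walk both frames in lockstep and build one object per row pair.
routes = [
    TransitRouteStub(route_id=info_row.route_id, geometry=shape_row.geometry)
    for shape_row, info_row in zip(
        shapes_sorted.itertuples(index=False),
        info_sorted.itertuples(index=False),
    )
]
print(routes)
```

The alignment check matters because pairing by position has no key comparison of its own; without it, a dropped or reordered row would attach geometries to the wrong routes without raising any error.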

*Technically, each possible trip within the system. I think that for branching routes like the Green Line or Metra Electric, this results in several distinct LineStrings of the same color that would look unified when presented to the user on the frontend map.

mbjackson-capp commented 1 week ago

Diagnostic screenshots: [diagnostic screenshot 1] [diagnostic screenshot 2] [diagnostic screenshot 3]

mbjackson-capp commented 1 week ago

Hypotheses:

mbjackson-capp commented 1 week ago

Working hypothesis: the join key values in routes_trips have leading whitespace that is absent in shapes, so every equality check on the join key fails and no rows match.

mbjackson-capp commented 1 week ago

It was that
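For anyone hitting the same symptom, a fix along these lines should work: normalize the key on both sides before merging. This is a minimal sketch on invented toy data, not the code that was actually committed; `strip_key` is a hypothetical helper name, and it assumes the key column holds strings.

```python
import pandas as pd


def strip_key(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return a copy of df with surrounding whitespace removed from the key column.

    Hypothetical helper; assumes the key column contains strings.
    """
    out = df.copy()
    out[key] = out[key].str.strip()
    return out


# Toy data reproducing the problem: one side has a leading space in the key.
shapes = pd.DataFrame(
    {"shape_id": ["S1", "S2"], "geometry": ["LINESTRING A", "LINESTRING B"]}
)
routes_trips = pd.DataFrame(
    {"shape_id": [" S1", " S2"], "route_id": ["Red", "Blue"]}
)

# Clean both sides, then merge; the keys now compare equal.
merged = strip_key(shapes, "shape_id").merge(
    strip_key(routes_trips, "shape_id"), on="shape_id", how="left"
)
print(merged)
```

If the padding comes from spaces after the delimiter in the source .txt files, passing `skipinitialspace=True` to `pd.read_csv()` is another option that removes it at load time.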