[ingest bottleneck 1] Fix Pandas joins in handroll_geometrize_routes() or find other way to unify needed information

mbjackson-capp commented 1 week ago

This is taking place on branch actually_geometrize_and_ingest_to_postgres_transit_stations. The relevant function is handroll_geometrize_routes() in scripts/extract_scheduled_gtfs.py.

In order to ingest the TransitRoute data into the backend properly, we need to have all this information for each route* on a single row of a single GeoDataFrame:

shape_id (exists in GTFS files trips.txt and shapes.txt)
route_id (exists in GTFS files trips.txt and routes.txt)
The other relevant columns that go into the data model: service_id (from trips.txt) and route_type, route_long_name, and route_color (from routes.txt)
A LineString representing the coordinates of the transit line as it curves around on the Earth's surface (each row in the original shapes.txt represents a single point on such a LineString; the gtfs_kit library's method geometrize_shapes() produces a GeoDataFrame with one row for each shape_id and its geometry).

Per the GTFS specification's table relationships, the way to bring this information together is to join shapes with trips on shape_id, and then join trips with routes on route_id to bring in the route columns. However, every attempt I have made to use pd.merge() to combine the geometrized shapes with the rest of the info has failed: the joined table comes back with every value in the columns from one of the source tables replaced with NaN. This defeats the purpose of a join altogether.
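For reference, the join chain described above could look roughly like the sketch below. It is not the code on the branch: the toy tables, their values, and the WKT strings standing in for real LineString geometries are all made up for illustration. The `indicator` argument to merge adds a column saying whether each row found a match, which makes an all-NaN result easier to explain.

```python
import pandas as pd

# Toy stand-ins for the GTFS tables; every value here is invented.
shapes = pd.DataFrame({
    "shape_id": ["s1", "s2"],
    # WKT text as a placeholder for real LineString geometry objects.
    "geometry": ["LINESTRING (0 0, 1 1)", "LINESTRING (1 1, 2 2)"],
})
trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "route_id": ["Red", "Red", "Blue"],
    "shape_id": ["s1", "s1", "s2"],
})
routes = pd.DataFrame({
    "route_id": ["Red", "Blue"],
    "route_long_name": ["Red Line", "Blue Line"],
    "route_color": ["C60C30", "00A1DE"],
})

# Many trips share one shape, so reduce trips to unique (shape_id, route_id) pairs.
shape_routes = trips[["shape_id", "route_id"]].drop_duplicates()

# shapes -> trips on shape_id, then -> routes on route_id.
joined = (
    shapes
    .merge(shape_routes, on="shape_id", how="left", indicator="shape_match")
    .merge(routes, on="route_id", how="left")
)

# "both" means the shape found a trip; "left_only" means the key never matched.
print(joined["shape_match"].value_counts())
```

If every row comes back as "left_only", the problem is in the key values themselves rather than in the merge call.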

If we can figure out how to make this merge work properly (or, alternatively, just create a GeoDataFrame with the shapes and a regular DataFrame with the other info, which are guaranteed to have the same routes in the same order, and iterate through both row by row at the same time to create TransitRoute objects on ingest), the main remaining bottleneck to having route data for all three cities ingested into our backend will be cleared. Any help much appreciated!
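The row-by-row fallback could be sketched as follows. This is only an illustration under assumptions: the two toy tables are invented, plain dicts stand in for TransitRoute objects, and WKT strings stand in for geometries. The important part is the check that refuses to pair rows by position unless the keys line up exactly.

```python
import pandas as pd

# Invented data; in practice geoms would be the geometrized shapes.
geoms = pd.DataFrame({
    "shape_id": ["s2", "s1"],
    "geometry": ["LINESTRING (1 1, 2 2)", "LINESTRING (0 0, 1 1)"],
})
info = pd.DataFrame({
    "shape_id": ["s1", "s2"],
    "route_id": ["Red", "Blue"],
})

# Put both tables in the same order with a fresh positional index.
geoms = geoms.sort_values("shape_id").reset_index(drop=True)
info = info.sort_values("shape_id").reset_index(drop=True)

# Pairing by position is only safe if the key columns are identical.
if not geoms["shape_id"].equals(info["shape_id"]):
    raise ValueError("shape_id columns differ; rows cannot be paired by position")

# Dicts are a hypothetical stand-in for TransitRoute objects.
records = [
    {"shape_id": g.shape_id, "route_id": i.route_id, "geometry": g.geometry}
    for g, i in zip(geoms.itertuples(index=False), info.itertuples(index=False))
]
```

Note that this approach would trip on the same mismatched keys that break a merge, but it fails loudly instead of silently producing NaN.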

*Technically, each possible trip within the system. I think that for branching routes like the Green Line or Metra Electric, this results in several distinct LineStrings of the same color that would look unified when presented to the user on the frontend map.

mbjackson-capp commented 1 week ago

Diagnostic screenshots:

mbjackson-capp commented 1 week ago

Hypotheses:

Merging a GeoDataFrame with a regular DataFrame is rejected or mishandled in some way
Some of the key values that should be equal have subtly different data types, so the join fails to match them
I got the pd.merge() syntax wrong
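A quick way to narrow down hypotheses like these is to look at the join key directly on both sides. The helper below is a hypothetical sketch (`diagnose_key` is not part of pandas or of this repo, and the toy tables are invented): it compares dtypes, counts how many key values the two tables actually share, and shows the `repr` of a few values so that invisible differences such as stray whitespace become visible.

```python
import pandas as pd

# Invented tables; the right-hand keys deliberately differ in a subtle way.
left = pd.DataFrame({"shape_id": ["s1", "s2"]})
right = pd.DataFrame({"shape_id": [" s1", " s2"], "route_id": ["Red", "Blue"]})

def diagnose_key(left, right, key):
    """Collect quick facts about why a merge on `key` might find no matches."""
    left_keys, right_keys = set(left[key]), set(right[key])
    return {
        "left_dtype": str(left[key].dtype),
        "right_dtype": str(right[key].dtype),
        # Zero shared keys means no row can match, whatever the merge syntax.
        "shared_keys": len(left_keys & right_keys),
        # repr() exposes whitespace and type differences that print() hides.
        "left_sample": [repr(v) for v in left[key].head(3)],
        "right_sample": [repr(v) for v in right[key].head(3)],
    }

report = diagnose_key(left, right, "shape_id")
for name, value in report.items():
    print(name, value)
```

Matching dtypes with zero shared keys points away from a type problem and toward the values themselves.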

mbjackson-capp commented 1 week ago

Working hypothesis: the key values in routes_trips have leading whitespace that the values in shapes lack, so every equality check on the join key fails and no rows match.

mbjackson-capp commented 1 week ago

It was that
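For anyone hitting the same thing, a minimal sketch of the fix, assuming the whitespace is on shape_id (the toy tables and values below are invented, with WKT strings standing in for geometries): strip the key column before merging.

```python
import pandas as pd

# Invented data reproducing the mismatch: note the leading spaces on the right.
shapes = pd.DataFrame({
    "shape_id": ["s1", "s2"],
    "geometry": ["LINESTRING (0 0, 1 1)", "LINESTRING (1 1, 2 2)"],
})
routes_trips = pd.DataFrame({
    "shape_id": [" s1", " s2"],
    "route_id": ["Red", "Blue"],
})

# Before: "s1" != " s1", so the left join finds nothing and route_id is all NaN.
before = shapes.merge(routes_trips, on="shape_id", how="left")

# Strip leading/trailing whitespace from the join key, then merge again.
routes_trips["shape_id"] = routes_trips["shape_id"].str.strip()
after = shapes.merge(routes_trips, on="shape_id", how="left")

print(before["route_id"].isna().all())
print(after["route_id"].tolist())
```

If the whitespace comes from the source CSV itself, passing `skipinitialspace=True` to `pd.read_csv` may avoid the problem at load time, since it skips spaces after the delimiter.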

uchicago-capp-30320 / RouteRangers, issue #62