To ingest the TransitRoute data into the backend properly, we need all of the following for each route* on a single row of a single GeoDataFrame:
shape_id (exists in GTFS files trips.txt and shapes.txt)
route_id (exists in GTFS files trips.txt and routes.txt)
The other relevant columns that go into the data model: service_id (from trips.txt), plus route_type, route_long_name, and route_color (from routes.txt)
A LineString representing the coordinates of the transit line as it curves around on the Earth's surface (each row in the original shapes.txt represents a single point on such a LineString; the gtfs_kit library's method geometrize_shapes() produces a GeoDataFrame with one row for each shape_id and its geometry).
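For reference, the point-to-line step described in the last item can be sketched by hand with pandas alone. This is only an illustration of the idea, not what gtfs_kit does internally: the table values are made up, and the column names are the standard shapes.txt fields (shape_pt_lat, shape_pt_lon, shape_pt_sequence).

```python
import pandas as pd

# Made-up rows in the layout of shapes.txt: one row per point, in arbitrary order.
shapes_txt = pd.DataFrame({
    "shape_id": ["A", "A", "B", "A", "B"],
    "shape_pt_lat": [41.88, 41.90, 41.70, 41.89, 41.72],
    "shape_pt_lon": [-87.63, -87.65, -87.60, -87.64, -87.61],
    "shape_pt_sequence": [1, 3, 1, 2, 2],
})

# Points must be ordered by shape_pt_sequence within each shape before
# they can be strung together into a line.
ordered = shapes_txt.sort_values(["shape_id", "shape_pt_sequence"])

# One list of (lon, lat) pairs per shape_id; x = longitude, y = latitude.
coords = {
    shape_id: list(zip(group["shape_pt_lon"], group["shape_pt_lat"]))
    for shape_id, group in ordered.groupby("shape_id", sort=False)
}
```

Each coordinate list could then be passed to shapely.geometry.LineString to get the geometry for that shape_id.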
Per the table relationships in the GTFS specification, the way to join this information is to join shapes with trips on shape_id, and then join trips with routes on route_id to bring in the route columns. However, every attempt I have made to use pd.merge() to combine the geometrized shapes with the rest of the info has failed: in the joined table, every value in all columns from one of the source tables is replaced with NaN. This defeats the purpose of a join altogether.
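I don't know the actual cause of the NaNs yet, but one common cause of an all-NaN side in a left merge is join keys that never compare equal (for example, stray whitespace, or one table parsed as integers and the other as strings). Here is a minimal sketch of the two-step join with the keys normalized first. All table values are made up, clean_key is a hypothetical helper, and a WKT string stands in for the real geometry so that the snippet runs with pandas alone.

```python
import pandas as pd

def clean_key(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast a join key to stripped strings."""
    return series.astype(str).str.strip()

# Toy stand-ins for the three tables (illustrative values only).
shapes = pd.DataFrame({
    "shape_id": [101, 102],  # parsed as integers
    "geometry": ["LINESTRING (0 0, 1 1)", "LINESTRING (0 0, 2 1)"],
})
trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "route_id": ["Green", "Green", "Green"],
    "service_id": ["wk", "wk", "sat"],
    "shape_id": ["101", "101 ", "102"],  # strings, one with a trailing space
})
routes = pd.DataFrame({
    "route_id": ["Green"],
    "route_type": [1],
    "route_long_name": ["Green Line"],
    "route_color": ["009B3A"],
})

# Normalize every join key on both sides before merging.
for frame, key in ((shapes, "shape_id"), (trips, "shape_id"),
                   (trips, "route_id"), (routes, "route_id")):
    frame[key] = clean_key(frame[key])

# Many trips share one shape, so reduce trips to one row per shape_id.
# This keeps the first service_id seen for each shape, which is a simplification.
shape_to_route = trips[["shape_id", "route_id", "service_id"]].drop_duplicates("shape_id")

merged = (
    shapes
    # indicator adds a column showing whether each row matched on both sides
    .merge(shape_to_route, on="shape_id", how="left",
           validate="one_to_one", indicator="shape_match")
    .merge(routes, on="route_id", how="left", validate="many_to_one")
)

print(merged["shape_match"].value_counts())
```

If shape_match shows left_only for every row, the keys are not matching and the right-hand columns will be all NaN. With a real GeoDataFrame, calling merge on the GeoDataFrame itself (so that it is the left table) should keep the result a GeoDataFrame.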
If we can figure out how to make this merge work properly, the main remaining bottleneck to getting route data for all three cities ingested into our backend will be cleared. Alternatively, we could create a GeoDataFrame with the shapes and a regular DataFrame with the other info, guaranteed to have the same routes in the same order, and iterate through both row by row to create TransitRoute objects on ingest. Any help much appreciated!
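The fallback of walking two aligned tables could look roughly like this. It is a sketch under assumptions: the TransitRoute dataclass here is a hypothetical stand-in for our real model, the values are made up, and a WKT string again stands in for the geometry.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class TransitRoute:
    """Hypothetical stand-in for the real backend model."""
    shape_id: str
    route_id: str
    geometry: str

# Shapes table (would be the GeoDataFrame) and attribute table, in different orders.
shapes = pd.DataFrame({
    "shape_id": ["102", "101"],
    "geometry": ["LINESTRING (0 0, 2 1)", "LINESTRING (0 0, 1 1)"],
})
attrs = pd.DataFrame({
    "shape_id": ["101", "102"],
    "route_id": ["Green", "ME"],
})

# Reorder attrs to follow the shapes table exactly.
# reindex needs shape_id to be unique in attrs, otherwise it raises.
attrs_aligned = attrs.set_index("shape_id").reindex(shapes["shape_id"].tolist())

# A shape_id missing from attrs shows up as NaN here, so fail loudly.
if attrs_aligned["route_id"].isna().any():
    raise ValueError("some shape_id values have no matching attributes")

# Both tables now have the same rows in the same order, so zip is safe.
transit_routes = [
    TransitRoute(shape_id=s.shape_id, route_id=a.route_id, geometry=s.geometry)
    for s, a in zip(shapes.itertuples(index=False),
                    attrs_aligned.itertuples(index=False))
]
```

The explicit reindex and NaN check are what provide the "same routes in the same order" guarantee; without them a silent misalignment would attach the wrong attributes to a geometry.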
*Technically, each possible trip within the system. I think that for branching routes like the Green Line or Metra Electric, this results in several distinct LineStrings of the same color that would look unified when presented to the user on the frontend map.
This is taking place on branch actually_geometrize_and_ingest_to_postgres_transit_stations. The relevant function is handroll_geometrize_routes() in scripts/extract_scheduled_gtfs.py.