Closed cedarbaum closed 1 year ago
Overall looks pretty straightforward to me (if tedious to write I bet!).
One difference between this WIP and my prototyping in https://github.com/jamespfennell/transiter/tree/schedule is that in the prototype I don't bother updating services or scheduled trips. Instead, during each GTFS static update I just delete the services first (which deletes the scheduled trips and stop times via ON DELETE CASCADE
) and then insert them from scratch.
I think there's a few reasons why this may be worth considering.
Firstly, no other entities in the schema depend on the schedule so deleting the schedule doesn't mess up other entities. Conversely, we could not do delete+insert for routes, because if we delete a route the realtime trips in that route also get deleted (again via ON DELETE CASCADE
). For routes we need to update in place to preserve all the child entities that depend on the route. But for the schedule, we don't have this constraint.
Secondly, I think deleting and re-inserting is going to be quite faster than updating if we're able to use COPY FROM
for the inserts. (I guess a way to test this theory would be to download an old version of the subway's GTFS static data from Transitland and then write some code that manually performs an update using this old data.)
Finally, it would make the code a bit simpler.
This makes a lot of sense, thanks! The latest commits only deletes all services that are either (1) in the feed or (2) have a system_pk
/id
combination contained in the incoming entities. I think case (2) was maybe alluded to in your branch as a TODO if I understood that correctly. Everything else should be taken care of by cascading deletes (except for shapes, which is handled separately).
I think this still needs a couple more iterations and additional manual testing with real systems. Let me know your thoughts on continuing to iterate on this PR vs combining it with your work so far - I am happy to go with either approach.
So far this looks great to me! I don't really have any suggestions, it seems like it's doing the right thing. I'm guessing the performance on the NYC buses is about 2 minutes?
Definitely think we should go with this PR rather than my prototype - seems like this PR is a strict superset at this point :).
Sounds good! I went ahead and added another commit where I use copyfrom
for inserting scheduled trips as well. This appears to be marginally faster on my machine, though not as significant of an improvement as I had hoped for the bus system unfortunately.
I removed the WIP label and think this is ready for a full review when you have time.
Yeah unfortunately the time to update is 80% inserting stop times, so any optimizations one does for other entities probably won't have a huge effect on the overall timings. Still, even small improvements are nice.
The PR looks great to me, I would be happy for you to (squash) merge it in.
FYI left a comment on https://github.com/jamespfennell/transiter/issues/11 about designing an API around this newly persisted data, which may be tricky.
Overview
This change adds the ETL step for the below tables:
Services
scheduled_service
scheduled_service_addition
scheduled_service_removal
Schedules
scheduled_trip
wheelchair_accessible
is converted from string to nullable boolbikes_allowed
is converted from string to nullable boolshape_pk
foreign key relation with newshape
tablescheduled_trip_frequency
start_time
andend_time
to int types, since PostgreSQL'sTime
type is not capable of overflowing to the next day, which is required for GTFS times.scheduled_trip_stop_time
arrival_time
anddeparture_time
to int types, since PostgreSQL'sTime
type is not capable of overflowing to the next day, which is required for GTFS times.Shapes
shape
(NEW)Performance
Based off @jamespfennell's WIP
schedule
branch,copyfrom
is used for inserting scheduled trips and scheduled stop times. There may be other queries that can be optimized similarly.Testing
Tests are added in
static_test.go
.