jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License
62 stars 7 forks source link

Load schedule/service static data, add shapes #119

Closed cedarbaum closed 1 year ago

cedarbaum commented 1 year ago

Overview

This change adds the ETL step for the below tables:

Services

Schedules

Shapes

Performance

Based off @jamespfennell's WIP schedule branch, copyfrom is used for inserting scheduled trips and scheduled stop times. There may be other queries that can be optimized similarly.

Testing

Tests are added in static_test.go.

jamespfennell commented 1 year ago

Overall looks pretty straightforward to me (if tedious to write I bet!).

One difference between this WIP and my prototyping in https://github.com/jamespfennell/transiter/tree/schedule is that in the prototype I don't bother updating services or scheduled trips. Instead, during each GTFS static update I just delete the services first (which deletes the scheduled trips and stop times via ON DELETE CASCADE) and then insert them from scratch.

I think there's a few reasons why this may be worth considering.

Firstly, no other entities in the schema depend on the schedule so deleting the schedule doesn't mess up other entities. Conversely, we could not do delete+insert for routes, because if we delete a route the realtime trips in that route also get deleted (again via ON DELETE CASCADE). For routes we need to update in place to preserve all the child entities that depend on the route. But for the schedule, we don't have this constraint.

Secondly, I think deleting and re-inserting is going to be quite faster than updating if we're able to use COPY FROM for the inserts. (I guess a way to test this theory would be to download an old version of the subway's GTFS static data from Transitland and then write some code that manually performs an update using this old data.)

Finally, it would make the code a bit simpler.

cedarbaum commented 1 year ago

This makes a lot of sense, thanks! The latest commits only deletes all services that are either (1) in the feed or (2) have a system_pk/id combination contained in the incoming entities. I think case (2) was maybe alluded to in your branch as a TODO if I understood that correctly. Everything else should be taken care of by cascading deletes (except for shapes, which is handled separately).

I think this still needs a couple more iterations and additional manual testing with real systems. Let me know your thoughts on continuing to iterate on this PR vs combining it with your work so far - I am happy to go with either approach.

jamespfennell commented 1 year ago

So far this looks great to me! I don't really have any suggestions, it seems like it's doing the right thing. I'm guessing the performance on the NYC buses is about 2 minutes?

Definitely think we should go with this PR rather than my prototype - seems like this PR is a strict superset at this point :).

cedarbaum commented 1 year ago

Sounds good! I went ahead and added another commit where I use copyfrom for inserting scheduled trips as well. This appears to be marginally faster on my machine, though not as significant of an improvement as I had hoped for the bus system unfortunately.

I removed the WIP label and think this is ready for a full review when you have time.

jamespfennell commented 1 year ago

Yeah unfortunately the time to update is 80% inserting stop times, so any optimizations one does for other entities probably won't have a huge effect on the overall timings. Still, even small improvements are nice.

The PR looks great to me, I would be happy for you to (squash) merge it in.

jamespfennell commented 1 year ago

FYI left a comment on https://github.com/jamespfennell/transiter/issues/11 about designing an API around this newly persisted data, which may be tricky.