public-transport / gtfs-via-postgres

Process GTFS Static/Schedule by importing it into a PostgreSQL database.
https://github.com/derhuerst/gtfs-via-postgres#gtfs-via-postgres

How to programmatically re-import changing datasets in production? #57

Open derhuerst opened 8 months ago

derhuerst commented 8 months ago

@dancesWithCycles asked in the berlin-gtfs-rt-server project (which transitively depends on gtfs-via-postgres) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll describe my experience with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!

why the import needs to be (more or less) atomic

From https://github.com/derhuerst/berlin-gtfs-rt-server/issues/9#issuecomment-1942333891:

An alternative approach would be a script that cleans up an existing database without dropping it so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.
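To make the difference concrete, here is a minimal sketch of the "import into a fresh target first, switch afterwards" order of operations. It is not taken from any of the tools mentioned here; `reimport` and the three callables passed into it are hypothetical placeholders for whatever actually runs the import, repoints the consumers and deletes old data.

```python
from typing import Callable, Optional


def reimport(
    import_fresh: Callable[[], str],
    switch_to: Callable[[str], None],
    drop: Callable[[str], None],
    current: Optional[str],
) -> Optional[str]:
    """Import into a fresh target, then switch. Returns the active target.

    All three callables are hypothetical hooks supplied by the caller.
    """
    try:
        # Slow step. Consumers keep reading `current` the whole time.
        fresh = import_fresh()
    except Exception:
        # Nothing was cleaned up beforehand, so the service is unaffected.
        # A real script would log the error and remove the half-imported target.
        return current
    # The only moment at which consumers notice anything.
    switch_to(fresh)
    if current is not None:
        # Old data is removed only after the switch has happened.
        drop(current)
    return fresh
```

If the import crashes, the previous dataset is still being served, which is exactly what the clean-up-in-place design cannot guarantee.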

separate DBs

I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.

At the end of the day I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

Recently, in postgis-gtfs-importer, I tackled the problem differently by using >1 DBs:

One problem remains: The consuming program then needs to connect to a DB with a dynamic name. Because we have PgBouncer in place at MobiData BW IPL anyway, we use it to "alias" this dynamic DB to a stable name (e.g. gtfs). There are a lot of gotchas involved here though.
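As a rough illustration of the aliasing idea (not necessarily how it is set up at MobiData BW IPL), an entry in the `[databases]` section of `pgbouncer.ini` can map a stable client-side name to a different `dbname` on the server. The sketch below only renders such an entry; the helper name, the host/port defaults and the DB name `gtfs_1712345678` are made up for the example.

```python
def pgbouncer_alias(alias: str, target_db: str,
                    host: str = "localhost", port: int = 5432) -> str:
    """Render one [databases] entry mapping `alias` to `target_db`.

    `alias` is the name clients connect to; `target_db` is the
    dynamically named DB that currently holds the latest import.
    """
    return f"{alias} = host={host} port={port} dbname={target_db}"


def databases_section(alias: str, target_db: str) -> str:
    """Render a complete [databases] section containing just the alias."""
    return "[databases]\n" + pgbouncer_alias(alias, target_db) + "\n"


if __name__ == "__main__":
    # Hypothetical DB name with a timestamp suffix.
    print(databases_section("gtfs", "gtfs_1712345678"))
```

After rewriting the file, PgBouncer has to re-read its configuration (e.g. `RELOAD` on the admin console, or SIGHUP) before clients connecting to `gtfs` end up in the new DB.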

TLDR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or process. Otherwise, consider other options.

separate schemas

Now that gtfs-via-postgres, as of version 4.9.0, is able to import >1 GTFS datasets into 1 DB, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.
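A schema-based variant would need some bookkeeping for versioned schemas. The following is only a sketch under an assumed naming convention (`gtfs_` plus a UTC timestamp, all lowercase so the names work unquoted in PostgreSQL); neither the convention nor the helper names come from gtfs-via-postgres or postgis-gtfs-importer.

```python
import re
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Assumed naming convention, e.g. gtfs_20240501_120000
_PATTERN = re.compile(r"^gtfs_\d{8}_\d{6}$")


def schema_name(now: datetime) -> str:
    """Build a sortable schema name from a timezone-aware timestamp."""
    return "gtfs_" + now.astimezone(timezone.utc).strftime("%Y%m%d_%H%M%S")


def latest_and_stale(schemas: List[str],
                     keep: int = 2) -> Tuple[Optional[str], List[str]]:
    """Return the newest versioned schema and those that may be dropped.

    Schemas not matching the convention (e.g. public) are ignored.
    The newest `keep` schemas are never reported as stale.
    """
    if keep < 1:
        raise ValueError("keep must be >= 1")
    # Timestamps in the name sort lexicographically in chronological order.
    versioned = sorted(s for s in schemas if _PATTERN.match(s))
    if not versioned:
        return None, []
    return versioned[-1], versioned[:-keep]
```

The consuming service still has to find out which schema is the latest one, so the "dynamic name" problem described above for separate DBs applies here as well.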

I see the following advantages:

However, there are disadvantages:

derhuerst commented 8 months ago

related: https://github.com/mobidata-bw/ipl-orchestration/issues/8

derhuerst commented 6 days ago

https://github.com/mobidata-bw/postgis-gtfs-importer can be one building block to tackle this problem, but – given PostgreSQL's awkward programmatic handling of databases – it is quite complex.