@dancesWithCycles asked in the `berlin-gtfs-rt-server` project (which transitively depends on `gtfs-via-postgres`) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll explain my experience with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!
## why the import needs to be (more or less) atomic

From https://github.com/derhuerst/berlin-gtfs-rt-server/issues/9#issuecomment-1942333891:

> I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.
> At the end of the day, I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

An alternative approach would be a script that cleans up an existing database without dropping it, so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.
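To make that failure mode concrete, here is a minimal sketch of such a cleanup-based update; the table names are examples from a typical GTFS schema, not a prescribed layout:

```sql
-- Sketch of the cleanup-based approach. Table names are illustrative.
BEGIN;
TRUNCATE stop_times, trips, routes, stops CASCADE;
COMMIT;
-- The re-import happens *after* the cleanup has been committed. If the
-- importing process crashes here, or simply takes a while, consumers see
-- an empty (or partially filled) database in the meantime.
```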
## separate DBs

Recently, in `postgis-gtfs-importer`, I tackled the problem differently by using >1 DBs:

- For each GTFS import, a new DB named `gtfs_$unix_timestamp` gets created, and the data is imported into it.
- After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
- Before each import, all import DBs other than the latest two are deleted.
This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock.
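As a rough illustration of that idea (a hypothetical sketch, not `postgis-gtfs-importer`'s actual bookkeeping schema; table and DB names are made up), the "mark as latest" step could look like this:

```sql
-- Hypothetical bookkeeping table; lives in the dedicated "bookkeeping" DB.
CREATE TABLE IF NOT EXISTS latest_import (
	always_true BOOLEAN PRIMARY KEY DEFAULT TRUE CHECK (always_true), -- single-row table
	db_name TEXT NOT NULL,
	imported_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

BEGIN;
-- The exclusive lock serialises concurrent imports: a second importer
-- blocks here until the first one has committed or rolled back.
LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE;
INSERT INTO latest_import (db_name)
VALUES ('gtfs_1718000000') -- the freshly imported DB
ON CONFLICT (always_true) DO UPDATE
	SET db_name = EXCLUDED.db_name,
		imported_at = now();
COMMIT;
```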
One problem remains: the consuming program then needs to connect to a DB with a dynamic name. Because at MobiData BW IPL we have PgBouncer in place anyway, we use it to "alias" this dynamically named DB under a stable name (e.g. `gtfs`). There are a lot of gotchas involved here, though.
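Conceptually, the aliasing boils down to an entry in PgBouncer's `[databases]` section; the host, port and DB name below are made-up examples:

```ini
; pgbouncer.ini (excerpt); connection details are examples.
[databases]
; Clients always connect to "gtfs"; PgBouncer forwards them to whichever
; DB is currently the latest import. After each import, the importer
; rewrites this line and tells PgBouncer to RELOAD its config.
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1718000000
```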
TL;DR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or its process. Otherwise, consider other options.
## separate schemas

Now that `gtfs-via-postgres` has gained the ability to import >1 GTFS datasets into 1 DB with version 4.9.0, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.

I see the following advantages:

- Hosting environments where creating an arbitrary number of DBs is not allowed are supported.
- We can get rid of the whole PgBouncer hassle (see above).
- A schema can be created or deleted within a transaction, so the process would be truly atomic. It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore (see the sketch below).
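Here is a minimal sketch of that variant, assuming the consuming service reads from a schema with the stable name `gtfs`; the schema names and the rename-based swap are my assumptions, not an existing `postgis-gtfs-importer` feature:

```sql
-- Sketch of a schema-based import. Because DDL is transactional in
-- PostgreSQL, either all of this takes effect or none of it does.
BEGIN;
CREATE SCHEMA gtfs_new;
-- … import the GTFS feed into gtfs_new here …
DROP SCHEMA IF EXISTS gtfs CASCADE; -- delete the previous import, if any
ALTER SCHEMA gtfs_new RENAME TO gtfs;
COMMIT;
-- If the import fails before COMMIT, everything rolls back and consumers
-- keep reading the previous "gtfs" schema without interruption.
```

Whether to rename the new schema into a stable name, as above, or to keep timestamped schemas and re-point the service's `search_path`, is a design choice; the rename keeps consumers' queries unchanged.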
However, there are disadvantages:

- … `gtfs-via-postgres` won't be possible.