@dancesWithCycles asked in the `berlin-gtfs-rt-server` project (which transitively depends on `gtfs-via-postgres`) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll explain my experience with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!
## why the import needs to be (more or less) atomic

From https://github.com/derhuerst/berlin-gtfs-rt-server/issues/9#issuecomment-1942333891:

> I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.
> At the end of the day, I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

An alternative approach would be a script that cleans up an existing database without dropping it, so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.
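To make that failure mode concrete, here is a minimal sketch of such a cleanup-based update; the table names are examples from a typical GTFS schema, not a prescribed layout:

```sql
-- Sketch of the cleanup-based approach. Table names are illustrative.
BEGIN;
TRUNCATE stop_times, trips, routes, stops CASCADE;
COMMIT;
-- The re-import happens *after* the cleanup has been committed. If the
-- importing process crashes here, or simply takes a while, consumers see
-- an empty (or partially filled) database in the meantime.
```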
## separate DBs

Recently, in `postgis-gtfs-importer`, I tackled the problem differently by using >1 DBs:

- For each GTFS import, a new DB named `gtfs_$unix_timestamp` gets created, and the data is imported into it.
- After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
- Before each import, all import DBs other than the latest two are deleted.
This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock.
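As a rough illustration of that idea (a hypothetical sketch, not `postgis-gtfs-importer`'s actual bookkeeping schema; table and DB names are made up), the "mark as latest" step could look like this:

```sql
-- Hypothetical bookkeeping table; lives in the dedicated "bookkeeping" DB.
CREATE TABLE IF NOT EXISTS latest_import (
	always_true BOOLEAN PRIMARY KEY DEFAULT TRUE CHECK (always_true), -- single-row table
	db_name TEXT NOT NULL,
	imported_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

BEGIN;
-- The exclusive lock serialises concurrent imports: a second importer
-- blocks here until the first one has committed or rolled back.
LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE;
INSERT INTO latest_import (db_name)
VALUES ('gtfs_1718000000') -- the freshly imported DB
ON CONFLICT (always_true) DO UPDATE
	SET db_name = EXCLUDED.db_name,
		imported_at = now();
COMMIT;
```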
One problem remains: the consuming program then needs to connect to a DB with a dynamic name. Because at MobiData BW IPL we have PgBouncer in place anyway, we use it to "alias" this dynamically named DB under a stable name (e.g. `gtfs`). There are a lot of gotchas involved here, though.
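Conceptually, the aliasing boils down to an entry in PgBouncer's `[databases]` section; the host, port and DB name below are made-up examples:

```ini
; pgbouncer.ini (excerpt); connection details are examples.
[databases]
; Clients always connect to "gtfs"; PgBouncer forwards them to whichever
; DB is currently the latest import. After each import, the importer
; rewrites this line and tells PgBouncer to RELOAD its config.
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1718000000
```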
TL;DR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or its process. Otherwise, consider other options.
## separate schemas

Now that `gtfs-via-postgres` has gained the ability to import >1 GTFS datasets into 1 DB with version 4.9.0, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.

I see the following advantages:

- Hosting environments where creating an arbitrary number of DBs is not allowed are supported.
- We can get rid of the whole PgBouncer hassle (see above).
- A schema can be created or deleted within a transaction, so the process would be truly atomic. It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore (see the sketch below).
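Here is a minimal sketch of that variant, assuming the consuming service reads from a schema with the stable name `gtfs`; the schema names and the rename-based swap are my assumptions, not an existing `postgis-gtfs-importer` feature:

```sql
-- Sketch of a schema-based import. Because DDL is transactional in
-- PostgreSQL, either all of this takes effect or none of it does.
BEGIN;
CREATE SCHEMA gtfs_new;
-- … import the GTFS feed into gtfs_new here …
DROP SCHEMA IF EXISTS gtfs CASCADE; -- delete the previous import, if any
ALTER SCHEMA gtfs_new RENAME TO gtfs;
COMMIT;
-- If the import fails before COMMIT, everything rolls back and consumers
-- keep reading the previous "gtfs" schema without interruption.
```

Whether to rename the new schema into a stable name, as above, or to keep timestamped schemas and re-point the service's `search_path`, is a design choice; the rename keeps consumers' queries unchanged.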
However, there are disadvantages:

- … `gtfs-via-postgres` won't be possible.