public-transport / gtfs-via-postgres

Process GTFS Static/Schedule by importing it into a PostgreSQL database.
https://github.com/derhuerst/gtfs-via-postgres#gtfs-via-postgres

support importing >1 GTFS datasets into one Postgres DB #51

Closed danthonywalker closed 11 months ago

danthonywalker commented 1 year ago

In the documentation comparing this project to gtfs-sql-importer it says this:

While gtfs-sql-importer is designed to import multiple versions of a GTFS dataset in an idempotent fashion, gtfs-via-postgres assumes that one (version of a) GTFS dataset is imported into one DB exactly once.

I'd like to use multiple GTFS datasets for my application. It looks like my only real means of accomplishing this would be to create multiple databases, 1 for each dataset.

Would it be possible to optionally pass an argument giving the dataset an ID, thus allowing multiple datasets to co-exist in the same database? This would massively simplify some infrastructure needs for my application.

Edit: node-gtfs allows using a prefix to prepend onto IDs to ensure uniqueness across different agencies. Would it be possible to do this with this project?

Side question: What happens when you try to import different versions of the same dataset into the same database currently? Does a reimport clear previous data?

derhuerst commented 1 year ago

I'd like to use multiple GTFS datasets for my application. It looks like my only real means of accomplishing this would be to create multiple databases, 1 for each dataset.

Indeed, this is the case right now. – Note that "database" here refers to a database within a PostgreSQL database server (RDBMS), and one PostgreSQL server can contain multiple databases simultaneously.
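The one-database-per-feed setup described above can be sketched as follows; the database names are illustrative, and the assumption is simply that each gtfs-via-postgres import targets one of them (e.g. via the standard PGDATABASE environment variable):

```sql
-- One database per GTFS feed, all hosted on the same PostgreSQL server:
CREATE DATABASE gtfs_vbb_2023_10;
CREATE DATABASE gtfs_vbb_2023_11;
-- Each import then connects to exactly one of these databases,
-- e.g. with PGDATABASE=gtfs_vbb_2023_10 set for the importing process.
```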

This design has some major advantages:

However, the use case of comparing GTFS datasets (or computing aggregated data from them) seems to be important enough to warrant additional complexity in gtfs-via-postgres.

(The one-DB-per-GTFS advantages listed above already have a major caveat: the generated SQL is not entirely DB-scoped, because with --postgraphile and/or --postgrest, gtfs-via-postgres creates roles, which AFAIK exist outside of individual databases.)
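The caveat about roles can be illustrated with a minimal sketch (the role name web_anon is hypothetical): in PostgreSQL, roles are defined per cluster, not per database, so a role created during one import is visible from every database on the same server.

```sql
-- Roles live at the cluster level, not inside a single database:
CREATE ROLE web_anon NOLOGIN;
-- After connecting to any other database on this server (\c other_db),
-- web_anon is still visible there, so a second import that also runs
-- CREATE ROLE web_anon would fail with "role already exists".
```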


Would it be possible to optionally pass an argument giving the dataset an ID, thus allowing multiple datasets to co-exist in the same database? This would massively simplify some infrastructure needs for my application.

Edit: node-gtfs allows using a prefix to prepend onto IDs to ensure uniqueness across different agencies. Would it be possible to do this with this project?

I strongly hesitate to modify IDs in the GTFS data. I would like gtfs-via-postgres to keep an "open world design", where one can easily import additional files & fields – possibly proprietary (e.g. from NTFS), or an official GTFS feature that gtfs-via-postgres doesn't support (yet). To support such extensions, GTFS IDs need to be imported as-is.

However, I have an alternative technical approach in mind allowing multiple GTFS feeds per DB: gtfs-via-postgres can already import into a schema (which is similar to a namespace) other than the default one (public). This feature is currently intended to be used in combination with --postgraphile & --postgrest, allowing users to only expose the "public" consumer-facing interface via GraphQL or REST, respectively. Why not ensure that >1 GTFS imports can co-exist, each in its own schema?

(If you know of other possible designs, don't hesitate to mention them here! I haven't thought about this topic much so far.)

Side question: What happens when you try to import different versions of the same dataset into the same database currently? Does a reimport clear previous data?

Currently, re-importing will fail because the generated SQL is not idempotent: some "structure" (function, table, etc.) has already been created.

derhuerst commented 1 year ago

cc @hbruch

danthonywalker commented 1 year ago

However, I have an alternative technical approach in mind allowing multiple GTFS feeds per DB: gtfs-via-postgres can already import into a schema (which is similar to a namespace) other than the default one (public). This feature is currently intended to be used in combination with --postgraphile & --postgrest, allowing users to only expose the "public" consumer-facing interface via GraphQL or REST, respectively. Why not ensure that >1 GTFS imports can co-exist, each in its own schema?

This seems like a reasonable approach to the problem within the constraints and explanations you have given. I think it would work.

derhuerst commented 1 year ago

I think gtfs-via-postgres should prevent doing two imports (into separate schemas) with different versions of gtfs-via-postgres, defaulting to the safe side of this trade-off rather than exposing users to arbitrarily subtle compatibility problems (e.g. some table having either 0 or >0 entries). My proposal is to only allow >1 imports with the exact same version of gtfs-via-postgres. What do you think?
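One way such a version guard could work is a sketch along these lines; the table name, the DO block, and the hard-coded version string are all hypothetical, not the tool's actual mechanism:

```sql
-- Hypothetical: record which importer version created this database,
-- and abort a later import done with a different version.
CREATE TABLE IF NOT EXISTS _gtfs_importer_version (
	version TEXT NOT NULL
);

DO $$
DECLARE
	v TEXT;
BEGIN
	SELECT version INTO v FROM _gtfs_importer_version;
	IF v IS NOT NULL AND v <> '4.9.0' THEN
		RAISE EXCEPTION
			'DB was populated by gtfs-via-postgres %, refusing to import with %',
			v, '4.9.0';
	END IF;
END $$;
```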

danthonywalker commented 1 year ago

This would work for my application, since I manage my databases with Flyway. If I need to update gtfs-via-postgres, I can have deploys drop the previous schemas, and the application will rebuild them.

smohiudd commented 1 year ago

However, I have an alternative technical approach in mind allowing multiple GTFS feeds per DB: gtfs-via-postgres can already import into a schema (which is similar to a namespace) other than the default one (public).

I like the idea of multiple schemas to support multiple gtfs imports. Does the current version of gtfs-via-postgres support this already? Or will specifying a schema still create a new db?

derhuerst commented 1 year ago

Currently, importing into two different schemas will fail because some things are being created outside the schema in a non-idempotent way.

derhuerst commented 9 months ago

I have finally published the support for importing >1 datasets into one database in gtfs-via-postgres@4.9.0. 🎉