Closed · danthonywalker · closed 11 months ago
I'd like to use multiple GTFS datasets for my application. It looks like my only real means of accomplishing this would be to create multiple databases, 1 for each dataset.
Indeed, this is the case right now. – Note that "database" here refers to a database within a PostgreSQL database server (RDBMS), and one PostgreSQL server can contain multiple databases simultaneously.
This design has some major advantages:

- `gtfs-via-postgres` doesn't have to query the database in any way; it only has to write SQL statements to stdout. This makes the code much simpler.
- The DB structures created by one `gtfs-via-postgres@4` version do not have to be kept backwards-compatible with those of prior `gtfs-via-postgres@4` versions. (With the "public" consumer-facing interface, I still promise and try to follow semantic versioning, of course!)
- `gtfs-via-postgres` can be a lot simpler (and thus likely have fewer defects and edge cases), as it does not have to work in an idempotent fashion.

However, the use case of comparing GTFS datasets (or computing aggregated data from them) seems to be important enough to warrant additional complexity in `gtfs-via-postgres`.
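The write-only design can be illustrated with a toy stand-in: the importer is just a program that prints SQL, and the database side is nothing more than a pipe into `psql`. The `emit_sql` function below is made up for illustration; it is not part of `gtfs-via-postgres`.

```shell
# Toy illustration of the write-to-stdout design: the importer never talks to
# the database itself, it only prints SQL statements. "emit_sql" is a stand-in
# for the real CLI; in real use you would pipe its output into psql.
emit_sql() {
  printf 'CREATE TABLE stops (stop_id TEXT PRIMARY KEY, stop_name TEXT);\n'
  printf "INSERT INTO stops VALUES ('S1', 'Main St');\n"
}

emit_sql   # in real use: <importer> -- gtfs/*.txt | psql -b
```

Because the tool never reads from the database, it needs no connection handling, no migration logic, and no knowledge of what a previous import left behind.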
(Even now, the one-DB-per-GTFS-dataset advantages listed above have a major caveat: with `--postgraphile` and/or `--postgrest`, `gtfs-via-postgres` creates roles, which AFAIK exist outside of individual databases, so the generated SQL is not DB-scoped.)
Would it be possible to optionally pass an argument giving the dataset an ID, thus allowing multiple datasets to co-exist in the same database? This would massively simplify some infrastructure needs for my application.
Edit: node-gtfs allows prepending a prefix onto IDs to ensure uniqueness across different agencies. Would it be possible to do this with this project?
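The node-gtfs-style prefixing asked about above can be sketched as a plain text transformation over a GTFS file. The feed prefix `nyc:` and the two-column `stops.txt` are made up for illustration; as discussed below, this is *not* what `gtfs-via-postgres` does.

```shell
# Sketch of node-gtfs-style ID prefixing: prepend a per-feed prefix to each
# stop_id so that rows from several feeds stay unique in one shared table.
# The feed prefix "nyc:" and this minimal stops.txt are illustrative only.
printf 'stop_id,stop_name\nS1,Main St\nS2,Oak Ave\n' > stops.txt

# Keep the header line, prefix the ID column of every data row.
awk -F, 'NR==1 {print; next} {print "nyc:" $1 "," $2}' stops.txt
```

The downside of this approach is exactly what the next reply points out: every ID in every referencing file (`stop_times.txt`, `transfers.txt`, …) must be rewritten consistently, and the stored IDs no longer match the source feed.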
I strongly hesitate to modify IDs in the GTFS data. I would like `gtfs-via-postgres` to keep an "open world design", where one can easily import additional files & fields – possibly proprietary (e.g. from NTFS), or an official GTFS feature that `gtfs-via-postgres` doesn't support (yet). To support such extensions, GTFS IDs need to be imported as-is.
However, I have an alternative technical approach in mind allowing multiple GTFS feeds per DB: `gtfs-via-postgres` can already import into a schema (which is similar to a namespace) other than the default one (`public`). This feature is currently intended to be used in combination with `--postgraphile` & `--postgrest`, allowing users to only expose the "public" consumer-facing interface via GraphQL or REST, respectively. Why not ensure that >1 GTFS imports can co-exist, each in its own schema?
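The schema-per-feed idea can be sketched in plain SQL: each feed gets its own schema, so identical table names (`stops`, `routes`, …) coexist in one database without any ID rewriting. The schema name `nyc` and the table layout below are illustrative, not `gtfs-via-postgres`'s actual DDL.

```shell
# Sketch of the schema-per-feed approach. The schema name "nyc" and the
# minimal table definition are made up; gtfs-via-postgres's real DDL differs.
cat > import-nyc.sql <<'SQL'
CREATE SCHEMA IF NOT EXISTS nyc;
SET search_path TO nyc;
CREATE TABLE stops (stop_id TEXT PRIMARY KEY, stop_name TEXT);
SQL

# psql -b -f import-nyc.sql   # repeat with e.g. a "berlin" schema for a second feed
```

Queries can then address each feed explicitly (`SELECT * FROM nyc.stops`) or combine feeds with `UNION ALL`, while every feed keeps its original, unmodified GTFS IDs.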
(If you know of other possible designs, don't hesitate to mention them here! I haven't thought about this topic much so far.)
Side question: What happens when you try to import different versions of the same dataset into the same database currently? Does a reimport clear previous data?
Currently, re-importing will fail because the generated SQL is not idempotent: some "structure" (a function, table, etc.) will already have been created.
cc @hbruch
> However, I have an alternative technical approach in mind allowing multiple GTFS feeds per DB: gtfs-via-postgres can already import into a schema (which is similar to a namespace) other than the default one (public). This feature is currently intended to be used in combination with --postgraphile & --postgrest, allowing users to only expose the "public" consumer-facing interface via GraphQL or REST, respectively. Why not ensure that >1 GTFS imports can co-exist, each in its own schema?
This seems like a reasonable approach to the problem within the constraints and explanations you have given. I think it would work.
I think `gtfs-via-postgres` should prevent doing two imports (into separate schemas) with different versions, defaulting to the safe side of this trade-off rather than exposing users to arbitrarily subtle compatibility problems (e.g. some table having either 0 or >0 entries). My proposal is to only allow >1 imports with the exact same version of `gtfs-via-postgres`. What do you think?
This would work for my application, since I manage my databases with Flyway. If I need to update `gtfs-via-postgres`, I can just have it drop the previous schemas on deploys, and the application will rebuild them.
> However, I have an alternative technical approach in mind allowing multiple GTFS feeds per DB: gtfs-via-postgres can already import into a schema (which is similar to a namespace) other than the default one (public).
I like the idea of multiple schemas to support multiple GTFS imports. Does the current version of `gtfs-via-postgres` support this already? Or will specifying a schema still create a new DB?
Currently, importing into two different schemas will fail because some things are being created outside the schema in a non-idempotent way.
I have finally published the support for >1 datasets per database in `gtfs-via-postgres@4.9.0`. 🎉
In the documentation comparing this project to gtfs-sql-importer, it says this: