Open Subzidion opened 5 years ago
It is called SQL ;-) and I think it does exist via GTFSdb.
GTFSdb is all Python, nothing that's just SQL. I was thinking something more along the lines of what OneBusAway does with Hibernate, but usable in any language.
GTFSdb produces SQL tables in different flavors. You could use that in your agnostic definition.
Using GTFSdb for this would still require updating the Python code to reflect any changes to the spec, running it to create your database schema, then using some other tool to map from the schema to classes in whatever language you want to use. Feels a bit cumbersome. I understand there's probably some class representation for most languages already created, but needing to check and hope they get updated if the spec gets changed seems annoying. If there were some definition, similar to the Hibernate one, that could be used in any language, it would make any of those language-specific GTFS-Static ORMs a lot easier.
@Subzidion There isn't anything official, but the closest thing I'm aware of in concept to what you're looking for is this Data Package specification:
I started generalizing this to any GTFS: https://github.com/CUTR-at-USF/GTFS
I think I have some work stashed somewhere beyond what's currently in the above branch...
This Data Package Specification is exactly what I was looking for. Is there any way we can make this specification a part of the main GTFS package? I would think defining GTFS in terms of the JSON schema would help clarify ambiguity instead of attempting to define the JSON schema from markdown.
@Subzidion It's certainly possible.
Is there anyone else interested in this type of programmatically-readable schema definition for GTFS?
Note that there is a proposal and discussion related to GTFS schemas happening at https://github.com/google/transit/pull/244.
Similar to the discussion in #244, I strongly bias towards achieving consensus on the problem statement before proposing a standard.
It sounds like the core objective is to further codify GTFS to meet these criteria (broken into bullets for easier visual parsing):
It would be helpful to know if that's an accurate characterization, or if criteria like human-readable or CSV-specific need to be appended.
@devadvance:
criteria like human-readable
I think human readability is an important consideration for transparency and maintainability since changes to the spec will be represented and discussed within the context of a pull-request and vote.
such that GTFS data can be processed and stored
AND
I am new to this community, so there are likely many nuances that I am missing. However, I have some questions and comments about this issue. For some background, I am working, in collaboration with CALTRANS, on an application to support V1 of the GTFS "Grading Standard."
A canonical, machine-readable version of the standard would be most helpful. I would like to be able to abstract away as much of the standard as possible from this application, so I don't need to make updates to this application each time the standard changes. Having a machine-readable version of the standard, with at least an agreed-upon format/structure, defined types, and enumerations, would be central to this goal.
So I have a couple of questions:
I also wanted to voice my support for developing a canonical JSON-Schema for the following reasons: it has already been developed (assuming the draft is up to date and there aren't any significant issues with it), it is the most widely used of the proposed standards, and to my knowledge it supports most of the objectives that have been mentioned so far. However, as I said, there are likely many nuances that I am missing.
More importantly, I wanted to express that not having a machine-readable version of the standard creates a significant issue right at the beginning of any new development effort: How do I model the standard and how do I keep that model up to date as it evolves? Providing a schema file would help alleviate many of these issues and free up developer time for other work.
@wesleyi23 do you have any additions/mods to @devadvance 's summary of the problem statement ?
@e-lo and @devadvance the only thing I would add is that it would be nice if the solution included not only the correct structure, syntax, and relationships of GTFS static data, but also the file and field descriptions. I would make a pitch that these be represented in an HTML format, because there are some ordered lists, paragraphs, and other similar items. I think this would provide a more or less complete reproduction of the current standard documents.
Does anyone have any additions/edits to the following problem statement?
BTW - I saw that @stephen-gates started developing a Frictionless data package for GTFS and would be curious why it seems to have been abandoned?
@e-lo My understanding is that https://github.com/Stephen-Gates/GTFS was created specifically for validating the South East Queensland GTFS data and was never intended to be a canonical schema for the general GTFS spec. For example, some of the location constraints defined for stop location lat/longs are specific to Queensland.
I started expanding Stephen's work in this branch to represent the entire spec a while back, but other priorities pulled my attention away: https://github.com/CUTR-at-USF/GTFS/tree/full-spec
You can see my changes in these two commits:
Here were the remaining TODOs I noted in 2016:
- Review TODOs and FIXMEs - some constraints will break extensibility
- Add missing files
If someone would want to pick up this work I'd certainly welcome the contribution.
@barbeau Awesome, and thanks for the background. In your opinion, is Frictionless "the right" spec for achieving the objectives above? My main hesitation is the lack of recent progress/movement in the organization.
@wesleyi23 It seems like Sean's repo is a good place to start.
@e-lo It looked very promising to me, and the above work was mainly an experiment to see if it panned out. Unfortunately I don't have any experience with frictionless outside of the above so I can't say for sure.
@e-lo @barbeau At the moment I have a need for a schema document, so I am happy to put time in to developing one further.
Sean I reviewed your repo and I agree it could be a good place to start. There is also @LeoFrachet JSON-Schema file referenced in #244 which would also make a good starting place. As a new community member, I have no context or background to weigh the pros or cons of JSON-Schema vs Frictionless.
From a technical perspective they both appear to satisfy the identified problem statement, unless there is something I am missing.
Any guidance on which path to follow would be greatly appreciated.
Why not just use a SQL database definition file? This can include unique constraints, foreign key constraints, enums, and so on.
GTFS static essentially defines a relational database schema; so to me it would seem the most natural choice but I may be missing something.
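To make the SQL suggestion concrete, here is a minimal sketch of what such a definition file might contain, run through Python's built-in `sqlite3` module. The column subset for `agency.txt` and `routes.txt`, the use of SQLite, and the `create_database` and `try_insert` helpers are all illustrative assumptions, not an agreed schema.

```python
import sqlite3

# Illustrative subset of agency.txt and routes.txt; not a complete or official schema.
GTFS_DDL = """
CREATE TABLE agency (
    agency_id       TEXT PRIMARY KEY,
    agency_name     TEXT NOT NULL,
    agency_url      TEXT NOT NULL,
    agency_timezone TEXT NOT NULL
);
CREATE TABLE routes (
    route_id         TEXT PRIMARY KEY,
    agency_id        TEXT REFERENCES agency (agency_id),
    route_short_name TEXT,
    route_long_name  TEXT,
    -- SQLite has no ENUM type, so the enumeration is written as a CHECK.
    route_type       INTEGER NOT NULL
                     CHECK (route_type IN (0, 1, 2, 3, 4, 5, 6, 7, 11, 12))
);
"""


def create_database():
    """Create an in-memory database from the DDL above."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(GTFS_DDL)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row satisfies every constraint, False otherwise."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this sketch, a route that points at a missing `agency_id` or uses an unknown `route_type` is rejected by the database itself. Rules that depend on other files or on "either/or" conditions are harder to express in plain DDL.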
@jamespfennell one reason could be that some constraints are "OR".
@jamespfennell: To my knowledge SQL definitions aren't designed to be 'read in' as data other than for SQL, but I would be curious if somebody more familiar with the various options could evaluate this one vis-a-vis the 10 points above.
As a new community member, I have no context or background to weigh the pros or cons of JSON-Schema vs Frictionless.
My concern with the JSON-Schema is that we'd be introducing an entirely new encoding-specific concept to GTFS that doesn't currently exist there. I think it would also tempt some producers and consumers to "JSON-ize" GTFS data, and I see that further complicating an already complex ecosystem.
The Frictionless Table Schema format was designed to represent tabular data, which is the current representation/encoding of static GTFS data (CSV files in a ZIP file). IMHO it seems a better fit to the existing GTFS spec, unless there is a limitation that I don't know of.
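As a rough illustration of how a tabular schema of this kind can drive validation, the sketch below hand-writes a descriptor in the style of a Frictionless Table Schema (`fields`, `type`, `constraints`, `primaryKey`) for a few `routes.txt` columns and checks CSV text against it using only the standard library. The field selection and the `validate_csv` helper are assumptions made for illustration; this is not the Frictionless library and it covers only a handful of constraint types.

```python
import csv
import io
from typing import List

# Hand-written descriptor in the style of a Table Schema for a subset of
# routes.txt. The chosen fields are an illustrative assumption.
ROUTES_SCHEMA = {
    "fields": [
        {"name": "route_id", "type": "string",
         "constraints": {"required": True, "unique": True}},
        {"name": "route_short_name", "type": "string"},
        {"name": "route_type", "type": "integer",
         "constraints": {"required": True,
                         "enum": [0, 1, 2, 3, 4, 5, 6, 7, 11, 12]}},
    ],
    "primaryKey": "route_id",
}


def validate_csv(text: str, schema: dict) -> List[str]:
    """Return readable errors; checks only required, unique, integer and enum."""
    errors = []
    seen = {f["name"]: set() for f in schema["fields"]}
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        for field in schema["fields"]:
            name = field["name"]
            rules = field.get("constraints", {})
            raw = (row.get(name) or "").strip()
            if not raw:
                if rules.get("required"):
                    errors.append(f"line {line_no}: {name} is required")
                continue
            value = raw
            if field["type"] == "integer":
                try:
                    value = int(raw)
                except ValueError:
                    errors.append(f"line {line_no}: {name} is not an integer")
                    continue
            if "enum" in rules and value not in rules["enum"]:
                errors.append(f"line {line_no}: {name}={raw} is not an allowed value")
            if rules.get("unique"):
                if value in seen[name]:
                    errors.append(f"line {line_no}: {name}={raw} is duplicated")
                seen[name].add(value)
    return errors
```

Because the descriptor is plain data, the same dictionary could be serialized to JSON and read by tools in other languages, which is the property this thread is after.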
Why not just use a SQL database definition file? This can include unique constraints, foreign key constraints, enums, and so on.
@jamespfennell Could you give me an example for a table in GTFS?
As @skinkie says, there are some situations that won't be easy to model, like service_id in calendar_dates.txt, which in some cases is a primary key but in others is a foreign key (potentially within the same GTFS dataset, as evidenced by https://github.com/MobilityData/gtfs-validator/issues/397):
https://github.com/google/transit/blob/master/gtfs/spec/en/reference.md#calendar_datestxt
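One way to read this "either/or" relationship is that a `service_id` used in `trips.txt` must be defined in `calendar.txt` or in `calendar_dates.txt`, or in both. A minimal sketch of that check in Python, assuming the IDs have already been read from the three files (the function name is hypothetical):

```python
from typing import Iterable, Set


def undefined_service_ids(
    trip_service_ids: Iterable[str],
    calendar_service_ids: Iterable[str],
    calendar_dates_service_ids: Iterable[str],
) -> Set[str]:
    """Return service_ids used by trips that neither calendar file defines."""
    # A service_id is valid if it appears in either file, so take the union.
    defined = set(calendar_service_ids) | set(calendar_dates_service_ids)
    return set(trip_service_ids) - defined
```

A single SQL foreign key cannot express this union directly, which is part of why a schema language with richer relationship rules keeps coming up in this discussion.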
After talking with folks, I am starting work to update and expand the Frictionless Schema Stephen created for Queensland. I have forked @barbeau's branch and will be working on it here: https://github.com/wesleyi23/GTFS-Frictionless.
Thank you to @wesleyi23 for creating a fairly complete definition of GTFS here: https://github.com/wesleyi23/GTFS-Frictionless
It would be great if all who are interested could add issues, contribute to, and improve this definition.
I'm also interested in whether the community would be amenable to using this type of definition as the canonical GTFS definition, such that we can generate Markdown/HTML from the programmatic definition in JSON rather than vice versa.
I found that the old version of the schemas did not work well with popular schema validators, so I've started converting them to JSON schema v7. I've created a simple Nx app that converts .txt or .csv files into JSON objects and then validates them in the browser. My UI abilities are a bit lacking; currently the results are in console logs, but you can see my progress here: MuckT/gtfs-tools
As of writing this I have only rewritten the agency.txt schema; any help in the UI or schema development would be appreciated.
@MuckT - @LeoFrachet developed a full JSON Schema for GTFS which is in the PR linked to this issue. See the discussion in that PR and the above problem statement for why Frictionless seemed to fit the bill better.
Note that validators are fairly easy to create once the schema is in a parsable format. You can also use goodtables.io to do data validation "as a service" in frictionless' format.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@github-actions Don't close.
Really interesting discussion!
Does anyone have any additions/edits to the following problem statement https://github.com/google/transit/issues/127#issuecomment-735904862
I’d like to add something that was in the original issue as well: that it should be easy, once the spec was processed, for a system using the spec to keep in sync with the latest additions to the spec. If an optional field was added for example, I’d want my codebase to create that new class on the next run, or I’d want a JSON schema I defined to add that property.
My own case: I created an RDF/Linked Data vocabulary for GTFS back in 2015. Today it’s horribly out of date, but we just started updating it to the latest spec: https://github.com/OpenTransport/linked-gtfs/pull/20
I wonder whether we should re-iterate the problem scope towards: what programmatic description should we use so that everyone can keep up to date the technology-specific schema they use to import or validate GTFS static files? If this problem were solved, we could also have automatic translations towards commonly used schema languages like JSON and XML Schema, SQL, protobuf, RDF/SHACL/ShEx, etc.
The problem is thus not choosing the one schema language to rule them all, but it is choosing the one that will best express the things decided in the GTFS specification, so that it can be automatically translated to all others.
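A small sketch of that idea in Python: one neutral definition of a file, translated mechanically into two target schema languages. The shape of `STOPS_DEFINITION`, its key names, and both translator functions are hypothetical, and the `required` flags are simplified compared with the conditional rules in the real spec.

```python
import os

# Hypothetical neutral definition of one GTFS file; the key names are made up
# for this example and the "required" flags are simplified.
STOPS_DEFINITION = {
    "file": "stops.txt",
    "fields": [
        {"name": "stop_id", "type": "string", "required": True},
        {"name": "stop_name", "type": "string", "required": False},
        {"name": "stop_lat", "type": "number", "required": False},
        {"name": "stop_lon", "type": "number", "required": False},
    ],
}

# Mapping from the neutral type names to SQL column types.
SQL_TYPES = {"string": "TEXT", "number": "REAL", "integer": "INTEGER"}


def to_sql(definition: dict) -> str:
    """Translate the neutral definition into a CREATE TABLE statement."""
    table = os.path.splitext(definition["file"])[0]
    columns = []
    for field in definition["fields"]:
        null_rule = " NOT NULL" if field["required"] else ""
        columns.append(f'{field["name"]} {SQL_TYPES[field["type"]]}{null_rule}')
    return f"CREATE TABLE {table} ({', '.join(columns)});"


def to_json_schema(definition: dict) -> dict:
    """Translate the same definition into a JSON Schema object for one row."""
    return {
        "type": "object",
        "properties": {f["name"]: {"type": f["type"]} for f in definition["fields"]},
        "required": [f["name"] for f in definition["fields"] if f["required"]],
    }
```

When a field is added to the definition, both outputs change on the next run without anyone editing the SQL or the JSON Schema by hand.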
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been closed due to inactivity. Issues can always be reopened after they have been closed.
This is still relevant.
Ack! Still relevant.
Wonder if we could use https://linkml.io for this. Seems to do what I described above.
Re-opening :)
📢 The participants in this conversation might want to look at issue #391 to discuss adding the GeoJSON format in GTFS as part of the GTFS-Flex extension proposal.
I was curious if there was a language-agnostic programmatic definition, similar to the proto definition for GTFS-Realtime, but for GTFS-Static, that can be used with an ORM for creating class representations of the text files. This would simplify the coding process, and would eliminate the need to write code to represent each text file and possibly to update that code if/when the spec changes.
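The workflow described here could look roughly like the following Python sketch, where a class is generated at runtime from a machine-readable definition instead of being written by hand. `AGENCY_DEFINITION` and its key names are hypothetical stand-ins for whatever canonical format might be adopted, and only a subset of `agency.txt` fields is shown.

```python
import csv
import dataclasses
import io
from typing import Optional

# Hypothetical machine-readable description of agency.txt (subset of fields).
AGENCY_DEFINITION = {
    "class_name": "Agency",
    "fields": [
        {"name": "agency_name", "required": True},
        {"name": "agency_url", "required": True},
        {"name": "agency_timezone", "required": True},
        {"name": "agency_id", "required": False},
    ],
}


def build_class(definition: dict) -> type:
    """Create a dataclass whose attributes mirror the definition."""
    specs = []
    # Dataclasses need fields without defaults first, so required ones lead.
    ordered = sorted(definition["fields"], key=lambda f: not f["required"])
    for f in ordered:
        if f["required"]:
            specs.append((f["name"], str))
        else:
            specs.append((f["name"], Optional[str], dataclasses.field(default=None)))
    return dataclasses.make_dataclass(definition["class_name"], specs)


def load_rows(cls: type, text: str) -> list:
    """Parse CSV text into instances of cls, ignoring unknown columns."""
    names = {f.name for f in dataclasses.fields(cls)}
    return [
        cls(**{k: v for k, v in row.items() if k in names})
        for row in csv.DictReader(io.StringIO(text))
    ]
```

If an optional field were later added to the definition, `build_class` would expose it as a new attribute on the next run with no change to the calling code.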