google / transit

https://gtfs.org/
Apache License 2.0

Language agnostic programmatic definition of GTFS-Static #127

Open Subzidion opened 5 years ago

Subzidion commented 5 years ago

I was curious whether there is a language-agnostic programmatic definition for GTFS-Static, similar to the proto definition for GTFS-Realtime, that can be used with an ORM to create a class representation of the text files. This would simplify the coding process and would eliminate the need to write code to represent each text file, and possibly to update that code if/when the spec changes.

skinkie commented 5 years ago

It is called SQL ;-) and I think it does exist via GTFSdb.

Subzidion commented 5 years ago

GTFSdb is all Python, nothing that's just SQL. I was thinking something more along the lines of what OneBusAway does with Hibernate, but usable in any language.

skinkie commented 5 years ago

GTFSdb produces SQL tables in different flavors. You could use that in your agnostic definition.

Subzidion commented 5 years ago

Using GTFSdb for this would still require updating the Python code to reflect any changes to the spec, running it to create your database schema, then using some other tool to map from the schema to classes in whatever language you want to use. Feels a bit cumbersome. I understand there's probably some class representation already created for most languages, but needing to check and hope they get updated if the spec changes seems annoying. If there were some definition, similar to the Hibernate one, that could be used in any language, it would make any of those language-specific GTFS-Static ORMs a lot easier.

barbeau commented 5 years ago

@Subzidion There isn't anything official, but the closest thing I'm aware of in concept to what you're looking for is this Data Package specification:

I started generalizing this to any GTFS: https://github.com/CUTR-at-USF/GTFS

I think I have some work stashed somewhere beyond what's currently in the above branch...

Subzidion commented 5 years ago

This Data Package Specification is exactly what I was looking for. Is there any way we can make this specification a part of the main GTFS package? I would think defining GTFS in terms of the JSON schema would help clarify ambiguity, instead of attempting to define the JSON schema from markdown.

barbeau commented 5 years ago

@Subzidion It's certainly possible.

Is there anyone else interested in this type of programmatically-readable schema definition for GTFS?

barbeau commented 4 years ago

Note that there is a proposal and discussion related to GTFS schemas happening at https://github.com/google/transit/pull/244.

devadvance commented 4 years ago

Similar to the discussion in #244, I strongly bias towards achieving consensus on the problem statement before proposing a standard.

It sounds like the core objective is to further codify GTFS to meet these criteria (broken into bullets for easier visual parsing):

It would be helpful to know if that's an accurate characterization, or if criteria like human-readable or CSV-specific need to be appended.

e-lo commented 4 years ago

@devadvance:

criteria like human-readable

I think human readability is an important consideration for transparency and maintainability since changes to the spec will be represented and discussed within the context of a pull-request and vote.

such that GTFS data can be processed and stored

AND

wesleyi23 commented 3 years ago

I am new to this community, so there are likely many nuances that I am missing. However, I have some questions and comments about this issue. For some background, I am working, in collaboration with CALTRANS, on an application to support V1 of the GTFS "Grading Standard."

A canonical, machine-readable version of the standard would be most helpful. I would like to be able to abstract away as much of the standard as possible from this application, so I don't need to make updates to this application each time the standard changes. Having a machine-readable version of the standard, with at least an agreed-upon format/structure, defined types, and enumerations, would be central to this goal.

So I have a couple of questions:

I also wanted to voice my support for developing a canonical JSON-Schema for the following reasons: it has already been developed (assuming the draft is up to date and there aren't any significant issues with it), it is the most widely used of the proposed standards, and to my knowledge it supports most of the objectives that have been mentioned so far. However, as I said, there are likely many nuances that I am missing.

More importantly, I wanted to express that not having a machine-readable standard creates a significant issue right at the beginning of any new development effort: How do I model the standard, and how do I keep that model up to date as it evolves? Providing a schema file would help alleviate many of these issues and free up developer time for other work.

e-lo commented 3 years ago

@wesleyi23 do you have any additions/mods to @devadvance 's summary of the problem statement ?

wesleyi23 commented 3 years ago

@e-lo and @devadvance the only thing I would add is that it would be nice if the solution included not only the correct structure, syntax, and relationships of GTFS static data, but also the file and field descriptions. I would make a pitch that these be represented in an HTML format, because there are some ordered lists, paragraphs, and other similar items. I think this would provide a more or less complete reproduction of the current standard documents.

e-lo commented 3 years ago

Does anyone have any additions/edits to the following problem statement?

  1. Machine-readable instructions that specify
  2. in a language-agnostic, storage-agnostic manner
  3. that is relatively standardized itself (such that there are existing tools and a potential ecosystem for testing as well as rendering in a "front-end" form)
  4. is human legible in its native form (to allow for easy git-diffs + increase likelihood of catching errors)
  5. which articulates the correct structure, syntax, bounds, and relationships
  6. of GTFS static data
  7. as well as the file and field descriptions
  8. such that GTFS data can be processed and stored
  9. and validated
  10. in a backwards and forwards compatible manner

e-lo commented 3 years ago

BTW - I saw that @stephen-gates started developing a Frictionless data package for GTFS and would be curious why it seems to have been abandoned?

barbeau commented 3 years ago

@e-lo My understanding is that https://github.com/Stephen-Gates/GTFS was created specifically for validating the South East Queensland GTFS data and was never intended to be a canonical schema for the general GTFS spec. For example, some of the location constraints defined for stop location lat/longs are specific to Queensland.

I started expanding Stephen's work in this branch to represent the entire spec a while back, but other priorities pulled my attention away: https://github.com/CUTR-at-USF/GTFS/tree/full-spec

You can see my changes in these two commits:

Here were the remaining TODOs I noted in 2016:

  • Review TODOs and FIXMEs - some constraints will break extensibility
  • Add missing files

If someone would want to pick up this work I'd certainly welcome the contribution.

e-lo commented 3 years ago

@barbeau Awesome, and thanks for the background. In your opinion, is Frictionless "the right" spec for achieving the objectives above? My main hesitation is the lack of recent progress/movement in the organization.

@wesleyi23 It seems like Sean's repo is a good place to start.

barbeau commented 3 years ago

@e-lo It looked very promising to me, and the above work was mainly an experiment to see if it panned out. Unfortunately I don't have any experience with frictionless outside of the above so I can't say for sure.

wesleyi23 commented 3 years ago

@e-lo @barbeau At the moment I have a need for a schema document, so I am happy to put time in to developing one further.

Sean, I reviewed your repo and I agree it could be a good place to start. There is also @LeoFrachet's JSON-Schema file referenced in #244, which would also make a good starting place. As a new community member, I have no context or background to weigh the pros or cons of JSON-Schema vs Frictionless.

From a technical perspective, they both appear to satisfy the identified problem statement, unless there is something I am missing.

Any guidance on which path to follow would be greatly appreciated.

jamespfennell commented 3 years ago

Why not just use a SQL database definition file? This can include unique constraints, foreign key constraints, enums, and so on.

GTFS static essentially defines a relational database schema; so to me it would seem the most natural choice but I may be missing something.
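
As a rough illustration of that idea, here is a minimal sketch using Python's built-in `sqlite3` module. The two-table subset, the choice of `agency_id` as a primary key, and the `CHECK` list for `route_type` are illustrative assumptions, not an official GTFS schema; `open_db` and `try_insert` are hypothetical helper names.

```python
import sqlite3

# Hypothetical, heavily trimmed DDL for two GTFS files; not an official schema.
DDL = """
CREATE TABLE agency (
    agency_id       TEXT PRIMARY KEY,
    agency_name     TEXT NOT NULL,
    agency_url      TEXT NOT NULL,
    agency_timezone TEXT NOT NULL
);
CREATE TABLE routes (
    route_id         TEXT PRIMARY KEY,
    agency_id        TEXT REFERENCES agency (agency_id),
    route_short_name TEXT,
    -- SQLite has no ENUM type, so a CHECK constraint stands in for one.
    route_type       INTEGER NOT NULL
        CHECK (route_type IN (0, 1, 2, 3, 4, 5, 6, 7, 11, 12))
);
"""


def open_db():
    """Create an in-memory database with the sketch schema loaded."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, a route that points at an unknown `agency_id`, reuses a `route_id`, or carries a `route_type` outside the list is rejected by the database itself. Conditional rules, such as a field being required only when another file is present, are not expressed here.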

skinkie commented 3 years ago

@jamespfennell one reason could be that some constraints are "OR".

e-lo commented 3 years ago

@jamespfennell: To my knowledge, SQL definitions aren't designed to be 'read in' as data other than for SQL – but I would be curious if somebody more familiar with various options could evaluate this option vis-a-vis the 10 points above.

barbeau commented 3 years ago

As a new community member, I have no context or background to weigh the pros or cons of JSON-Schema vs Frictionless.

My concern with the JSON-Schema is that we'd be introducing an entirely new encoding-specific concept to GTFS that doesn't currently exist there. I think it would also tempt some producers and consumers to "JSON-ize" GTFS data, and I see that further complicating an already complex ecosystem.

The Frictionless Table Schema format was designed to represent tabular data, which is the current representation/encoding of static GTFS data (CSV files in a ZIP file). IMHO it seems a better fit to the existing GTFS spec, unless there is a limitation that I don't know of.
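
To make that concrete, here is a minimal sketch of what a Table Schema-style descriptor for part of agency.txt could look like, written as a Python dict, with a small hand-rolled check using only the standard library. The descriptor is an illustrative assumption rather than a copy of any of the repositories mentioned in this thread, and `check_rows` is a hypothetical helper that covers only required values and primary-key uniqueness; it is not the Frictionless tooling.

```python
import csv
import io

# Hypothetical Table Schema-style descriptor for a subset of agency.txt.
AGENCY_SCHEMA = {
    "fields": [
        {"name": "agency_id", "type": "string"},
        {"name": "agency_name", "type": "string",
         "constraints": {"required": True}},
        {"name": "agency_url", "type": "string", "format": "uri",
         "constraints": {"required": True}},
        {"name": "agency_timezone", "type": "string",
         "constraints": {"required": True}},
    ],
    "primaryKey": "agency_id",
}


def check_rows(csv_text, schema):
    """Return (row_number, message) pairs for the problems found.

    Only two rules are checked: required values and primary-key uniqueness.
    """
    required = [
        spec["name"]
        for spec in schema["fields"]
        if spec.get("constraints", {}).get("required")
    ]
    key = schema.get("primaryKey")
    seen = set()
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        for name in required:
            if not (row.get(name) or "").strip():
                problems.append((number, f"missing required value: {name}"))
        if key:
            value = row.get(key)
            if value in seen:
                problems.append((number, f"duplicate {key}: {value}"))
            seen.add(value)
    return problems
```

The descriptor stays plain data, so the same file could in principle drive a validator, a documentation generator, or a code generator.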

Why not just use a SQL database definition file? This can include unique constraints, foreign key constraints, enums, and so on.

@jamespfennell Could you give me an example for a table in GTFS?

As @skinkie says there are some situations that won't be easy to model, like service_id in calendar_dates.txt, which in some cases is a primary key but in others is a foreign key (potentially within the same GTFS dataset, as evidenced by https://github.com/MobilityData/gtfs-validator/issues/397): https://github.com/google/transit/blob/master/gtfs/spec/en/reference.md#calendar_datestxt
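
A minimal sketch of why this is awkward for a plain foreign key: the check has to look at two files at once. The function names below are hypothetical, and the inputs are assumed to be the `service_id` columns already read from each file.

```python
def classify_service_ids(calendar_ids, calendar_dates_ids):
    """Split calendar_dates.txt service_ids by the role they play in a feed."""
    calendar_ids = set(calendar_ids)
    dates_ids = set(calendar_dates_ids)
    return {
        # Behaves like a foreign key: modifies a service defined in calendar.txt.
        "references_calendar": sorted(dates_ids & calendar_ids),
        # Behaves like an ID in its own right: the service is defined only here.
        "defined_in_calendar_dates": sorted(dates_ids - calendar_ids),
    }


def unknown_trip_service_ids(trip_service_ids, calendar_ids, calendar_dates_ids):
    """Return trip service_ids found in neither calendar file (the "OR" rule)."""
    known = set(calendar_ids) | set(calendar_dates_ids)
    return sorted(set(trip_service_ids) - known)
```

For example, `classify_service_ids(["weekday"], ["weekday", "holiday"])` puts `weekday` in the first group and `holiday` in the second, so both roles can occur within one dataset.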

wesleyi23 commented 3 years ago

After talking with folks, I am starting work to update and expand the Frictionless schema that Stephen created for Queensland. I have forked @barbeau's branch and will be working on it here: https://github.com/wesleyi23/GTFS-Frictionless.

e-lo commented 3 years ago

Thank you to @wesleyi23 for creating a fairly complete definition of GTFS here: https://github.com/wesleyi23/GTFS-Frictionless

It would be great if all who are interested could add issues, contribute to, and improve this definition.

I'm also interested in whether the community would be amenable to using this type of definition as the canonical GTFS definition, such that we can generate Markdown/HTML from the programmatic definition in JSON rather than vice versa.

MuckT commented 3 years ago

I found that the old version of the schemas did not work well with popular schema validators, so I've started converting them to JSON schema v7. I've created a simple Nx app that converts .txt or .csv files into JSON objects and then validates them in the browser. My UI abilities are a bit lacking, so currently the results are in console logs, but you can see my progress here: MuckT/gtfs-tools

As of writing this I have only rewritten the agency.txt schema; any help in the UI or schema development would be appreciated.

e-lo commented 3 years ago

@MuckT - @LeoFrachet developed a full JSON Schema for GTFS, which is in the PR linked to this issue. See the discussion in that PR and the above problem statement for why Frictionless seemed to fit the bill better.

Note that validators are fairly easy to create once the schema is in a parsable format. You can also use goodtables.io to do data validation "as a service" in frictionless' format.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

derhuerst commented 2 years ago

@github-actions Don't close.

pietercolpaert commented 2 years ago

Really interesting discussion!

Does anyone have any additions/edits to the following problem statement https://github.com/google/transit/issues/127#issuecomment-735904862

I’d like to add something that was in the original issue as well: that it should be easy, once the spec was processed, for a system using the spec to keep in sync with the latest additions to the spec. If an optional field was added for example, I’d want my codebase to create that new class on the next run, or I’d want a JSON schema I defined to add that property.
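
A minimal sketch of that workflow in Python, assuming the spec were published as a Table Schema-style descriptor (the dict layout below is an assumption, and `class_from_schema` and `field_names` are hypothetical helpers): the class is rebuilt from the descriptor on each run, so a newly added optional field shows up without hand-editing any code.

```python
from dataclasses import field, fields, make_dataclass
from typing import Optional


def class_from_schema(class_name, schema):
    """Build a dataclass whose attributes mirror the descriptor's fields."""
    required, optional = [], []
    for spec in schema["fields"]:
        if spec.get("constraints", {}).get("required"):
            required.append((spec["name"], str))
        else:
            optional.append((spec["name"], Optional[str], field(default=None)))
    # Dataclass fields without defaults must come before fields with defaults.
    return make_dataclass(class_name, required + optional)


def field_names(cls):
    """List the attribute names of a generated class, in declaration order."""
    return [f.name for f in fields(cls)]


# Hypothetical "before" and "after" descriptors: v2 adds one optional field.
SCHEMA_V1 = {
    "fields": [
        {"name": "agency_name", "constraints": {"required": True}},
        {"name": "agency_id"},
    ]
}
SCHEMA_V2 = {"fields": SCHEMA_V1["fields"] + [{"name": "agency_email"}]}
```

Calling `class_from_schema("Agency", SCHEMA_V2)` yields a class that accepts `agency_email`, while code written against the V1 class keeps working because the new attribute defaults to `None`.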

My own case: I created an RDF/Linked Data vocabulary for GTFS back in 2015. Today it’s horribly out of date, but we just started updating it to the latest spec: https://github.com/OpenTransport/linked-gtfs/pull/20

I wonder whether we should reframe the problem scope as: what programmatic description should we use so that everyone can keep their own technology-specific schema up to date and use it to import or validate GTFS static files? If this problem were solved, we could also have automatic translations to commonly used schema languages like JSON and XML Schema, SQL, protobuf, RDF/SHACL/ShEx, etc.

The problem is thus not choosing the one schema language to rule them all, but it is choosing the one that will best express the things decided in the GTFS specification, so that it can be automatically translated to all others.
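
As a minimal sketch of one such translation, the function below turns a Table Schema-style descriptor into a SQL `CREATE TABLE` statement. The descriptor layout, the type mapping, and the name `to_create_table` are all assumptions made for illustration; a real translator would need to cover many more types and constraints.

```python
# Assumed mapping from descriptor types to SQL column types.
TYPE_MAP = {"string": "TEXT", "integer": "INTEGER", "number": "REAL"}


def to_create_table(table_name, schema):
    """Render a CREATE TABLE statement from a descriptor dict."""
    columns = []
    for spec in schema["fields"]:
        sql_type = TYPE_MAP.get(spec.get("type", "string"), "TEXT")
        column = f"{spec['name']} {sql_type}"
        if spec.get("constraints", {}).get("required"):
            column += " NOT NULL"
        columns.append(column)
    key = schema.get("primaryKey")
    if key:
        # A primary key may be given as one field name or a list of names.
        names = [key] if isinstance(key, str) else list(key)
        columns.append("PRIMARY KEY (" + ", ".join(names) + ")")
    body = ",\n    ".join(columns)
    return f"CREATE TABLE {table_name} (\n    {body}\n);"


# Hypothetical descriptor for a small subset of stops.txt.
STOPS_SCHEMA = {
    "fields": [
        {"name": "stop_id", "type": "string", "constraints": {"required": True}},
        {"name": "stop_lat", "type": "number"},
        {"name": "stop_lon", "type": "number"},
    ],
    "primaryKey": "stop_id",
}
```

The same loop over `schema["fields"]` could emit another target format instead, which is the sense in which one source description can feed several technology-specific schemas.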

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity. Issues can always be reopened after they have been closed.

derhuerst commented 1 year ago

This is still relevant.

pietercolpaert commented 1 year ago

Ack! Still relevant.

I wonder if we could use https://linkml.io for this. It seems to do what I described above.

isabelle-dr commented 1 year ago

Re-opening :)

eliasmbd commented 1 year ago

📢 The participants in this conversation might want to look at issue #391 to discuss adding the GeoJSON format in GTFS as part of the GTFS-Flex extension proposal.