google / transit

https://gtfs.org/
Apache License 2.0

Use Entity-Relationship Model as Definitive Reference #415

Open abyrd opened 8 months ago

abyrd commented 8 months ago

GTFS is unusual in that the specification describes a file format, and presents itself as a file format, without any direct description of the entity-relationship model that the file format encodes. Most proposed changes seem to involve long discussions about the serialized interchange format of this data model, rather than the data model itself, which is never pictured or formally described.

Would it be beneficial if the definitive point of reference in the GTFS spec, GTFS-RT spec, and all proposed extensions was a formal data model, with the serialization format and documentation following from the model, not the other way around?

Discussions might be much simpler if everyone was just looking at the same entity-relationship diagram. I suspect problems and inconsistencies would stand out more clearly and alternatives would be easier to communicate.

For years it has been a staple of research papers and guides making use of open transit data to reverse-engineer the entity-relationship model from the textual descriptions in the spec. For example:

To me it is really odd that the data model of GTFS is left as an exercise for the reader, to be rediscovered and shared over and over (even as a topic of scientific research), when in many other specifications it would be the definitive core.

Rather than a purely visual vector drawing, I would expect the data model to be expressed in a formal language, allowing it to be validated and automatically turned into a class diagram, interactively in response to edits. Almost as a convenient side effect, this should allow automatically generating a source code skeleton in any number of languages.
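To make the code-generation side effect concrete, here is a minimal sketch in Python. It assumes the model has already been parsed into a plain dictionary; `MODEL`, `TYPE_MAP`, `class_name` and `generate_skeleton` are all hypothetical names invented for this illustration, not part of any existing tool.

```python
# Hypothetical in-memory form of a formal data model. A real setup would
# parse this from a model file (for example DBML) rather than hard-code it.
MODEL = {
    "stops": {"stop_id": "id", "stop_name": "string",
              "stop_lat": "float", "stop_lon": "float"},
    "routes": {"route_id": "id", "route_name": "string"},
}

# Assumed mapping from model type names to Python type names.
TYPE_MAP = {"id": "str", "string": "str", "float": "float"}


def class_name(table: str) -> str:
    """Turn a table name like 'stop_times' into 'StopTimes'."""
    return "".join(part.capitalize() for part in table.split("_"))


def generate_skeleton(model: dict) -> str:
    """Emit Python dataclass source text, one class per table."""
    lines = ["from dataclasses import dataclass", ""]
    for table, fields in model.items():
        lines += ["", "@dataclass", f"class {class_name(table)}:"]
        lines += [f"    {name}: {TYPE_MAP[kind]}" for name, kind in fields.items()]
        lines.append("")
    return "\n".join(lines)


source = generate_skeleton(MODEL)
print(source)
```

The same loop could just as well emit Java classes or SQL DDL; the point is only that the skeleton falls out of the model mechanically.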

I know some people are cringing at the thought of UML and all the Enterprise Ceremony around it. I am not specifically talking about UML, just anything that gets the job done and is pleasant to work with, and is structured and line-oriented enough for effective diffing and version control.

Here's one of the more promising examples I've identified: https://github.com/holistics/dbml/

This is used by: https://dbml.dbdiagram.io https://dbdocs.io

And here is the full DBML syntax: https://dbml.dbdiagram.io/docs

Here is a quick sketch data model of a subset of GTFS-static (just an example, probably contains mistakes): https://dbdiagram.io/d/GTFS-Scheduled-Core-6576ceab56d8064ca0c69140

And an image of the live rendered model diagram:

[Image: DBML-GTFS-Core]

Note how the implicit "service" entity is rendered in the diagram, even though it's not materialized in GTFS feeds. Implicit entities of this kind seem to be causing confusion in recent change proposals, and representing them in the spec could make them much easier to communicate about.
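As a rough illustration of what "implicit" means here, the Python sketch below reconstructs the set of services from the two calendar tables and then checks trips against it. The rows are invented sample data, and `implicit_services` and `dangling_trips` are hypothetical helper names.

```python
# Hypothetical rows, as if already parsed from calendar.txt,
# calendar_dates.txt and trips.txt (only the relevant columns shown).
calendar = [{"service_id": "WEEKDAY"}]
calendar_dates = [
    {"service_id": "WEEKDAY", "date": "20240101", "exception_type": "2"},
    {"service_id": "HOLIDAY", "date": "20240101", "exception_type": "1"},
]
trips = [
    {"trip_id": "t1", "service_id": "WEEKDAY"},
    {"trip_id": "t2", "service_id": "HOLIDAY"},
    {"trip_id": "t3", "service_id": "SUNDAY"},
]


def implicit_services(calendar_rows, calendar_date_rows):
    """Collect every service_id mentioned by either calendar table."""
    from_calendar = {row["service_id"] for row in calendar_rows}
    from_dates = {row["service_id"] for row in calendar_date_rows}
    return from_calendar | from_dates


def dangling_trips(trip_rows, services):
    """Return trip_ids whose service_id matches no known service."""
    return [t["trip_id"] for t in trip_rows if t["service_id"] not in services]


services = implicit_services(calendar, calendar_dates)
print(sorted(services))                 # ['HOLIDAY', 'WEEKDAY']
print(dangling_trips(trips, services))  # ['t3']
```

No file in a feed lists services on their own, so every consumer ends up writing some version of this union. Drawing the entity in the model makes that step explicit.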

And the resulting published docs: https://dbdocs.io/abyrd/GTFS-Scheduled-Core?table=stop_times&schema=public&view=table_structure

And the source code used in the above sketch model:

Table stops {
  stop_id id [pk]
  stop_name string [null]
  stop_lat float [not null, note: 'range -90 to 90']
  stop_lon float [not null, note: 'range -180 to 180']
}

Table routes {
  route_id id [pk]
  route_name string [not null]
}

Table trips {
  trip_id id [pk]
  route_id id [ref: > routes.route_id]
  service_id string [ref: > services.service_id]
}

Table stop_times {
  trip_id id [ref: > trips.trip_id]
  stop_sequence integer
  stop_id id [ref: > stops.stop_id]
  arrival_time hhmmss [note: '''
    Conditionally Required:
    - Required for first and last stop in a trip
    - Required for timepoint=1
    - Optional otherwise
    ''']
  departure_time hhmmss
  pickup_type pick_drop [null, default: 0]
  drop_off_type pick_drop [null, default: 0]
  indexes {
    (trip_id, stop_sequence) [pk]
  }
}

Table services [note: 'elided table'] {
  service_id id [pk]
}

Table calendar {
  service_id id [ref: - services.service_id]
  "monday-sunday" bool
  start_date integer [note: 'Format YYYYMMDD']
  end_date integer [note: 'Format YYYYMMDD']
}

enum pick_drop {
  0 [note: 'Regularly scheduled']
  1 [note: 'No pickup available']
  2 [note: 'Phone agency']
  3 [note: 'Coordinate with driver']
}

Table calendar_dates {
  service_id id [ref: > services.service_id]
  date date
  exception_type integer [
    note: '''
      1 = added
      2 = removed
    ''']
}
e-lo commented 8 months ago

☝️ Yes. Or even better, this: https://github.com/google/transit/issues/127 (which could be easily used to generate the above)

abyrd commented 8 months ago

By the way, I fully understand and appreciate that GTFS was created as an antidote to “model entire world with angle brackets before implementing” traditional standardization processes like those of TCIP or NeTEx. People wanted to be able to create a stop without nesting XML 5 layers deep. But "formally defined data model" does not mean complex or unwieldy or excessive (as evidenced by DBML above).

GTFS started out as a CSV dump of a relational database at TriMet. People with some experience working with databases or data modeling tend to sense that and treat it as a relational model when discussing extensions. I think it benefits greatly from acknowledging that underlying structure.

At this point, the core of that specification has existed largely unchanged for a decade, and is used directly or indirectly by millions of people. It’s no longer a handful of files from which every reader intuitively perceives the source structure.

abyrd commented 8 months ago

"☝️ Yes. Or even better, this: #127 (which could be easily used to generate the above)"

Thanks @e-lo for pointing that out. That ticket, and especially the comments on it, are closely related to what I'm proposing here. What I'm proposing is a bit stronger and has a different goal though, so I'd like to treat it as a separate issue.

I believe #127 is suggesting that the GTFS spec include, as an additional derived element or resource, a machine-readable file for the specific purpose of generating source code to hold in-memory representations of GTFS data. In the present issue, I'm considering source code generation as a nice potential side effect but not the core purpose. I'm also proposing that we focus attention on the entity-relationship model itself, not the file that represents it, and give this model a very central role in GTFS, both technically and organizationally.

Regarding the problem statement that you and @dedavance listed in comments on that other issue in https://github.com/google/transit/issues/127#issuecomment-700299360 and https://github.com/google/transit/issues/127#issuecomment-735904862 I generally agree, but with the following emphases and differences in the context of the present issue:

A) Data model as definitive or canonical source. In a comment on the other issue you said: https://github.com/google/transit/issues/127#issuecomment-754297300

"I'm also interested in if the community would be amenable to using this type of definition as the canonical GTFS definition such that we can generate Markdown/HTML from the programatic definition in JSON rather than visa-versa."

This is exactly what I'm talking about. The key idea is that the model is the source of truth, not just another derived artifact to be manually maintained in the spec. And to the extent that something like protobuf definition files, for example, can't be generated or patched directly from the entity-relationship model language, they should be derived by hand using the most mechanical or automatic process possible. Anything that emerges as non-obvious in this derivation should be clarified in the data model as the source of truth, not in the downstream artifacts.
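To sketch what "documentation following from the model" could look like, here is a small Python example that renders a Markdown field table from a model fragment. The dictionary layout, the `presence` wording and the `render_markdown` function are assumptions made up for this illustration; they are not taken from #127 or from any existing GTFS tooling.

```python
# Hypothetical model fragment; field names follow the sketch in this thread.
MODEL = {
    "stops": [
        {"name": "stop_id", "type": "id", "required": True, "note": "primary key"},
        {"name": "stop_name", "type": "string", "required": False, "note": ""},
        {"name": "stop_lat", "type": "float", "required": True, "note": "range -90 to 90"},
        {"name": "stop_lon", "type": "float", "required": True, "note": "range -180 to 180"},
    ],
}


def render_markdown(model: dict) -> str:
    """Render one Markdown section with a field table per model table."""
    lines = []
    for table, fields in model.items():
        lines.append(f"### {table}.txt")
        lines.append("")
        lines.append("| Field | Type | Presence | Note |")
        lines.append("| --- | --- | --- | --- |")
        for field in fields:
            presence = "Required" if field["required"] else "Optional"
            lines.append(
                f"| `{field['name']}` | {field['type']} | {presence} | {field['note']} |"
            )
        lines.append("")
    return "\n".join(lines)


doc = render_markdown(MODEL)
print(doc)
```

With this direction of derivation, a wording change in the published reference is always a change to the model first, and the Markdown is regenerated rather than edited.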

B) The data model is equally central to GTFS, GTFS-RT, GTFS change proposals, and GTFS-RT change proposals. Proposals and discussions in all these places would primarily or at least initially concern changes to the data model, and which aggregates within that model are serialized as files or messages.

C) I'm suggesting that the governance and amendment process (currently under discussion in #413) be modified to take this into account. For example, an initial change proposal might consist of only a use case description and a patch to the data model. People would discuss and vote on only the data model change first, before anyone invested time in the minutiae of serialization into text files. Everyone would be looking at and discussing a single visual image. The rest of the process would need to be tweaked accordingly; this is just the initial idea.
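For a sense of how small such a patch could be, the Python sketch below diffs two versions of a line-oriented model file with the standard library's `difflib`. The file name `model.dbml` is hypothetical, and the added `wheelchair_boarding` column is used purely as an illustrative change to the sketch model above, not as an actual proposal.

```python
import difflib

# Current and proposed versions of a (hypothetical) model file.
current = """Table stops {
  stop_id id [pk]
  stop_name string [null]
}
""".splitlines(keepends=True)

proposed = """Table stops {
  stop_id id [pk]
  stop_name string [null]
  wheelchair_boarding integer [null]
}
""".splitlines(keepends=True)

# A unified diff is the form a reviewer would see in a pull request.
diff = list(
    difflib.unified_diff(
        current,
        proposed,
        fromfile="model.dbml (current)",
        tofile="model.dbml (proposed)",
    )
)
print("".join(diff))
```

A one-line addition like this, plus the use case description, would be the whole first-round proposal; serialization details would come only after the model change is accepted.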

I think GTFS has reached a size and complexity, both technically and in community structure, where its maintenance and viability may depend on this kind of formalization.

I'm starting to feel like this is almost a precondition for me to even participate in the change process. It is genuinely exhausting to reverse-engineer what personal data model each person has in their head when reading each comment, and to try to describe fragments of that data model in text over and over, trying to determine whether one's own model matches others'. Frequently I get the impression we are proceeding in the absence of even a mental data model, so the long-term coherence and stability of the spec will rely on the sheer chance of someone showing up at the right moment to infer and verify this shifting ghost of an idea.