google / transit

https://gtfs.org/
Apache License 2.0
581 stars 177 forks

Backwards compatibility and versioning #288

Open scmcca opened 2 years ago

scmcca commented 2 years ago

With new fields being added to the spec, we wanted to revisit the concept of backwards compatibility in GTFS. Currently, CHANGES.md reads:

When adding features to the specification, we want to avoid making changes that will make existing feeds invalid. We don't want to create more work for existing feed publishers until they want to add capabilities to their feeds. Also, whenever possible, we want existing parsers to be able to continue to read the older parts of newer feeds.

As implied, there are 2 parties to consider: data producers and data consumers.

For data producers, backwards-compatibility means that any changes to the spec do not invalidate or change the meaning of old datasets. That is to say, data producers won't need to change their datasets to make room for a new feature. This seems to have been respected so far, and is favorable to maintain for the reasons mentioned above.

For data consumers, backwards-compatibility means that any changes to the spec will only be built "on top" of current code. Meaning that if a data consumer chooses not to implement a new spec feature, all other historical implementations remain valid with old and new feeds containing the addition (the consumer will simply ignore the feature, with no consequence for other rider information).

The problem comes in when adding features that are non-backwards compatible for data consumers.

A recent example of this is the voting-in of #284. With the addition of more specific trip-to-trip and route-to-route transfers, consumers now have to update their code to look for trip-to-trip and route-to-route transfers on top of the existing stop-to-stop transfers. If they don't, data consumers may misrepresent the specificity of transfers, leading to errors in rider information.

More non-backwards compatible changes for data consumers are foreseeable. Otherwise, the development and improvement of GTFS would be severely limited.

Many of these issues stem from the fact that GTFS has no versioning (discussed previously in #215). While there is expressed interest from the community to version GTFS, there is also concern from other parties that this would complicate GTFS by encouraging breaking changes for all parties, leading to lower usage because of an additional barrier to entry/comprehension.

With the context that backwards compatibility means different things for different parties, I'm wondering if versioning is possible while maintaining that changes should be backwards-compatible for data producers. This would keep GTFS easy to use for data producers, while providing a mechanism for data consumers to track added features that break implementation.

Looking forward to thoughts on this from all parties. Thanks!

flocsy commented 2 years ago

I disagree with "backwards compatibility means different things for different parties". I think it means the same for both parties: "the change doesn't make 'me' need to touch my code if I don't want to". The difference is that a consumer whose code was written when only field1 and field2 were in the spec doesn't know when an additional field3 will be added (to the spec / to the data).

So an example of a truly backwards-compatible change is adding a new file, or a new optional field to an existing file, where by not reading the new file/field all the rest remains complete and true, and the semantics of reading only the old fields do not change regardless of whether the producer adds the new field or what its value is. (Examples could be adding more info, like is_wheelchair_accessible, or adding an image [which shouldn't have text on it according to the spec].)

An example of a not really backwards-compatible change would be to add a new optional field2 such that if it is used, a previously existing optional field1 should be left empty. This sounds backwards compatible from the producer's point of view (they are not forced to implement field2, and the "old" GTFS they produce will still mean exactly the same thing), but it creates a problem for consumers: an empty field1 might now mean something different to a consumer that doesn't know about field2 than to a consumer that uses field2.

Because the point of GTFS is to communicate between producers and consumers, it is in the interest of both of them to have "only" backwards-compatible changes when possible. But because of that same interest, adding versioning would enable improvements that cannot be backwards compatible. However, I think that if we add versioning, it should also be part of the spec that producers are recommended to continue providing the old version for at least X time. Actually, I think it should be part of the versioning that producers announce the end-of-life date of their older version feeds.

For example we could have a new file, version.csv:

- gtfs_version: integer (no need to complicate, IMHO)
- newer_version_available: 0 | 1
- end_of_life_date: date, required if newer_version_available is 1

It would also be useful if there could be an "automatic discovery" of newer versions. What I mean by that is that if a consumer has already implemented version 2, and a producer that until now had only version 1 starts to produce version 2 today, then the consumer should be able to detect the v2 files/url and automatically use it instead of v1 without any manual intervention. This could be achieved by having filenames with _v1.csv/_v2.csv endings, by having directories in the zip file (v1/, v2/), by having urls like http://example.com/gtfs_root/v1/... and http://example.com/gtfs_root/v2/..., or by one or more optional fields in version.csv: next_version_directory, next_version_url
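A minimal consumer-side sketch of the version.csv idea above. To be clear, the file and all of its fields (gtfs_version, newer_version_available, end_of_life_date) are assumptions from this proposal, not part of the current GTFS spec:

```python
import csv
import io

# Hypothetical version.csv body, following the fields proposed above;
# none of these fields exist in the current GTFS spec.
VERSION_CSV = """gtfs_version,newer_version_available,end_of_life_date
1,1,20241231
"""

def check_version(csv_text):
    """Return (version, newer_available, end_of_life) from a version.csv body."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    return (
        int(row["gtfs_version"]),
        row["newer_version_available"] == "1",
        row.get("end_of_life_date") or None,
    )

version, newer, eol = check_version(VERSION_CSV)
if newer:
    print(f"v{version} is deprecated (end of life {eol}); look for a newer feed")
```

A consumer could run such a check on every feed refresh and alert operators when the feed it depends on announces an end-of-life date.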

barbeau commented 2 years ago

I agree with @flocsy's observations above.

An example of a not really backwards compatible change would be to add a new optional field2 such that if it's used then a previously existing optional field1 should be left empty.

A good example of this type of subtle breaking change that has already happened in GTFS is the introduction of pathways.txt, which changed the stop_lat and stop_lon fields from Required to Conditionally Required. Our gtfs-realtime-validator was written prior to the introduction of pathways and therefore made the assumption that stops.txt would always have a latitude and longitude. So it crashed on the first ingestion of an official GTFS dataset that included pathways and we had to make a code change to accommodate missing stop lats and longs under certain conditions.

I think that if we add versioning it should also be part of the spec that it is recommended that providers continue to provide the old version for at least X time. Actually I think that it should be part of the versioning that the producers will announce the end-of-life date of their older version feeds. [...] It would also be useful if there could be a "automatic discovery" of newer versions.

FWIW, GBFS is currently voting on a new and more detailed versioning proposal here that GTFS could take inspiration from, including a definition of "Long Term Support (LTS)" for "current versions" of the spec (GBFS already supports auto-discovery as well): https://github.com/NABSA/gbfs/pull/386

paulswartz commented 2 years ago

I don't like the idea of having the version be specified in the URL, unless the URL itself is present in the GTFS file in some fashion. GBFS has a gbfs_versions.json which could be a good model here: maybe versions.txt with version,url as the columns?

flocsy commented 2 years ago

@paulswartz I think my idea is compatible with what you linked. What I wanted to add is the "automatic discovery". For this, 2 things are needed: 1. the producer produces both the old and the new versions for a while; 2. the old version contains enough information to detect not only the existence of the new version but, if it's a version the consumer has already implemented, also to find its location (this is usually a url with the version somewhere in it, or could also be a different directory in the same zip, though that seems wasteful because everyone would download twice as much data as they actually need/use).

So we could have a version.csv:

- version: integer
- end_of_life_date: date, required if a newer version is available

Or a versions.csv:

- version: integer (no need to complicate, IMHO)
- end_of_life_date: date, required if a newer version is available
- url: link to where the given version is downloadable (I guess as zip?)

It would not make much sense to include versions.csv in each version, it could be in some url that never changes... Not really sure, a bit tricky.

paulswartz commented 2 years ago

I think having a semantic version might be better than a single number. Maybe adding pathways.txt was a breaking change deserving of a new Major version, but #49 (adding text-to-speech fields) could be a Minor version as it's only adding a new field.
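To illustrate why a semantic version helps consumers: only the major component signals a breaking change, so a consumer can gate on it mechanically. A small sketch, assuming hypothetical MAJOR.MINOR version strings for GTFS:

```python
def parse_semver(s):
    """Parse 'MAJOR.MINOR' (or 'MAJOR.MINOR.PATCH') into a tuple of ints."""
    return tuple(int(part) for part in s.split("."))

def is_breaking_upgrade(old, new):
    """Under semantic versioning, only a major-version bump signals a breaking
    change; a minor bump (like adding text-to-speech fields) only adds to the
    spec and can safely be ignored by older consumers."""
    return parse_semver(new)[0] > parse_semver(old)[0]

# e.g. a change like pathways.txt making stop_lat/stop_lon conditional
# would warrant a major bump; adding an optional field only a minor one.
```

The version numbers here are illustrative; GTFS has no adopted version scheme today.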

derhuerst commented 2 years ago

I agree with most notes about the versioning itself:

It would also be useful if there could be a "automatic discovery" of newer versions. What I mean by that is that if a consumer already implemented version:2, and a producer that until now had only version:1 starts to produce version:2 today, then the consumer should be able to detect the v2 files / url and automatically use it instead of v1 without any manual intervention. This could be either having filenames with _v1.csv _v2.csv ending or having directories in the zip file: v1/, v2/ or having urls: http://example.com/gtfs_root/v1/... http://example.com/gtfs_root/v2/... or it could be one or more of the optional fields in version.csv: next_version_directory, next_version_url

I'd argue against making the structure of the GTFS .zip archive more complicated:

There are many existing technical mechanisms available to implement this. This is why I'd strongly suggest not making up another one; specifically, I prefer adding a recommendation to use HTTP content negotiation (MDN explainer & great explanation and call for it) instead of separate URLs.
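As a sketch of the content-negotiation idea: a consumer could advertise the spec versions it understands via the Accept header. Note the application/gtfs+zip media type and its version parameter are hypothetical; nothing like it is registered for GTFS:

```python
import urllib.request

# Hypothetical media types: GTFS has no registered versioned MIME type today,
# so "application/gtfs+zip;version=N" is purely illustrative.
ACCEPT = "application/gtfs+zip;version=3, application/gtfs+zip;version=2;q=0.5"

def build_feed_request(url):
    """Build a request asking for the newest spec version the consumer
    understands; servers that ignore the Accept header simply return their
    default representation, so this degrades gracefully."""
    req = urllib.request.Request(url)
    req.add_header("Accept", ACCEPT)
    return req

req = build_feed_request("https://example.com/gtfs.zip")
```

The q-values express preference order, so a version-aware server could pick the best representation it has while older servers stay unaffected.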

A closely related topic: In the MobilityData GTFS Slack, there has been a discussion on how to express versions and/or modification times of a feed accessible in a machine-readable way, and there is a related GBFS Issue: https://github.com/NABSA/gbfs/issues/394.

flocsy commented 2 years ago

@derhuerst so, what do you propose in order to be able to discover that a newer version is present, if you are against both having an independent versions.csv and adding it (and updating it as necessary) to all the GTFS versions?

e-lo commented 2 years ago

✅ Semantic versioning ✅ LTS designations and transition periods ✅ A strategy for adopting a more modern architecture > .zip

e-lo commented 2 years ago

🥫 🧠 on "auto discovery" of versions.

derhuerst commented 2 years ago

@derhuerst so, what do you propose in order to be able to discover that a newer version is present, if you are against both having an independent versions.csv and adding it (and updating it as necessary) to all the GTFS versions?

Independent as in "next to the .zip file"? I think it's a good idea! I'd rather choose an existing standard though, e.g. Data Package.

In addition, we could add a recommendation to the spec to use HTTP content negotiation (MDN explainer & great explanation and call for it). It's HTTP-exclusive though, so it wouldn't work for FTP, USB sticks, etc.

Although I agree theoretically, in practice, no registry will ever contain all feeds. Some of the feeds will always exist in private and/or experimental contexts, and IMHO a versioning mechanism should work there as well. This is entirely my personal preference, but I'd like to keep the metadata close to the data it describes, and aggregate (in the "copy" sense) it into registries.

flocsy commented 2 years ago

@derhuerst I don't see how content negotiation is relevant. But maybe I don't understand what you mean. If I interpret content negotiation correctly, it enables the client to get a "given" content in a number of different "formats". The semantics of the given content are the same in this case (json vs xml, English vs French...). However, when we talk about different versions, we talk about different meanings by definition.

But the bigger problem with content negotiation is that if that (or something similar at the http level) became the standard, it would force producers to make changes to their servers that might not be that easy. So I think it's better to think of versioning that can be achieved with the "tools" we know every producer and consumer is capable of using. Adding files and fields should be good enough (also +1 for keeping it close to the data). If and when there is a universal directory for registered feeds, that directory may have a protocol that uses content negotiation or any other feature specific to it.

derhuerst commented 2 years ago

I don't see how content negotiation is relevant. But maybe I don't understand what you mean. If I interpret content negotiation correctly then it enables the client to get a "given" content in a number of different "formats".

Yes, but the example I linked to above explains that using this mechanism for different "format versions" of the same content (thus, variants that are semantically equivalent) aligns well with the principles that MIME types and HTTP have been designed around.

FWIW, GitHub uses this mechanism for versioning its REST API, and quite a few others use it as well. They represent the same semantic content (e.g. a repository) in different formats.

The semantic of the given content is the same in this case (json vs xml, English vs French...). However when we talk about different versions then we talk about different meaning by definition.

Aren't two "format versions" (e.g. GTFS-Static v2.1 with GTFS-Fares v2 & GTFS-Static v3.0 with GTFS-Fares v2) of the same data (e.g. whole transit system of Tokyo, 2021-12-12 until 2022-06-06) semantically equivalent a.k.a. have the same meaning?

Of course, different "data versions" (different time frames, or updated/fixed data) are a different thing and shouldn't be negotiated via HTTP content negotiation!

But the bigger problem with content negotiation is that if that (or something similar in the http level) would become the standard then it would force producers to do changes to their servers that might not be that easy.

That's why I propose to add it as a recommendation (just like e.g. feed_info.txt currently is). More motivated and technically skilled feed providers could help with automated & efficient consumption of their feed by following it; Most smaller agencies will likely never adopt it, and that's IMHO a reasonable trade-off.

So I think it's better to think of versioning that can be achieved by the "tools" we know every producer and consumer is capable to use. Adding files and fields should be good enough (also +1 for being close to the data).

From my personal experience in Germany, most transit authorities/agencies contract with more technical people for the creation and sometimes serving of a GTFS feed anyway. But I understand that a standard should be as accessible as reasonably possible while keeping its technical goals.

paulswartz commented 2 years ago

Content negotiation works fine for APIs, but thinking of MBTA's GTFS file, it's uploaded to Amazon S3 and served through a CDN. We're unlikely to support content negotiation, as we'd need to add additional infrastructure to support that. However, serving files at different URLs is very easy.

skinkie commented 2 years ago

Content negotiation works fine for APIs, but thinking of MBTA's GTFS file, it's uploaded to Amazon S3 and served through a CDN. We're unlikely to support content negotiation, as we'd need to add additional infrastructure to support that. However, serving files at different URLs is very easy.

What about a GBFS-like structure for multiple urls? That would also solve the problem of some files changing more frequently than others.

gcamp commented 2 years ago

+1 semantic versioning.

+1 Automatic discovery. Could be a simple static .json (or CSV if we really want) served at a static URL, the same way GTFS feeds are typically served right now, that lists the versions available and URLs for download. The version should also be included in the GTFS itself, but I would do auto discovery separately from the feed. Very similar to what @paulswartz proposed.

Unsure about changing the zip format to something else at this point. I'm not against adding a new way of getting GTFS feeds, but that seems unrelated to version control, and whatever version-control mechanism is created will need to support version 1.0 anyway.
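A sketch of what consuming such a static discovery file might look like, assuming a hypothetical JSON layout loosely modeled on GBFS's gbfs_versions.json (the field names and URLs are illustrative, not part of any adopted proposal):

```python
import json

# Hypothetical discovery document served at a stable URL;
# the layout is an assumption modeled on GBFS's gbfs_versions.json.
VERSIONS_JSON = """{
  "versions": [
    {"version": "1.0", "url": "https://example.com/gtfs_v1.zip"},
    {"version": "2.0", "url": "https://example.com/gtfs_v2.zip"}
  ]
}"""

def pick_feed(versions_json, supported):
    """Pick the newest listed version that the consumer supports."""
    versions = json.loads(versions_json)["versions"]
    usable = [v for v in versions if v["version"] in supported]
    if not usable:
        return None
    return max(usable, key=lambda v: tuple(map(int, v["version"].split("."))))

best = pick_feed(VERSIONS_JSON, supported={"1.0", "2.0"})
```

A consumer that only supports 1.0 keeps downloading the v1 url; the moment it ships 2.0 support, the same discovery file automatically routes it to the v2 feed.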

derhuerst commented 2 years ago

Content negotiation works fine for APIs, but thinking of MBTA's GTFS file, it's uploaded to Amazon S3 and served through a CDN. We're unlikely to support content negotiation, as we'd need to add additional infrastructure to support that. However, serving files at different URLs is very easy.

Fair enough, that is probably the case for many feed providers.

You'll probably always need some level of server-side scripting to do such version-based content negotiation, even if it's just using the Apache/nginx config files.

But also keep in mind that redirecting is a perfectly valid mechanism as well. It would be possible to have such HTTP content negotiation and still serve the actual GTFS feeds from a CDN.

flocsy commented 2 years ago

@derhuerst

Aren't two "format versions" (e.g. GTFS-Static v2.1 with GTFS-Fares v2 & GTFS-Static v3.0 with GTFS-Fares v2) of the same data (e.g. whole transit system of Tokyo, 2021-12-12 until 2022-06-06) semantically equivalent a.k.a. have the same meaning?

I don't know about the changes in GTFS-Static's different versions, but I suspect they are not equivalent. For sure, Fares V1 and Fares V2 are not equivalent. If they were, we wouldn't have been talking about Fares V2 for so long... (At the moment I'm not even convinced that V2 is backwards compatible with V1.)

derhuerst commented 2 years ago

Aren't two "format versions" (e.g. GTFS-Static v2.1 with GTFS-Fares v2 & GTFS-Static v3.0 with GTFS-Fares v2) of the same data (e.g. whole transit system of Tokyo, 2021-12-12 until 2022-06-06) semantically equivalent a.k.a. have the same meaning?

I don't know about the changes in GTFS-Static's different versions, but I suspect they are not equivalent.

I'd say it depends on the change, but in most of the cases they're semantically equivalent.

(This might seem to get side-tracked, but bear with me.) Let's say I have two image files, representing the same (as in same subject, same framing, same lighting, etc) "scene" portraying my dog: One image file uses an image format that only supports 8 bit colors, the other image file uses a more modern (but differently structured and thus technically incompatible) image format that supports 10 bit colors.

Are the two pictures technically equivalent? Definitely not, as they use entirely different, incompatible formats! Are they semantically equivalent? Within the scope of my metaphor, even though the former image doesn't contain as much color as the latter, I'd argue so, because both represent the same "scene". (In this case, HTTP content negotiation would make a lot of sense, but that is beside the point.)

Applying this to GTFS, I'd argue that both imaginary GTFS datasets mentioned above (formatted in different major versions of the GTFS-Static spec) of course have different contents technically speaking, but both try to represent a transit schedule for a certain time frame with whatever means are available in the respective version of the spec.

I think what our discussion comes down to is two different kinds of "compatibility": compatibility in a technical sense (as described by @flocsy & above by @barbeau) and in a semantic sense.

flocsy commented 2 years ago

@derhuerst I agree with your image example. For the same reason it would also apply if your 2 images used 2 different versions of GIF. However, in the case of GTFS this doesn't always hold. For example, we are adding GTFS Fares V2 in order to be able to represent fares that are not possible in Fares V1. There can be a producer whose fares can be fully represented in V1 and who chooses to (also) produce a V2; in this case the 2 will be equivalent. But another producer might not be able to represent their fares in V1, and thus their V1 and V2 won't be equivalent.

Thus Fares V1 and Fares V2 are not equivalent.

gcamp commented 2 years ago

What are the thoughts on doing a simple version repository JSON as proposed by @paulswartz and myself? It would also be very similar to how GBFS does it, which is a plus.

flocsy commented 2 years ago

I am for it, but not everything is decided yet IMHO.

What most people agreed on:

I don't see any problem with also including the previous (still active) versions. This would enable backwards-compatible discovery, though I'm not sure how useful it is. If a consumer implements it now, they can see the existing versions and decide which url to download from.

It would be nice to have forwards-compatible auto discovery, meaning that if a consumer is capable of processing a newer version and the producer starts to provide it, the consumer could automatically switch to the newer version. But so far I have not seen a reasonable solution for this. IMHO we'll need to add a versions.csv (or json) that is OUTSIDE of the gtfs zip. However, the version.txt that's inside the zip could point to this file. This file is small, can be downloaded frequently for auto-discovery, and since it can be a static file, even less advanced producers can provide it (even caching with ETag and Last-Modified headers).

In each gtfs zip we could add version.txt (optional in 1.0, but required in any later version), with 1 line with the following fields:

- version: required string with semantic versioning
- last_updated: required? timestamp
- supported_until: optional timestamp for ttl (only provided when a newer version already exists and it has been decided that this version will not be provided after this time)
- versions_url: optional string where the versions.csv can be found

versions.csv (optional, external file, not inside the gtfs zip) has 1 or more lines with these fields:

- version: required string
- last_updated: required? timestamp
- supported_until: optional timestamp for ttl (only provided when a newer version already exists and it has been decided that this version will not be provided after this time)
- url: required string

This allows auto discovery for newer versions and it's simple to implement.
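A minimal consumer-side sketch of the version.txt part of this proposal, assuming the hypothetical field names listed above (version, last_updated, versions_url); the zip built here is only a stand-in for a real feed:

```python
import csv
import io
import zipfile

def read_version_txt(feed_zip):
    """Read the proposed (hypothetical) version.txt from a GTFS zip.
    Returns None for feeds published before such a file existed."""
    with zipfile.ZipFile(feed_zip) as z:
        if "version.txt" not in z.namelist():
            return None  # pre-versioning feed: treat as version 1.0
        with z.open("version.txt") as f:
            return next(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))

# Build a tiny in-memory feed to exercise the reader.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(
        "version.txt",
        "version,last_updated,versions_url\n"
        "2.0,2023-01-01T00:00:00Z,https://example.com/versions.csv\n",
    )
row = read_version_txt(buf)
```

The versions_url field is what would let the consumer jump from the zip to the external versions.csv for auto-discovery.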

skinkie commented 2 years ago

The problem with 'inside' a zip file is that you would have to crawl all variants until you find one matching your flavor.

gcamp commented 2 years ago

I don't think we need a complete version.txt file inside the GTFS. A gtfs_version inside feed_info.txt would be sufficient I think in addition to an external file.

e-lo commented 2 years ago

The problem with 'inside' a zip file is that you would have to crawl all variants until you match one of your flavor.

Ideally this info would be stored in a data catalog somewhere that crawls all the data anyway...external to the data itself.

skinkie commented 2 years ago

The problem with 'inside' a zip file is that you would have to crawl all variants until you match one of your flavor.

Ideally this info would be stored in a data catalog somewhere that crawls all the data anyway...external to the data itself.

Ideally, all the files would not be part of a zip but individually gzipped (or just sent with the correct transfer encoding), so that all different flavors are available in a single folder and have the ability not to change when other parts of the data change.

westontrillium commented 2 years ago

Agree with @gcamp's suggestion to use feed_info.txt for a gtfs_version field. Are there any other pieces of versioning info that file could host? It's always good to leverage existing files instead of new ones where possible, and feed_info.txt seems like a good fit for this kind of metadata.

flocsy commented 2 years ago

True, I forgot about feed_info.txt, so we can add the fields there. Which fields do we need to add and what do you think we should call them?

I propose: gtfs_version, gtfs_version_end_date (maybe gtfs_version_supported_until_date, but it's too long...), gtfs_versions_url

And for the versions.txt (I don't like that it's called txt; csv would be better, or maybe even json, but for simplicity's sake every consumer and producer will for sure be able to deal with it): should we call the fields the same as in feed_info.txt? gtfs_version, url, gtfs_version_end_date, feed_version, last_updated. last_updated should be a timestamp (often an error is discovered and fixed shortly after release, so a date alone is not enough). I think we can include both feed_version (which has to be the same as in the currently published feed_info.txt) and last_updated to make it easier for the consumers.

westontrillium commented 2 years ago

@flocsy Could you explain the function of a "versions.txt" file in relation to the proposed feed_info.txt fields? Maybe it's nestled in an earlier comment, but I don't yet understand what the separate versions.txt file would offer beyond additional fields in feed_info.txt.

flocsy commented 2 years ago

@westontrillium with feed_info.txt the consumer can know about the currently consumed feed. However, with versions.txt they could also auto-discover when newer (probably more feature-rich) versions appear. The idea is this: consumers usually implement new versions relatively fast, while most producers still only publish the older version. Once the consumer is ready to read a new version, then whenever a producer is ready, it'll show up in versions.txt, and thus the consumer's system can automatically switch to the newer version.

flocsy commented 2 years ago

What happened to this discussion? I think we should decide ASAP and finalize and vote on this, because other new features (some maybe even non backwards compatible) are voted in almost on a weekly basis nowadays (and that's a good thing!)

gcamp commented 2 years ago

@flocsy do you want to start a PR with what was discussed here?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

flocsy commented 1 year ago

@gcamp I see 2 problems: 1. as I recall, there wasn't really any conclusion; 2. it doesn't seem to bother others that much. If I'm right about the 2nd thing, then let's close it until the real need arises (though I have a feeling that by then it'll be a bit late :)

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It may be closed manually after one month of inactivity. Thank you for your contributions.

isabelle-dr commented 3 months ago

Re-opening this issue, since versioning is a recurring discussion topic.