stable (or semi-stable) trip identifiers

We've heard from at least two different users of Transitland that it would be helpful to provide stable identifiers for trips (or at least identifiers that are more stable than those included in many GTFS feeds' trips.txt file):

@barbeau wrote in January:

Here’s the link on the need for trip hashing in OneBusAway:

https://github.com/OneBusAway/onebusaway-android/issues/333

There is also a link to Conveyal’s implementation of trip hashing for OTP (and on that issue, a link to the algorithm we implemented for our TAD mobile app: https://github.com/opentripplanner/OpenTripPlanner/issues/1573).

As I mentioned, I’d love to see Transitland provide a mechanism to determine if the “same” trip survives a GTFS data update, and what that new equivalent trip_id is. Different apps may be able to tolerate more substantial changes to trips than others, so other metadata about what is the same and what is different may also be useful, although I believe that would also complicate things.

For example, for TAD (doing real-time transit navigation) we needed to know if the following was the same in a trip:

Origin stop (location)

Destination stop (location)

2nd-to-last stop (sequence relationship to destination stop)

Other portions of the trip, and even potentially the trip departure/arrival times, could change, and TAD would still function correctly. Changes to trip upstream of the origin stop also wouldn’t matter to us.
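
A minimal sketch of the TAD-style comparison described above, for illustration only. It is not the algorithm linked from the OTP issue. The function name `tad_signature`, the `(lat, lon)` stop representation, and the rounding precision are all assumptions. It hashes only the locations of the origin stop, the 2nd-to-last stop, and the destination stop, so changes elsewhere in the trip (or upstream of the origin) do not affect the result.

```python
import hashlib
from typing import List, Tuple

Stop = Tuple[float, float]  # (lat, lon); hypothetical representation


def tad_signature(stops: List[Stop], origin_index: int = 0, precision: int = 4) -> str:
    """Hash the origin, 2nd-to-last, and destination stop locations only."""
    relevant = stops[origin_index:]  # ignore everything upstream of the origin
    if len(relevant) < 2:
        raise ValueError("need at least an origin and a destination stop")
    key_stops = [relevant[0], relevant[-2], relevant[-1]]
    # Round coordinates so tiny edits to stop locations do not change the hash
    # (assumption: 4 decimal places, roughly 10 m, is an acceptable tolerance).
    text = "|".join(f"{lat:.{precision}f},{lon:.{precision}f}" for lat, lon in key_stops)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


old = [(27.95, -82.46), (27.96, -82.45), (27.97, -82.44), (27.98, -82.43)]
# Same origin, 2nd-to-last, and destination; the middle of the trip changed.
new = [(27.95, -82.46), (27.9612, -82.4512), (27.965, -82.445), (27.97, -82.44), (27.98, -82.43)]
same_for_tad = tad_signature(old) == tad_signature(new)
```

Rounding is a crude tolerance: two nearly identical coordinates that fall on opposite sides of a rounding boundary would still produce different signatures, so a real implementation would probably want a distance-based comparison instead.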

For OneBusAway, for arrival/departure reminders based on real-time info, we’re more interested in the following for a given origin stop:

Departure time of the “same” trip for a given stop (times may not need to be exact – the closest arrival within a tolerance threshold like 10 min may work).

Rough geometry of the trip downstream of the stop (i.e., could this trip still take the rider to their same unknown destination).

In OBA we wouldn’t care about changes to trip for stops upstream of the given origin stop in the trip.
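
The OneBusAway-style matching described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the `(stop_id, departure_seconds)` representation, the use of downstream stop-ID overlap as a stand-in for "rough geometry", and the premise that stop IDs stay stable between feed versions.

```python
from typing import Dict, List, Optional, Tuple

StopTime = Tuple[str, int]  # (stop_id, departure in seconds after midnight)


def downstream(trip: List[StopTime], stop_id: str) -> Optional[List[StopTime]]:
    """Return the part of the trip from stop_id onward, or None if it is not served."""
    for i, (sid, _) in enumerate(trip):
        if sid == stop_id:
            return trip[i:]
    return None


def find_equivalent(old_trip: List[StopTime],
                    new_trips: Dict[str, List[StopTime]],
                    stop_id: str,
                    tolerance_s: int = 600,      # the "10 min" threshold
                    min_overlap: float = 0.8) -> Optional[str]:
    """Pick the new trip closest in departure time at stop_id, within tolerance,
    that still serves most of the old trip's downstream stops."""
    old_tail = downstream(old_trip, stop_id)
    if old_tail is None:
        return None
    old_departure = old_tail[0][1]
    old_stops = {sid for sid, _ in old_tail}
    best_id, best_delta = None, None
    for trip_id, trip in new_trips.items():
        tail = downstream(trip, stop_id)
        if tail is None:
            continue
        delta = abs(tail[0][1] - old_departure)
        if delta > tolerance_s:
            continue
        overlap = len(old_stops & {sid for sid, _ in tail}) / len(old_stops)
        if overlap < min_overlap:
            continue
        if best_delta is None or delta < best_delta:
            best_id, best_delta = trip_id, delta
    return best_id


old_trip = [("A", 28800), ("B", 29100), ("C", 29400), ("D", 29700)]
new_trips = {
    "t1": [("Z", 28500), ("B", 29220), ("C", 29500), ("D", 29800)],  # upstream changed
    "t2": [("B", 30900), ("C", 31200), ("D", 31500)],                # 30 min later
    "t3": [("B", 29160), ("X", 29400), ("Y", 29700)],                # goes elsewhere
}
match = find_equivalent(old_trip, new_trips, "B")
```

Here `match` is `"t1"`: stops upstream of `B` are ignored, `t2` falls outside the time tolerance, and `t3` no longer serves the downstream stops.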

But, even if other metadata isn’t available, just knowing whether an equivalent trip as a whole existed would be very useful.

and more recently we heard from a transit agency (https://hellomapzen.zendesk.com/agent/tickets/808):

I was wondering if it is possible to use Transitland in a way that we can add a unique identifier for each trip.

The reason for this is because our vendor changes the trip id field every time we have a service change, even if the trip remains the same. While it is possible to keep the gtfs trip_id the same across service changes, we are not able to keep the trip_id the same for our realtime gtfs.

Adding some kind of unique identifier for our trips would allow us to maintain continuity across gtfs changes for metric purposes.

At present, we just include trip IDs as strings on RouteStopPatterns and on ScheduleStopPairs.

Ideas to consider for the future:

Create IDs for trips that combine the Onestop ID of the appropriate RouteStopPattern with a final component that represents the start time of the trip. This would make the overall Onestop ID unique by route + stop sequence + geometry + start time.
Have a separate trip model/endpoint that allows for querying by GTFS trip_id, allows cross-referencing with RouteStopPatterns, Stops, etc.
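
As a rough sketch of the first idea, assuming a hypothetical `trip_onestop_id` helper and a placeholder RouteStopPattern Onestop ID (the value below is not a real Onestop ID, and the `-HHMMSS` suffix is just one possible encoding of the start time):

```python
def trip_onestop_id(rsp_onestop_id: str, start_time: str) -> str:
    """Append the trip's start time to a RouteStopPattern Onestop ID.

    start_time is a GTFS-style H:MM:SS or HH:MM:SS string. GTFS allows hours
    of 24 or more for trips that run past midnight of the service day, so the
    hour is validated only as non-negative.
    """
    hours, minutes, seconds = (int(part) for part in start_time.split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid GTFS time: {start_time!r}")
    return f"{rsp_onestop_id}-{hours:02d}{minutes:02d}{seconds:02d}"


RSP_ID = "r-xxxx-example-aaaaaa-bbbbbb"  # placeholder, not a real Onestop ID
morning = trip_onestop_id(RSP_ID, "7:30:00")
late_night = trip_onestop_id(RSP_ID, "25:05:00")
```

One limitation of this scheme: two trips that share a RouteStopPattern and a start time but run on different service calendars would receive the same ID, so a service-day component might also be needed.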

One major question to answer would be how long we persist IDs and records for trips. Erase them whenever we erase an old FeedVersion's ScheduleStopPairs? Or keep the trips, along with RouteStopPatterns to summarize past configurations of the transit network (even if service is no longer scheduled for some of those combinations)?

Timeframe: Not urgent.

I'm trying to integrate static feeds, realtime feeds and routing APIs from several public transportation operators/providers in Europe, so I have similar goals.

I need these stable IDs to have at least two properties:

They must be computable using relatively sparse data. As an example, using the rough geographic shape of an entire trip for its ID is not an option for me, because I often don't have access to that data (or its geographic route is dynamic anyways). For stable stop & trip IDs to be truly usable & universal, they must be easily computable with sparse data or client-side/offline.
Using the stable ID, I need to be able to query systems that don't (yet) know about it. Thus, I need to "reconstruct" basic identifying data like location/name/etc. from the ID itself.
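
To make the two properties above concrete, here is a minimal sketch of an ID that is computed from sparse data only and can be decoded back into its identifying fields. The field choice (line name, origin stop name, rounded origin coordinates, departure time), the `:` separator, and all names are assumptions for illustration, not a proposed standard.

```python
from dataclasses import dataclass
from urllib.parse import quote, unquote


@dataclass(frozen=True)
class SparseTrip:
    line_name: str
    origin_name: str
    origin_lat: float
    origin_lon: float
    departure: str  # ISO 8601 local time, e.g. "2017-03-01T08:15"


def encode(trip: SparseTrip) -> str:
    """Build an ID from normalized fields; computable client-side and offline."""
    parts = [
        trip.line_name.strip().lower(),
        trip.origin_name.strip().lower(),
        f"{trip.origin_lat:.3f}",  # about 100 m; tolerates small location edits
        f"{trip.origin_lon:.3f}",
        trip.departure,
    ]
    # Percent-encode each field (including ':') so the separator stays unambiguous.
    return ":".join(quote(part, safe="") for part in parts)


def decode(stable_id: str) -> SparseTrip:
    """Recover the identifying data, e.g. to query a system that lacks the ID."""
    line, name, lat, lon, departure = (unquote(p) for p in stable_id.split(":"))
    return SparseTrip(line, name, float(lat), float(lon), departure)


trip = SparseTrip("S1", "Berlin Hbf", 52.5251, 13.3694, "2017-03-01T08:15")
stable_id = encode(trip)
recovered = decode(stable_id)
```

The decoded record is the normalized form (lowercased names, rounded coordinates), which should be enough to search another system by name, location, and time.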

As with any "one standard to obsolete all other existing ones", this ID scheme won't be perfect. There will likely be a revised (but incompatible) 2nd version in the future, and therefore >=2 IDs for an entity.

I think the only reasonable way towards a globally stable, globally agreed-upon ID scheme is to store both the current best-effort of a stable ID (in order to gain experience with edge cases) as well as multiple local IDs (in order to keep compatibility with existing systems).

This is only remotely related, but the IPFS community has thought a lot about how, on a meta level, to make addresses & address schemes future-proof (i.e. self-describing, upgradable):

We could use e.g. the multiaddr markup to encode multiple IDs of an operator/station/trip into one "package".

transitland / transitland-datastore

stable (or semi-stable) trip identifiers #713