jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License

Populate and use `started_at` column in `trip` table for disambiguating trips #125

Open cedarbaum opened 11 months ago

cedarbaum commented 11 months ago

The started_at field in the trip table is currently neither populated nor used. There are two reasons it could be useful:

  1. Per this article, the start date of a trip can be necessary when joining realtime data with static data.
  2. It can disambiguate duplicate trip IDs occurring at once (e.g., Trip_1 starts at 11:30PM and then an overlapping Trip_1 starts at 12:01AM the next day).

(1) is a longer term concern associated with the work described in https://github.com/jamespfennell/transiter/issues/11, but (2) could prove useful for general data integrity with the existing API.

To accomplish this, the key transiter.public.trip.trip_route_pk_id_key will have to be changed to incorporate the started_at field as well (e.g., transiter.public.trip.trip_started_at_route_pk_id_key). This will break the assumption used throughout the system that, at any given point, there is a unique trip ID per route. For example, /systems/{system_id}/routes/{route_id}/trips/{trip_id} always returns a single trip. I believe this can be mostly solved with the below changes:

  1. Whenever a Trip or Trip.Reference is returned by the API, also return the started_at field.
  2. For the .../trips/{trip_id} endpoint, add an optional query parameter ?started_at={date} to disambiguate multiple trips with the same ID. If this query parameter is not provided, always return the earliest trip. I believe the default case matches what would happen today, since the later trip could not be added to the table until the earlier trip ends.
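
A minimal sketch of the lookup rule in (2), assuming the trips sharing one ID on a route have already been loaded. The `Trip` struct, `selectTrip`, and `mustTime` are hypothetical names for illustration only and are not part of Transiter's code:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Trip is a hypothetical, simplified stand-in for a row in the trip table.
type Trip struct {
	ID        string
	StartedAt time.Time
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
// It only exists to keep the example short.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// selectTrip chooses among trips that share one trip ID on one route.
// With startedAt == nil it returns the earliest trip (the proposed default);
// otherwise it returns the trip whose StartedAt equals *startedAt.
// The boolean is false when nothing matches.
func selectTrip(trips []Trip, startedAt *time.Time) (Trip, bool) {
	if len(trips) == 0 {
		return Trip{}, false
	}
	if startedAt == nil {
		sorted := append([]Trip(nil), trips...) // copy so the caller's slice is not reordered
		sort.Slice(sorted, func(i, j int) bool {
			return sorted[i].StartedAt.Before(sorted[j].StartedAt)
		})
		return sorted[0], true
	}
	for _, t := range trips {
		if t.StartedAt.Equal(*startedAt) {
			return t, true
		}
	}
	return Trip{}, false
}

func main() {
	trips := []Trip{
		{ID: "Trip_1", StartedAt: mustTime("2023-01-16T00:01:00Z")},
		{ID: "Trip_1", StartedAt: mustTime("2023-01-15T23:30:00Z")},
	}

	// No started_at query parameter: fall back to the earliest trip.
	earliest, _ := selectTrip(trips, nil)
	fmt.Println(earliest.StartedAt.Format(time.RFC3339))

	// ?started_at=... provided: pick the exact match.
	later := mustTime("2023-01-16T00:01:00Z")
	picked, ok := selectTrip(trips, &later)
	fmt.Println(picked.StartedAt.Format(time.RFC3339), ok)
}
```

An exact-match comparison on started_at is one possible choice here; matching on the date only would be another, depending on what the feed provides.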

@jamespfennell please let me know if you agree with the above problem statement and if you think this sounds like a workable solution.

jamespfennell commented 11 months ago

Ah this is gnarly :) I think your solution would certainly work, though it may make the API somewhat confusing because it kind of breaks the REST semantics.

I wonder whether, as an alternative, we could transform the trip ID for every realtime trip that comes into Transiter, and use that as the "trip ID" that we store in the database? We could then persist the regular trip ID as original_trip_id or something like that. The transformed trip ID could be <trip_id>_<started_at_date>_<started_at_time>, with defaults of 00-00-000 and 00:00 if the started at fields are not provided.

cedarbaum commented 11 months ago

Good points! I think changing the trip id format is definitely nicer from a REST perspective, but my concern is that it loses the 1-to-1 mapping with the source GTFS content. A couple other solutions I was thinking about:

  1. Change the .../trips/{trip_id} endpoint to return a list of trips with further references to individual trip URIs: .../trips/{trip_id}/{start_time}.
  2. Allow both .../trips/{trip_id} and .../trips/{trip_id}_{start_time} to match the same resource in cases where there is no ambiguity. If multiple trips are running with the same (GTFS) trip ID, then only allow .../trips/{trip_id}_{start_time} to work and return 404 otherwise.
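
To make option (2) concrete, here is a rough sketch of the resolution rule, assuming the start time appears in the URL as a plain string. `Trip`, `resolveTrip`, and `ErrNotFound` are hypothetical names for illustration, not existing Transiter code:

```go
package main

import (
	"errors"
	"fmt"
)

// Trip is a hypothetical, simplified view of a realtime trip.
type Trip struct {
	ID        string // GTFS trip ID as given in the feed
	StartTime string // start time as it would appear in the URL (format is an assumption)
}

// ErrNotFound is what an HTTP handler would translate into a 404.
var ErrNotFound = errors.New("trip not found")

// resolveTrip maps the last path segment of .../trips/{segment} to a trip.
// A qualified segment ({trip_id}_{start_time}) always works. A bare segment
// ({trip_id}) works only if exactly one running trip has that ID.
func resolveTrip(trips []Trip, segment string) (Trip, error) {
	var bareMatches []Trip
	for _, t := range trips {
		if t.ID+"_"+t.StartTime == segment {
			return t, nil // the qualified form is never ambiguous
		}
		if t.ID == segment {
			bareMatches = append(bareMatches, t)
		}
	}
	if len(bareMatches) == 1 {
		return bareMatches[0], nil
	}
	// Either no trip has this ID, or several do and the bare ID is ambiguous.
	return Trip{}, ErrNotFound
}

func main() {
	trips := []Trip{
		{ID: "Trip_1", StartTime: "23:30:00"},
		{ID: "Trip_1", StartTime: "00:01:00"},
		{ID: "Trip_2", StartTime: "08:00:00"},
	}
	for _, segment := range []string{"Trip_2", "Trip_1", "Trip_1_00:01:00"} {
		trip, err := resolveTrip(trips, segment)
		fmt.Println(segment, trip, err)
	}
}
```

One consequence worth noting: a URL that works today with the bare ID could start returning 404 as soon as a second trip with the same ID appears.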

Curious to know your thoughts! Also I should mention this isn't, from my view, an urgent issue, just something I was thinking about while reading GTFS documentation. My intuition is it's relatively rare in the wild, but I've neither run into it nor gone out of my way to look for such cases as of yet.

jamespfennell commented 10 months ago

I believe we could maintain the 1-1 mapping. Suppose we had the following convention for the "normalized trip ID":

In this case given a normalized trip ID, you can split on the last two _ to get the original trip ID, start date and time back, irrespective of the structure of trip ID (which may itself contain _ characters).
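
As a rough sketch of the round trip, assuming the convention is <trip_id>_<start_date>_<start_time> and that the date and time themselves contain no _ characters (the example uses GTFS-style YYYYMMDD and HH:MM:SS values). The function names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// normalizeTripID builds the normalized ID under the assumed convention
// <trip_id>_<start_date>_<start_time>.
func normalizeTripID(tripID, startDate, startTime string) string {
	return tripID + "_" + startDate + "_" + startTime
}

// parseNormalizedTripID splits on the last two underscores, so the original
// trip ID may itself contain underscores. It assumes the date and time parts
// do not contain any.
func parseNormalizedTripID(normalized string) (tripID, startDate, startTime string, err error) {
	timeSep := strings.LastIndex(normalized, "_")
	if timeSep < 0 {
		return "", "", "", errors.New("normalized trip ID has no start time part")
	}
	dateSep := strings.LastIndex(normalized[:timeSep], "_")
	if dateSep < 0 {
		return "", "", "", errors.New("normalized trip ID has no start date part")
	}
	return normalized[:dateSep], normalized[dateSep+1 : timeSep], normalized[timeSep+1:], nil
}

func main() {
	normalized := normalizeTripID("L_train_01", "20230115", "23:30:00")
	fmt.Println(normalized)

	tripID, startDate, startTime, err := parseNormalizedTripID(normalized)
	fmt.Println(tripID, startDate, startTime, err)
}
```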

As you say in option (2), we could also have options where this other trip ID is an alias for the regular trip ID, and so the API would work with either option as long as there is no ambiguity.

Our conversation so far has been very theoretical :) I think it would be interesting to find examples of systems that have this issue (maybe Amtrak?). It also seems to be related to the GTFS frequencies.txt file, which is a way to define many trips with the same trip ID but offset from each other.