GTFS to Component and Train Conversion

wklumpen commented 2 years ago

Would be good to have an ability to "read" a GTFS feed into a set of useable components based on some specified pre-determined parameters and settings.

The functionality would take one or more GTFS zipfiles and generate components.json, routes.json, and trains.json based on a fairly simple set of criteria. Users could then make modifications to the components to more easily set up the network.

wklumpen commented 1 year ago

We'll need to think carefully about this conversion, as it requires a bit of decision making. In particular:

- Some way of telling the converter to map stops/stations to specific component types
- Some default way of setting parameters for the mapped component types
- How to handle what is a "tour" and what is a "route" (probably make use of the trip->route structure in GTFS)
- How to handle deadheading or otherwise instantiating items

wklumpen commented 5 months ago

Hey @peterlai1, I wanted to check if there was any work that you ended up doing on this as a part of getting the GO system up and running.

Just before we have other contributions (from Omar) go too far down the path.

peterlai1 commented 5 months ago

> Hey @peterlai1, I wanted to check if there was any work that you ended up doing on this as a part of getting the GO system up and running.
>
> Just before we have other contributions (from Omar) go too far down the path.

The GO Train network model was built using ATLS data from Metrolinx instead of GTFS, so there wasn't any work done on this front. I did use GTFS as part of the creation process for the animations, though that's probably a separate feature discussion.

omar-kabbani commented 5 months ago

Thank you both for setting all this up! These are my thoughts on how we can translate GTFS into spur input:

Date/time
- Maybe as a start the user will have to choose a service_id from the GTFS dataset in calendar.txt, which would correspond to a typical day. Ex: Focus the analysis on a weekday only.
Components
- Tracks (u and v) can be obtained by parsing stop_times.txt, since we can get the order of stations in trips (ordered by stop_sequence). If the script picks up something like A->B->C and C->B->A (A, B, C are stations), then we can assume 2 track operations (?). One track for each direction, so we can assign key 1 to one direction and key 2 to the other direction - this is definitely a big simplification but it's probably a good start.
- I don't think we can get any capacity information from GTFS, but we can guesstimate the traversal_time which would be the difference of departure_time at the former stop and arrival_time at the latter stop.
- I think we can get the distance between stations too, but that might require some more work using shapes.txt. The station coordinates alone are of no use, since they only give us the distance as the crow flies.
Routes
- stop_times.txt can be used to identify the routes (those would be trip_id in GTFS).
Tours
- I don't think there's an equivalent to this in GTFS - but by parsing stop_times.txt and identifying the repeated routes (in spur lingo) and trips (in GTFS lingo), we can create a set of tours and their corresponding start times.
- Yards can be autogenerated at the end of the line for now.
Trains
- I think we can leave those out as a start, and have the user set this up
It's a good idea to parse routes.txt and only pick up the routes that correspond to rail/subway - that will cut out a lot of noise in datasets that combine rail and bus in one GTFS dataset (such as the TTC).

That was a lot of text, I can try to come up with a prototype - hopefully that would be a bit more digestible.
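A minimal sketch of the rail-only filter described above, using only the standard library. It assumes the basic GTFS `route_type` codes 0 (tram/light rail), 1 (subway/metro) and 2 (rail); the function name `rail_route_ids` and the sample rows are made up for illustration and are not part of spur.

```python
import csv
import io

# Basic GTFS route_type codes treated as rail here (an assumption a user may want to widen).
RAIL_ROUTE_TYPES = {"0", "1", "2"}


def rail_route_ids(routes_file, rail_types=RAIL_ROUTE_TYPES):
    """Return the set of route_id values whose route_type is rail-based."""
    reader = csv.DictReader(routes_file)
    return {
        row["route_id"]
        for row in reader
        if row["route_type"].strip() in rail_types
    }


# Hypothetical routes.txt content: one subway route and one bus route.
SAMPLE_ROUTES = """route_id,route_short_name,route_type
4,Line 4,1
10,Van Horne,3
"""

rail_ids = rail_route_ids(io.StringIO(SAMPLE_ROUTES))
print(rail_ids)
```

The resulting set can then be used to drop any row of trips.txt whose `route_id` is not in it before stop_times.txt is parsed.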

wklumpen commented 5 months ago

Hi Omar, thanks for looking into this!

A few thoughts:

- The goal here should be to parse a "reasonable" skeleton network from GTFS into the graph/json scheme we use for Spur. I can't imagine a situation where an import might not need some tweaking (single track operations, for example). I would start with some reasonable assumptions but leave some flags open to incorporate different logic if a user supplies "single track where possible" or whatever. We can sort that out after the basic thing gets going.
- This will definitely get incorporated into a UI, so making it as "callable" as possible (or just leaving it as its own script for now) is useful.
- I would suggest using routes.txt for routes, and then trips.txt as a way of stitching tours together. You can use the logic that when a trip ends, and another trip begins after that from the terminus station, the vehicle just heads back the other way. That might be a way of reducing the overall number of required agents.
- As for trains, the simplest thing would be to try and get the smallest number of tours possible based on the timing logic in the GTFS and then just assign a train to each of them (again, this could be flagged by the user).
- Agreed: I would start by only using route_types that are rail-based.

peterlai1 commented 5 months ago

Regarding tours, the block_id field in trips.txt is perfect for that! Each block_id is associated with a set of trips and they are meant to be operated continuously back to back, often by the same vehicle (not necessarily going back in the opposite direction on the same line, could be operating a different line in the case of interlining). This was actually my original reason for creating the tours entity, as a way to replicate blocks in the GTFS to cut down on redundancy. While different agencies encode their block info slightly differently in GTFS, for the first take of this I'd say doing a one-to-one adaptation of blocks in GTFS as tours in Spur will be adequate.
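A rough sketch of the one-to-one block-to-tour idea: group trips by `block_id` and order each group by first departure. The function name, the input shapes, and the fallback for trips with no `block_id` (each becomes its own group) are assumptions for illustration only.

```python
from collections import defaultdict


def tours_from_blocks(trips, first_departure):
    """Group trip_ids by block_id, ordered by each trip's first departure in seconds."""
    blocks = defaultdict(list)
    for trip in trips:
        # block_id is optional in GTFS; a trip without one gets its own group (assumption).
        block_id = trip.get("block_id") or f"trip:{trip['trip_id']}"
        blocks[block_id].append(trip["trip_id"])
    return {
        block_id: sorted(trip_ids, key=lambda t: first_departure[t])
        for block_id, trip_ids in blocks.items()
    }


# Hypothetical rows from trips.txt and first departure times in seconds after midnight.
trips = [
    {"trip_id": "trip2", "block_id": "100"},
    {"trip_id": "trip1", "block_id": "100"},
    {"trip_id": "trip3", "block_id": "200"},
]
first_departure = {"trip1": 25200, "trip2": 26100, "trip3": 25200}

tours = tours_from_blocks(trips, first_departure)
print(tours)
```

Agencies that encode blocks differently would need extra handling on top of this grouping.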

omar-kabbani commented 5 months ago

Update: So I developed something to extract TimedTrack and SimpleStation components - mainly using stop_times.txt since the order of stations is there. If trips go from station A to B and from B to A, then we're assuming double tracks, so I used key to differentiate them (with values 1 or 2 to make the distinction). Also, I used the average scheduled time from A to B (difference of arrival time at B and departure time at A) to estimate the value of traversal_time. Formatting still needs some work, but the idea is there. I pasted below sample input/output, if you want to take a look - feel free to let me know what you think.

Will continue working on the other components.

Sample Input (stop_times.txt)

```csv
trip_id,arrival_time,departure_time,stop_id,stop_sequence
trip1,07:00:00,07:00:00,yonge,1
trip1,07:05:00,07:05:00,bayview,2
trip1,07:20:00,07:20:00,bessarion,3
trip2,07:15:00,07:15:00,bessarion,1
trip2,07:20:00,07:20:00,bayview,2
trip2,07:30:00,07:30:00,yonge,3
trip3,07:00:00,07:00:00,yonge,1
trip3,07:10:00,07:10:00,bayview,2
trip3,07:30:00,07:30:00,bessarion,3
```

Sample Output

```json
{
  "station0": { "type": "SimpleStation", "name": "yonge_1", "u": "yonge_1", "v": "yonge_2", "key": 1 },
  "station2": { "type": "SimpleStation", "name": "bayview_2", "u": "bayview_1", "v": "bayview_2", "key": 2 },
  "station1": { "type": "SimpleStation", "name": "bayview_1", "u": "bayview_1", "v": "bayview_2", "key": 1 },
  "station3": { "type": "SimpleStation", "name": "bessarion_2", "u": "bessarion_1", "v": "bessarion_2", "key": 2 },
  "stationbessarion_1": { "type": "SimpleStation", "name": "bessarion_1", "u": "bessarion_1", "v": "bessarion_2", "key": 1 },
  "stationyonge_2": { "type": "SimpleStation", "name": "yonge_2", "u": "yonge_1", "v": "yonge_2", "key": 2 }
}
{
  "edge0": { "type": "TimedTrack", "u": "yonge_2", "v": "bayview_1", "key": 1, "traversal_time": 450.0 },
  "edge2": { "type": "TimedTrack", "u": "yonge_2", "v": "bayview_1", "key": 2, "traversal_time": 600.0 },
  "edge1": { "type": "TimedTrack", "u": "bayview_2", "v": "bessarion_1", "key": 1, "traversal_time": 1050.0 },
  "edge3": { "type": "TimedTrack", "u": "bayview_2", "v": "bessarion_1", "key": 2, "traversal_time": 300.0 }
}
```
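For reference, the averaging described in this comment can be sketched as follows. This is not the prototype script itself, just a minimal standard-library version under the same assumption: traversal time is the arrival time at the next stop minus the departure time at the current stop, averaged over all trips for each ordered stop pair. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds after midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_traversal_times(stop_times_file):
    """Return {(from_stop, to_stop): mean scheduled seconds} over all trips."""
    by_trip = defaultdict(list)
    for row in csv.DictReader(stop_times_file):
        by_trip[row["trip_id"]].append(row)

    samples = defaultdict(list)
    for rows in by_trip.values():
        rows.sort(key=lambda r: int(r["stop_sequence"]))
        # Pair each stop with the next one along the trip.
        for current, nxt in zip(rows, rows[1:]):
            elapsed = to_seconds(nxt["arrival_time"]) - to_seconds(current["departure_time"])
            samples[(current["stop_id"], nxt["stop_id"])].append(elapsed)
    return {pair: sum(values) / len(values) for pair, values in samples.items()}


SAMPLE_STOP_TIMES = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
trip1,07:00:00,07:00:00,yonge,1
trip1,07:05:00,07:05:00,bayview,2
trip1,07:20:00,07:20:00,bessarion,3
trip2,07:15:00,07:15:00,bessarion,1
trip2,07:20:00,07:20:00,bayview,2
trip2,07:30:00,07:30:00,yonge,3
trip3,07:00:00,07:00:00,yonge,1
trip3,07:10:00,07:10:00,bayview,2
trip3,07:30:00,07:30:00,bessarion,3
"""

times = mean_traversal_times(io.StringIO(SAMPLE_STOP_TIMES))
print(times[("yonge", "bayview")])  # 450.0
```

Run on the sample input, this gives 450 s for yonge to bayview and 1050 s for bayview to bessarion in one direction, and 300 s and 600 s in the other.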

wklumpen commented 5 months ago

That works well Omar - that's exactly what I ended up doing as a one-off for Line 4 Subway in Toronto.

omar-kabbani commented 4 months ago

Just to make sure I understood the concept of tours: my understanding is that in the example below (from the sample TTC Line 4), there's a train doing Tour-1971226, and this train does the following:

1. Heads westbound and makes 5 stops (Don Mills, Leslie, Bessarion, Bayview, and Yonge) - and departure indicates the departure times at each of these stations
2. Then heads eastbound and makes these 5 stops (Yonge, Bayview, Bessarion, Leslie, and Don Mills) - and departure indicates the departure times at each of these stations
3. Heads westbound again (same as step 1)

Did I get that right?

[
  {
    "name": "Tour-1971226",
    "creation_time": 0,
    "deletion_time": 86400,
    "routes": [
      {
        "name": "R-Westbound",
        "args": [
          {
            "departure": 20850
          },
          null,
          {
            "departure": 20975
          },
          null,
          {
            "departure": 21080
          },
          null,
          {
            "departure": 21193
          },
          null,
          {
            "departure": 21354
          }
        ]
      },
      {
        "name": "R-Eastbound",
        "args": [
          {
            "departure": 21450
          },
          null,
          {
            "departure": 21633
          },
          null,
          {
            "departure": 21720
          },
          null,
          {
            "departure": 21795
          },
          null,
          {
            "departure": 21954
          }
        ]
      },
      {
        "name": "R-Westbound",
        "args": [
          {
            "departure": 22170
          },
          null,
          {
            "departure": 22295
          },
          null,
          {
            "departure": 22400
          },
          null,
          {
            "departure": 22513
          },
          null,
          {
            "departure": 22674
          }
        ]
      },

peterlai1 commented 4 months ago

Hi Omar, yes that is correct. In each traversal of a route within a tour, each item in the args list corresponds one-to-one to a component listed in the route, in order. In this example, the departure time arguments apply to the station components, and no args (null) are applied to the track components between stations.
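As a small illustration of that alignment, the sketch below builds an args list for a route that alternates station, track, station, and so on. The helper name `build_route_args` is hypothetical, and the alternating layout is an assumption that matches the example above rather than a general rule.

```python
import json


def build_route_args(departures):
    """Interleave station departure args with None placeholders for the tracks between them."""
    args = []
    for index, departure in enumerate(departures):
        if index > 0:
            args.append(None)  # the track component between two stations takes no args
        args.append({"departure": departure})
    return args


# First three westbound departures from the example above.
args = build_route_args([20850, 20975, 21080])
print(json.dumps(args))
```

When serialized with `json.dumps`, each `None` becomes `null`, matching the tour file format shown in the example.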

omar-kabbani commented 4 months ago

Thanks @peterlai1

Update: I think the logic works - but the formatting needs a bit more work

I also need to add a few more things, such as ignoring non-train routes in GTFS and, at the end, replacing all stop_id values with actual stop names to make things more human readable.

But also, do you have any thoughts regarding fields that require user input (ex: which "station" corresponds to the yard, and the capacities of the yards, stations, and tracks)? I am thinking of setting these as default values for now (so the first component of a route is always a yard, track capacity = 10, and mean boarding/alighting time = 20).

Also, speaking of defaults: I set the tour creation time to zero and deletion time to 2 days (some GTFS datasets go a few hours over 24h but I haven't seen anything go beyond that since it's typically describing the transit schedule for a day)
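A quick sketch of why the two-day default is comfortable: GTFS times may exceed 24:00:00 for service that runs past midnight, so they have to be parsed by hand rather than with a clock-time parser. The helper and constant names below are hypothetical.

```python
DEFAULT_CREATION_TIME = 0
DEFAULT_DELETION_TIME = 2 * 24 * 3600  # two days in seconds, as proposed above


def gtfs_time_to_seconds(value):
    """Convert a GTFS HH:MM:SS string (hours may be 24 or more) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


late_night = gtfs_time_to_seconds("25:30:00")
print(late_night, late_night < DEFAULT_DELETION_TIME)
```

A stop time of 25:30:00 parses to 91800 seconds, which is well inside the 172800-second window.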

Please let me know if you had other thoughts/direction on all this

I pasted a sneak peek of the input/output below if you want to take a look.

Sample Input

`trips.txt` (Ignore that this is a bus route for now)

```csv
route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,block_id,shape_id,wheelchair_accessible,bikes_allowed
A,1,trip1,EAST - 10 VAN HORNE towards VICTORIA PARK,,0,100,998007,1,1
A,1,trip2,EAST - 10 VAN HORNE towards VICTORIA PARK,,1,100,998007,1,1
A,1,trip3,EAST - 10 VAN HORNE towards VICTORIA PARK,,0,200,998007,1,1
```

`stop_times.txt`

```csv
trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
trip1,07:00:00,07:00:00,yonge,1
trip1,07:05:00,07:05:00,bayview,2
trip1,07:20:00,07:20:00,bessarion,3
trip2,07:15:00,07:15:00,bessarion,1
trip2,07:20:00,07:20:00,bayview,2
trip2,07:30:00,07:30:00,yonge,3
trip3,07:00:00,07:00:00,yonge,1
trip3,07:10:00,07:10:00,bayview,2
trip3,07:30:00,07:30:00,bessarion,3
```

Sample Output

`components.json`

```json
[
  { "type": "SimpleStation", "u": "yonge_1", "v": "yonge_2", "key": 1 },
  { "type": "SimpleStation", "u": "bayview_1", "v": "bayview_2", "key": 1 },
  { "type": "SimpleStation", "u": "bessarion_1", "v": "bessarion_2", "key": 1 },
  { "type": "SimpleStation", "u": "bessarion_1", "v": "bessarion_2", "key": 2 },
  { "type": "SimpleStation", "u": "bayview_1", "v": "bayview_2", "key": 2 },
  { "type": "SimpleStation", "u": "yonge_1", "v": "yonge_2", "key": 2 },
  { "type": "TimedTrack", "u": "yonge_2", "v": "bayview_1", "key": 1, "traversal_time": 450.0 },
  { "type": "TimedTrack", "u": "bayview_2", "v": "bessarion_1", "key": 1, "traversal_time": 1050.0 },
  { "type": "TimedTrack", "u": "bessarion_2", "v": "bayview_1", "key": 2, "traversal_time": 300.0 },
  { "type": "TimedTrack", "u": "bayview_2", "v": "yonge_1", "key": 2, "traversal_time": 600.0 }
]
```

`routes.json`

```json
[
  {
    "name": "trip1",
    "components": [
      { "u": "yonge_1", "v": "yonge_2", "key": 1 },
      { "u": "yonge_2", "v": "bayview_1", "key": 1 },
      { "u": "bayview_1", "v": "bayview_2", "key": 1 },
      { "u": "bayview_2", "v": "bessarion_1", "key": 1 },
      { "u": "bessarion_1", "v": "bessarion_2", "key": 1 }
    ]
  },
  {
    "name": "trip2",
    "components": [
      { "u": "bessarion_2", "v": "bessarion_1", "key": 2 },
      { "u": "bessarion_1", "v": "bayview_2", "key": 2 },
      { "u": "bayview_2", "v": "bayview_1", "key": 2 },
      { "u": "bayview_1", "v": "yonge_2", "key": 2 },
      { "u": "yonge_2", "v": "yonge_1", "key": 2 }
    ]
  },
  {
    "name": "trip3",
    "components": [
      { "u": "yonge_1", "v": "yonge_2", "key": 1 },
      { "u": "yonge_2", "v": "bayview_1", "key": 1 },
      { "u": "bayview_1", "v": "bayview_2", "key": 1 },
      { "u": "bayview_2", "v": "bessarion_1", "key": 1 },
      { "u": "bessarion_1", "v": "bessarion_2", "key": 1 }
    ]
  }
]
```

`tours.json`

```json
[
  {
    "name": "100",
    "creation_time": 0,
    "deletion_time": 172800,
    "routes": [
      { "name": "trip1", "args": [ { "departure": 25200 }, null, { "departure": 25500 }, null, { "departure": 26400 } ] },
      { "name": "trip2", "args": [ { "departure": 26100 }, null, { "departure": 26400 }, null, { "departure": 27000 } ] }
    ]
  },
  {
    "name": "200",
    "creation_time": 0,
    "deletion_time": 172800,
    "routes": [
      { "name": "trip3", "args": [ { "departure": 25200 }, null, { "departure": 25800 }, null, { "departure": 27000 } ] }
    ]
  }
]
```

peterlai1 commented 4 months ago

Thanks Omar for your great work so far! I like what you have proposed as the default parameter values. For now we can just use that and maybe have some primitive way for users to customize these constants (later on we can think about ability to customize component types and associated params). The deletion time of 2 days works, I don't think I've yet come across a GTFS feed that has a service day longer than 30 hours, so even that might be sufficient as well, though 2 days is definitely safer.

A couple of things regarding the input/outputs:

1. For routes.json, trip1 and trip3 traverse the exact same set of components in the exact same order, so they are duplicates and we should only keep one of them. Let's say we keep trip1; then tour 200 should use trip1 as its route.
   - This might make the conversion logic more complicated, in the sense that it'll need to scan across all trips for the unique set of sequences of stops/components to use as the set of routes. That is the purpose of having the routes file, though: to avoid having to encode duplicates of the same sequence of components traversed by multiple trips. Let me know if you need any further clarification on this.
2. When you generate the traversal time between two stations, are you taking the average value across all trips based on their stop times at the two stations?
3. Not sure if it's an intentional test case, but I noticed that trip1 is scheduled to arrive at Bessarion later than when the next trip (trip2) is scheduled to leave Bessarion, even though the trips are meant to be connected via the tour.
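The route de-duplication in the first point could look roughly like the sketch below: trips with an identical stop sequence share one route, named after the first trip seen with that sequence. The function name and the choice of naming routes after the first trip are assumptions for illustration.

```python
def dedupe_routes(trip_sequences):
    """Collapse trips with identical stop sequences into shared routes.

    Returns (routes, trip_to_route): routes maps a route name to its stop
    sequence, and trip_to_route maps every trip_id to the route it should use.
    """
    route_for_sequence = {}
    trip_to_route = {}
    for trip_id, stops in trip_sequences.items():
        key = tuple(stops)
        # The first trip seen with a given sequence lends its name to the route.
        route_for_sequence.setdefault(key, trip_id)
        trip_to_route[trip_id] = route_for_sequence[key]
    routes = {name: list(sequence) for sequence, name in route_for_sequence.items()}
    return routes, trip_to_route


trip_sequences = {
    "trip1": ["yonge", "bayview", "bessarion"],
    "trip2": ["bessarion", "bayview", "yonge"],
    "trip3": ["yonge", "bayview", "bessarion"],
}
routes, trip_to_route = dedupe_routes(trip_sequences)
print(sorted(routes), trip_to_route["trip3"])
```

With the sample data, only trip1 and trip2 remain as routes, and trip3 (tour 200) is pointed at trip1.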

omar-kabbani commented 3 months ago

I see, I see. Okay, I updated the script to fix that:

routes.json now only shows trip1 and trip2
tours.json now shows trip1 and trip2 under the 100 tour and trip1 under the 200 tour

For your second point, yeah I am averaging the difference of arrival time at next station and departure time at current station for the station pairs for all trips (you can double check the numbers here https://github.com/spur-sim/spur/issues/27#issuecomment-2016908486 and see if that's what you had in mind)

For your third point - that one's on me - I think I brainlessly entered these times, but you're right, the first departure time for trip2 should be after the last arrival time for trip1!
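A converter could flag this kind of timing conflict automatically. The sketch below is one hypothetical way to do it: for the trips in a block, ordered by first departure, report any pair where the later trip departs before the earlier one has arrived. The function name and input shapes are assumptions.

```python
def find_block_conflicts(block_trips, trip_times):
    """Return (earlier, later) trip pairs whose times overlap within one block.

    trip_times maps trip_id to (first_departure, last_arrival) in seconds.
    """
    ordered = sorted(block_trips, key=lambda trip_id: trip_times[trip_id][0])
    conflicts = []
    for earlier, later in zip(ordered, ordered[1:]):
        if trip_times[later][0] < trip_times[earlier][1]:
            conflicts.append((earlier, later))
    return conflicts


# Times from the sample input: trip1 runs 07:00-07:20, trip2 runs 07:15-07:30.
trip_times = {"trip1": (25200, 26400), "trip2": (26100, 27000)}
conflicts = find_block_conflicts(["trip1", "trip2"], trip_times)
print(conflicts)
```

On the sample data this reports the trip1/trip2 pair, since trip2 leaves Bessarion at 26100 s while trip1 only arrives there at 26400 s.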

Will keep working on this next week and will keep you posted

omar-kabbani commented 2 months ago

Check out PR https://github.com/spur-sim/spur/pull/71 !

wklumpen commented 4 weeks ago

So - I think there are a few things that make GTFS especially tricky. One of them is bundling into tours, as tours require that the last component of tour n is the first component of tour n+1. To do this properly you'd have to make some assumptions about layovers, etc.

First, I think @peterlai1 is right about the block IDs being useful for tours, but they are not required for basic GTFS. So with that in mind I'm going to suggest a very simple approach that makes one big simplification: Each trip is run by a distinct vehicle. That is, each trip on a route is its own tour.

My thinking on the logic is this: build the graph first, using routes.txt to identify rail routes (allow folks to filter as needed), then identify all trips in trips.txt that use those routes, and then use stop_times.txt to build the component network assuming a bidirectional graph. If a stop_id_1 to stop_id_2 link already exists it won't get duplicated. Once the graph is built we can dump the components back into JSON just by iterating through the edges.

Then, we simply create a tour for each trip in trips.txt, create a vehicle in trains.json that is assigned to it, and off we go.
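A bare-bones sketch of these two steps, using plain sets and dicts rather than a graph library. Everything here is hypothetical: the function names, the `tour-` naming, and the empty `args` placeholder. The trains.json side is left out because its schema is not shown in this thread.

```python
def build_links(trip_stops):
    """Collect stop-to-stop links, adding both directions and never duplicating one."""
    links = set()
    for stops in trip_stops.values():
        for a, b in zip(stops, stops[1:]):
            links.add((a, b))
            links.add((b, a))  # bidirectional assumption
    return links


def one_tour_per_trip(trip_stops, deletion_time=172800):
    """Create one tour per trip; each tour has a single route named after the trip."""
    return [
        {
            "name": f"tour-{trip_id}",
            "creation_time": 0,
            "deletion_time": deletion_time,
            # args left empty in this sketch; a real converter would fill in
            # per-component arguments such as station departure times.
            "routes": [{"name": trip_id, "args": []}],
        }
        for trip_id in trip_stops
    ]


trip_stops = {"t1": ["A", "B", "C"], "t2": ["C", "B", "A"]}
links = build_links(trip_stops)
tours = one_tour_per_trip(trip_stops)
print(len(links), [tour["name"] for tour in tours])
```

Because links live in a set, the reverse trip t2 adds nothing new, which is the "won't get duplicated" behaviour described above.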

Then, if we want to do something more compact and realistic with block IDs, we can do that too as a second feature.

I'm going to try and move the logic @omar-kabbani has already wonderfully put together over into this method, and I will probably just incorporate it into a function. I'm thinking it's time for a spur.data module to sit alongside the spur.core module.

Thanks again for everyone's help! I think once we have a basic GTFS converter we'll be able to very rapidly prototype a lot of things, even if there are some simplifications.
