jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License
55 stars 6 forks source link

Add MTA buses config #97

Closed cedarbaum closed 1 year ago

cedarbaum commented 1 year ago

About

This change adds static and realtime trip data for MTA buses.

Issue with trip loops

In my testing so far, this appears to work mostly out of the box except for one error that sometime occurs for routes:

service map cannot be topologically sorted: the provided graph is not topologically sortable because it contains a cycle

I looked into this a bit, and it seems there are loops being created in some GTFS trips. For example, on the M96 route, I observed this:

image

My guess would be this is a bug with the feed, but it's hard to say for sure without a repro outside of transiter. I did try writing a small Python script to poll the feed and look for loops within trips, but I haven't yet seen the same behavior there (this could just be an issue of timing though).

It does seem transiter is resilient to this sort of error, though I am curious if you think it's something we need to understand better before using this config. Also, I am curious if you have any thoughts about where the bug most likely is and what could be done to further investigate.

Other notes:

  1. The GTFS realtime feeds appear to work even without an API key specified in the URL. The official documentation still says to use one, so it is probably still best to follow that.
  2. The documented GTFS realtime feed URLs use HTTP for transport, but HTTPS does work so I've used that instead.

As always, I appreciate any time you're able to spend looking over this and please let me know if you have any feedback when you have a chance.

cedarbaum commented 1 year ago

Quick follow-up - I believe I now understand the trip loop issue. Apparently trip ids are not unique for MTA bus trips. This causes transiter to try and reconcile the stop sequences of multiple trips running on the same route, which creates the cycles.

I believe creating an id from a combination of TripId and VehicleId will be unique, so this could be done as an extension in the gtfs library. I'll look into this next.

jamespfennell commented 1 year ago

Thanks for this PR! I’ve been wanting to add the NYC bus data for the longest time but the old Python version of Transiter was too inefficient so I didn’t even try. With the current version of Transiter I think it’s feasible and the background update process shouldn’t be too CPU intensive. Thanks!

I’m currently away from my computer while abroad, and would like to give this a spin before merging if that’s okay? Will be back next week. In the meantime two comments:

  1. Should we rename the file us-ny-nycbus.yaml or something like that? My initial idea was that the naming convention would be country-state-system.yaml, in which case “buses” is potentially not specific enough for all of NY state. (I think “subway” is specific enough, though in retrospect nycsubway would also be reasonable.)

  2. Thanks for looking into the service map failures! I’ve actually seen these with the subway feed, too! You’re right that Transiter doesn’t fail in this case: it updates all of the trip data, but keeps the old service maps. For the subway the error usually goes away after a few minutes, and because it doesn’t block the update, I never bothered to dig deeper. Do you know if this is a transient issue with the bus feed too?

Trip IDs being reused does sound like a more serious problem though which would be interesting to solve… I think the subway feed also has this problem occasionally. My recollection is that Transiter will silently ignore the second trip in this case when updating (which is not great tbh but it’s hard to think of a better way.) Maybe the service maps code doesn’t ignore subsequent trips which sounds like a bug.

By the way speaking of service maps, Transiter should be resilient to loops in trips because some systems do have such trips (the Circle Line in London, all of the Chicago “L” routes). The service maps feature probably should be more clever that is currently is…but that’s a huge scope problem for another time!

On Wed, Mar 22, 2023, at 8:02 PM, Sam Cedarbaum wrote:

Quick follow-up - I believe I now understand the trip loop issue. Apparently trip ids are not unique for MTA bus trips. This causes transiter to try and reconcile the stop sequences of multiple trips running on the same route, which creates the cycles.

I believe creating an id from a combination of TripId and VehicleId will be unique, so this could be done as an extension in the gtfs library. I'll look into this next.

— Reply to this email directly, view it on GitHub https://github.com/jamespfennell/transiter/pull/97#issuecomment-1480477963, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4T7SVY5R35J7XUDE6T2O3W5OVLXANCNFSM6AAAAAAWDD6QZM. You are receiving this because you are subscribed to this thread.Message ID: @.***>

cedarbaum commented 1 year ago

Thanks for the initial feedback, and it would be great to have you try it on your end when you are back and have time! In the meantime, I've done some more experimentation and have some additional notes below:

  1. As mentioned previously, TripIds are not unique. I believe this can be fixed (at least in the majority of cases) with this PR: https://github.com/jamespfennell/gtfs/pull/2
  2. In addition to non-unique TripIds, stop sequences also don't appear to be stable across updates (though I want to double-check my work here, since this seems very strange, even for GTFS 🙂). For example, consider the following 2 updates for a trip:

Update 1

[{'stopSequence': 0, 'arrival': {'time': '1679544559'}, 'departure': {'time': '1679544559'}, 'stopId': '305733'},
 {'stopSequence': 1, 'arrival': {'time': '1679544574'}, 'departure': {'time': '1679544574'}, 'stopId': '305734'},
 {'stopSequence': 2, 'arrival': {'time': '1679544607'}, 'departure': {'time': '1679544607'}, 'stopId': '305735'}]

Update 2

[{'stopSequence': 2, 'arrival': {'time': '1679544303'}, 'departure': {'time': '1679544303'}, 'stopId': '305734'},
 {'stopSequence': 3, 'arrival': {'time': '1679544336'}, 'departure': {'time': '1679544336'}, 'stopId': '305735'},
 {'stopSequence': 4, 'arrival': {'time': '1679544376'}, 'departure': {'time': '1679544376'}, 'stopId': '305736'}],

This causes some issues when transiter tries to merge the trip timeliness across updates, most noticeably the loop issue mentioned previously.

I was able to fix this by simply deleting all previous stop times on trip updates (i.e. MTA bus trips will only return current and futures stops now). Something more clever could probably be done to merge the timelines.

jamespfennell commented 1 year ago

I just tried this out and I hit the GTFS parsing error you fixed in https://github.com/jamespfennell/gtfs/pull/1! I also noticed that when this parsing error is hit the client command (go run . install ....) actually hangs which is not great, so I fixed that in https://github.com/jamespfennell/transiter/commit/ffab92568c79c9ce634f78134c353ae6c36f3c60.

Also trying to fix the CI failure right now, which is unrelated to your change.

jamespfennell commented 1 year ago

Wow this system has over 12 thousand stops!

jamespfennell commented 1 year ago

Using your fix in the GTFS package I was able to play around with this. It looks awesome! I sanity checked that the data from one of my local bus stops matches the MTA website and it does!

To merge this in, I think we're blocked on the GTFS Go package PR you created? And also, if we want to use that, I think we'll need to change the Transiter Go code to wire in that extension as a parser option? Just like how the subway does it. Unfortunately this is a big tedious...


Regarding service maps, I just realized that they're more complex than the NYC subway in a way that may need changes to the service maps feature itself (not in this PR, to be clear). For the subway, a northbound trip calls at the exact same stations as a southbound trip, just in the reverse order. For the buses this is not true at all. For example, my local bus the B57 goes down Court St on the way south and up Smith St on the way north. The Transiter generated service map isn't useful because it's basically all the northbound stops followed separately by all the southbound stops.

I wonder if Transiter should support creating northbound-only and southbound-only service maps? This could be specified in the service maps config. So for the buses there wouldn't just be a "realtime" service map; there would be a "realtime_northbound" map and a "realtime_southbound" map.

jamespfennell commented 1 year ago

Hmmm just as an FYI, I'm getting lots of "service map cannot be topologically sorted" errors but no duplicate trip IDs. So perhaps there is another reason these maps can't be sorted, in addition to the trip ID duplication issue? I'm going to look further.

cedarbaum commented 1 year ago

Hey @jamespfennell - did you see my theory about unstable stop sequence numbers in my last comment? It may explain why you're still seeing that error. When I completely replace all stop sequences with each new update (i.e. make no attempt to preserve trip history), I no longer get the error. I also validated this theory with a python script that compares stop sequences between updates for the same trip id. Let me know if that makes sense based on what you're seeing on your end.

EDIT: For reference, this is the hack I use to clear previous trip data in updateStopTimesInDB to test this theory:

    var stopTimePksToDelete []int64
    for _, stopTime := range stopSequenceToStopTimePk {
        stopTimePksToDelete = append(stopTimePksToDelete, stopTime.Pk)
    }
    if err := updateCtx.Querier.DeleteTripStopTimes(ctx, stopTimePksToDelete); err != nil {
        return false, err
    }
    stopSequenceToStopTimePk = map[int32]db.ListTripStopTimesForUpdateRow{}
jamespfennell commented 1 year ago

Ah yes, sorry I did read that comment last week but had forgotten it. Do you have any opinion on the best way to fix this? Should Transiter ignore the stop sequences numbers and generate its own? This is what it does for the subway.

Though I do wonder - is the MTA using stop sequences because some trips call at the same stop twice? In this case the GTFS realtime spec does require stop sequences be provided to disambiguate the stop times for the same stop.

jamespfennell commented 1 year ago

By the way, I'm happy to merge in any version of this work that you're happy with. Things like improving the stop sequences situation can always be done in subsequent PRs (or not at all!). From my perspective, even just the new config file (along with the transfers.txt fix) is a great addition. Totally up to you.

jamespfennell commented 1 year ago

I think there is a quick fix for the service maps issue alone. When calculating the service map, we could skip all stops times that are in the past. It's a one line fix to this one SQL query - it just needs an additional WHERE trip_stop_time.past IS FALSE clause. I should probably do this anyway it seems? Like the realtime service maps should show only current/future service

This wouldn't fix the trip data itself, and as in your comment above, we would still see duplicate past stops.

cedarbaum commented 1 year ago

Thanks for the service maps fix! I agree that this can be merged safely with just that (after bumping gtfs version and also addressing other comments). However, I also think there is some value in one or both of the below configuration options (which could be added to the YAML definitions):

1.) removeTripHistory: this option, when true, would make no attempt to keep past trips. This would probably be fine for many use cases (e.g., arrival boards). 2.) reassignStopSequences: this option (based on your suggestion) would ignore the stop sequences from GTFS and assign sequentially (assuming no duplicate stop ids). I think this would be OK in most cases but, as you mentioned above, it would fail for routes that do indeed revisit stops.

So going forward, I can cleanup this PR and potentially also work on (1) or (2) if you think this an OK approach. Let me know your thoughts!

jamespfennell commented 1 year ago

Awesome, that sounds great!

I personally think (2) is the way to go because I think it will end up making the MTA buses work correctly with trip history. The implementation shouldn't be too bad either? After adding the reassignStopSequences field to the feed config, I think it would just be a code change in realtime.go to clear the stop sequence numbers if the new config field is true. This clearing just needs to happen before the stop sequences are automatically assigned by Transiter.

If you wanted to implement that that would be awesome! Could do it in this PR, or a follow up PR.


By the way, I think the downside of the new configuration option only really applies if you're trying to link the realtime data with the static data. My understanding is that the stop sequences give you a way of linking a stop time prediction in the realtime data to the associated scheduled time in GTFS static. For the MTA buses this realtime-static linking will presumably always be broken anyway because of the bug you discovered. But it's definitely a limitation of the new option that we should document. (To be clear Transiter currently doesn't attempt to do this realtime-static linking, but it's something I want to do in the future because it's needed to get some transit systems working, like the SF BART.)


Another random interesting thing (sorry for the brain dump) is that the MTA bus transit system is really big, and when I play with it locally I'm seeing trip feed updates take 7 or 8 seconds. Having this system in the repo will be great for performance benchmarking. I just took a pprof profile out of curiosity, and am finding that a lot of the slowness is because Transiter updates trip stop times with individual SQL queries, one at a time. This is slow just because for each SQL query, a network call has to be made to Postgres. Maybe in the future there is some optimization that could be done based on this kind of profiling.

cedarbaum commented 1 year ago

Another random interesting thing (sorry for the brain dump) is that the MTA bus transit system is really big, and when I play with it locally I'm seeing trip feed updates take 7 or 8 seconds. Having this system in the repo will be great for performance benchmarking. I just took a pprof profile out of curiosity, and am finding that a lot of the slowness is because Transiter updates trip stop times with individual SQL queries, one at a time. This is slow just because for each SQL query, a network call has to be made to Postgres. Maybe in the future there is some optimization that could be done based on this kind of profiling.

Interesting! I was sort of thinking about this a bit when I was looking at that code, but figured Postgres would do some magic to optimize these operations within the same transaction. Do you think a multi-row UPSERT would help? Or, in the case of my hack to forget trip history, a full delete + multi-row INSERT.

jamespfennell commented 1 year ago

Yeah I was thinking a multi-row upsert (or even just update because inserts are relatively rare) would befaster. The main motivation would be to avoid the overhead of making a SQL call to Postgres for each stop time update update. Not sure how hard it is though! I think Postgres supports it, but not sure about sqlc.

jamespfennell commented 1 year ago

I created a new issue for the optimizing the feed updater, just so that I don't derail your PR :) https://github.com/jamespfennell/transiter/issues/100

jamespfennell commented 1 year ago

Awesome, thank you!