Closed cedarbaum closed 1 year ago
Quick follow-up - I believe I now understand the trip loop issue. Apparently trip ids are not unique for MTA bus trips. This causes transiter to try and reconcile the stop sequences of multiple trips running on the same route, which creates the cycles.
I believe creating an id from a combination of TripId
and VehicleId
will be unique, so this could be done as an extension in the gtfs
library. I'll look into this next.
Thanks for this PR! I’ve been wanting to add the NYC bus data for the longest time but the old Python version of Transiter was too inefficient so I didn’t even try. With the current version of Transiter I think it’s feasible and the background update process shouldn’t be too CPU intensive. Thanks!
I’m currently away from my computer while abroad, and would like to give this a spin before merging if that’s okay? Will be back next week. In the meantime two comments:
Should we rename the file us-ny-nycbus.yaml or something like that? My initial idea was that the naming convention would be country-state-system.yaml, in which case “buses” is potentially not specific enough for all of NY state. (I think “subway” is specific enough, though in retrospect nycsubway would also be reasonable.)
Thanks for looking into the service map failures! I’ve actually seen these with the subway feed, too! You’re right that Transiter doesn’t fail in this case: it updates all of the trip data, but keeps the old service maps. For the subway the error usually goes away after a few minutes, and because it doesn’t block the update, I never bothered to dig deeper. Do you know if this is a transient issue with the bus feed too?
Trip IDs being reused does sound like a more serious problem though which would be interesting to solve… I think the subway feed also has this problem occasionally. My recollection is that Transiter will silently ignore the second trip in this case when updating (which is not great tbh but it’s hard to think of a better way.) Maybe the service maps code doesn’t ignore subsequent trips which sounds like a bug.
By the way speaking of service maps, Transiter should be resilient to loops in trips because some systems do have such trips (the Circle Line in London, all of the Chicago “L” routes). The service maps feature probably should be more clever that is currently is…but that’s a huge scope problem for another time!
On Wed, Mar 22, 2023, at 8:02 PM, Sam Cedarbaum wrote:
Quick follow-up - I believe I now understand the trip loop issue. Apparently trip ids are not unique for MTA bus trips. This causes transiter to try and reconcile the stop sequences of multiple trips running on the same route, which creates the cycles.
I believe creating an id from a combination of
TripId
andVehicleId
will be unique, so this could be done as an extension in thegtfs
library. I'll look into this next.— Reply to this email directly, view it on GitHub https://github.com/jamespfennell/transiter/pull/97#issuecomment-1480477963, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4T7SVY5R35J7XUDE6T2O3W5OVLXANCNFSM6AAAAAAWDD6QZM. You are receiving this because you are subscribed to this thread.Message ID: @.***>
Thanks for the initial feedback, and it would be great to have you try it on your end when you are back and have time! In the meantime, I've done some more experimentation and have some additional notes below:
TripId
s are not unique. I believe this can be fixed (at least in the majority of cases) with this PR: https://github.com/jamespfennell/gtfs/pull/2TripId
s, stop sequences also don't appear to be stable across updates (though I want to double-check my work here, since this seems very strange, even for GTFS 🙂). For example, consider the following 2 updates for a trip:[{'stopSequence': 0, 'arrival': {'time': '1679544559'}, 'departure': {'time': '1679544559'}, 'stopId': '305733'},
{'stopSequence': 1, 'arrival': {'time': '1679544574'}, 'departure': {'time': '1679544574'}, 'stopId': '305734'},
{'stopSequence': 2, 'arrival': {'time': '1679544607'}, 'departure': {'time': '1679544607'}, 'stopId': '305735'}]
[{'stopSequence': 2, 'arrival': {'time': '1679544303'}, 'departure': {'time': '1679544303'}, 'stopId': '305734'},
{'stopSequence': 3, 'arrival': {'time': '1679544336'}, 'departure': {'time': '1679544336'}, 'stopId': '305735'},
{'stopSequence': 4, 'arrival': {'time': '1679544376'}, 'departure': {'time': '1679544376'}, 'stopId': '305736'}],
This causes some issues when transiter tries to merge the trip timeliness across updates, most noticeably the loop issue mentioned previously.
I was able to fix this by simply deleting all previous stop times on trip updates (i.e. MTA bus trips will only return current and futures stops now). Something more clever could probably be done to merge the timelines.
I just tried this out and I hit the GTFS parsing error you fixed in https://github.com/jamespfennell/gtfs/pull/1! I also noticed that when this parsing error is hit the client command (go run . install ....
) actually hangs which is not great, so I fixed that in https://github.com/jamespfennell/transiter/commit/ffab92568c79c9ce634f78134c353ae6c36f3c60.
Also trying to fix the CI failure right now, which is unrelated to your change.
Wow this system has over 12 thousand stops!
Using your fix in the GTFS package I was able to play around with this. It looks awesome! I sanity checked that the data from one of my local bus stops matches the MTA website and it does!
To merge this in, I think we're blocked on the GTFS Go package PR you created? And also, if we want to use that, I think we'll need to change the Transiter Go code to wire in that extension as a parser option? Just like how the subway does it. Unfortunately this is a big tedious...
Regarding service maps, I just realized that they're more complex than the NYC subway in a way that may need changes to the service maps feature itself (not in this PR, to be clear). For the subway, a northbound trip calls at the exact same stations as a southbound trip, just in the reverse order. For the buses this is not true at all. For example, my local bus the B57 goes down Court St on the way south and up Smith St on the way north. The Transiter generated service map isn't useful because it's basically all the northbound stops followed separately by all the southbound stops.
I wonder if Transiter should support creating northbound-only and southbound-only service maps? This could be specified in the service maps config. So for the buses there wouldn't just be a "realtime" service map; there would be a "realtime_northbound" map and a "realtime_southbound" map.
Hmmm just as an FYI, I'm getting lots of "service map cannot be topologically sorted" errors but no duplicate trip IDs. So perhaps there is another reason these maps can't be sorted, in addition to the trip ID duplication issue? I'm going to look further.
Hey @jamespfennell - did you see my theory about unstable stop sequence numbers in my last comment? It may explain why you're still seeing that error. When I completely replace all stop sequences with each new update (i.e. make no attempt to preserve trip history), I no longer get the error. I also validated this theory with a python script that compares stop sequences between updates for the same trip id. Let me know if that makes sense based on what you're seeing on your end.
EDIT: For reference, this is the hack I use to clear previous trip data in updateStopTimesInDB
to test this theory:
var stopTimePksToDelete []int64
for _, stopTime := range stopSequenceToStopTimePk {
stopTimePksToDelete = append(stopTimePksToDelete, stopTime.Pk)
}
if err := updateCtx.Querier.DeleteTripStopTimes(ctx, stopTimePksToDelete); err != nil {
return false, err
}
stopSequenceToStopTimePk = map[int32]db.ListTripStopTimesForUpdateRow{}
Ah yes, sorry I did read that comment last week but had forgotten it. Do you have any opinion on the best way to fix this? Should Transiter ignore the stop sequences numbers and generate its own? This is what it does for the subway.
Though I do wonder - is the MTA using stop sequences because some trips call at the same stop twice? In this case the GTFS realtime spec does require stop sequences be provided to disambiguate the stop times for the same stop.
By the way, I'm happy to merge in any version of this work that you're happy with. Things like improving the stop sequences situation can always be done in subsequent PRs (or not at all!). From my perspective, even just the new config file (along with the transfers.txt
fix) is a great addition. Totally up to you.
I think there is a quick fix for the service maps issue alone. When calculating the service map, we could skip all stops times that are in the past. It's a one line fix to this one SQL query - it just needs an additional WHERE trip_stop_time.past IS FALSE
clause. I should probably do this anyway it seems? Like the realtime service maps should show only current/future service
This wouldn't fix the trip data itself, and as in your comment above, we would still see duplicate past stops.
Thanks for the service maps fix! I agree that this can be merged safely with just that (after bumping gtfs
version and also addressing other comments). However, I also think there is some value in one or both of the below configuration options (which could be added to the YAML definitions):
1.) removeTripHistory
: this option, when true, would make no attempt to keep past trips. This would probably be fine for many use cases (e.g., arrival boards).
2.) reassignStopSequences
: this option (based on your suggestion) would ignore the stop sequences from GTFS and assign sequentially (assuming no duplicate stop ids). I think this would be OK in most cases but, as you mentioned above, it would fail for routes that do indeed revisit stops.
So going forward, I can cleanup this PR and potentially also work on (1) or (2) if you think this an OK approach. Let me know your thoughts!
Awesome, that sounds great!
I personally think (2) is the way to go because I think it will end up making the MTA buses work correctly with trip history. The implementation shouldn't be too bad either? After adding the reassignStopSequences
field to the feed config, I think it would just be a code change in realtime.go to clear the stop sequence numbers if the new config field is true. This clearing just needs to happen before the stop sequences are automatically assigned by Transiter.
If you wanted to implement that that would be awesome! Could do it in this PR, or a follow up PR.
By the way, I think the downside of the new configuration option only really applies if you're trying to link the realtime data with the static data. My understanding is that the stop sequences give you a way of linking a stop time prediction in the realtime data to the associated scheduled time in GTFS static. For the MTA buses this realtime-static linking will presumably always be broken anyway because of the bug you discovered. But it's definitely a limitation of the new option that we should document. (To be clear Transiter currently doesn't attempt to do this realtime-static linking, but it's something I want to do in the future because it's needed to get some transit systems working, like the SF BART.)
Another random interesting thing (sorry for the brain dump) is that the MTA bus transit system is really big, and when I play with it locally I'm seeing trip feed updates take 7 or 8 seconds. Having this system in the repo will be great for performance benchmarking. I just took a pprof profile out of curiosity, and am finding that a lot of the slowness is because Transiter updates trip stop times with individual SQL queries, one at a time. This is slow just because for each SQL query, a network call has to be made to Postgres. Maybe in the future there is some optimization that could be done based on this kind of profiling.
Another random interesting thing (sorry for the brain dump) is that the MTA bus transit system is really big, and when I play with it locally I'm seeing trip feed updates take 7 or 8 seconds. Having this system in the repo will be great for performance benchmarking. I just took a pprof profile out of curiosity, and am finding that a lot of the slowness is because Transiter updates trip stop times with individual SQL queries, one at a time. This is slow just because for each SQL query, a network call has to be made to Postgres. Maybe in the future there is some optimization that could be done based on this kind of profiling.
Interesting! I was sort of thinking about this a bit when I was looking at that code, but figured Postgres would do some magic to optimize these operations within the same transaction. Do you think a multi-row UPSERT
would help? Or, in the case of my hack to forget trip history, a full delete + multi-row INSERT
.
Yeah I was thinking a multi-row upsert (or even just update because inserts are relatively rare) would befaster. The main motivation would be to avoid the overhead of making a SQL call to Postgres for each stop time update update. Not sure how hard it is though! I think Postgres supports it, but not sure about sqlc
.
I created a new issue for the optimizing the feed updater, just so that I don't derail your PR :) https://github.com/jamespfennell/transiter/issues/100
Awesome, thank you!
About
This change adds static and realtime trip data for MTA buses.
Issue with trip loops
In my testing so far, this appears to work mostly out of the box except for one error that sometime occurs for routes:
I looked into this a bit, and it seems there are loops being created in some GTFS trips. For example, on the M96 route, I observed this:
My guess would be this is a bug with the feed, but it's hard to say for sure without a repro outside of transiter. I did try writing a small Python script to poll the feed and look for loops within trips, but I haven't yet seen the same behavior there (this could just be an issue of timing though).
It does seem transiter is resilient to this sort of error, though I am curious if you think it's something we need to understand better before using this config. Also, I am curious if you have any thoughts about where the bug most likely is and what could be done to further investigate.
Other notes:
As always, I appreciate any time you're able to spend looking over this and please let me know if you have any feedback when you have a chance.