jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License
55 stars 6 forks source link

Handle vehicle GTFS entities #110

Closed cedarbaum closed 11 months ago

cedarbaum commented 11 months ago

Overview

This change allows Vehicle entities parsed from the GTFS library to be added to the DB and also implements several API endpoints for retrieving vehicles. There are 3 significant changes/design decisions that are worth calling out and may require further discussion:

  1. Vehicles without an ID are ignored. IDs are not technically required by the spec, since the associated trip can also be used as an identifier. However, requiring them simplifies this initial implementation and removing this requirement in the future should not require significant changes.
  2. Instead of having a foreign key reference to a vehicle's associated trip (trip_pk), the trip_pk column is replaced with trip_id and the associated trip doesn't have to exist in the trip table. This is to avoid cases where concurrent updates to trip and vehicle tables interfere with each other and cause insert/update errors (e.g., the associated trip was deleted) or deadlocks.
  3. The onlyProcessFullEntities option is added for realtime feeds, which allows entity references (e.g., the minimal trip info in a vehicle entity) to be ignored. This is to stop feeds from conflicting with each other in updates.

Schema changes

Realtime update changes

API changes

Feed config changes

us-ny-buses system config changes

Testing

jamespfennell commented 11 months ago

Thank you for this PR - it looks like it was a lot of work!! The scope was bigger than I thought: I didn't realize we could support geographic search for vehicles, but it's a great idea! Also I appreciate the effort you put into the hugely helpful PR description.

Because this is a large PR, I'll just make bigger picture comments in this round of review and then in a subsequent round can look at more detail at the code, if that sounds okay.

First, addressing the 3 things you highlighted at the start:

  1. Vehicles without an ID are ignored...

Given the scope of the feature, I think it's the right decision to ignore non-id vehicles for the moment. Adding support later will probably be fine. There is also the possibility we never will support these - it's not clear that these vehicles exist in the wild?

  1. Instead of having a foreign key reference to a vehicle's associated trip (trip_pk), the trip_pk column is replaced with trip_id...

I'm personally a big fan of the data integrity guarantees that foreign keys give, so instinctively I don't love the idea of the trip_id column. However, if I understand correctly, is the problem that we may parse a GTFS realtime feed that has a VehiclePosition but the associated TripUpdate is in a different feed? And so a regular trip_pk foreign key won't work because there is no trip yet to foreign key into? This is indeed tricky, and is related to your point (3) so I'll discuss that more below.

Btw, just you mentioned insert/update errors and deadlocks so I just want to clarify what Postgres gives us here. If we have two concurrent updates that are inconsistent (say a vehicle is being inserted that foreign keys into a preexisting trip, but we're concurrently deleting that trip), Postgres will fail one of the transactions, the associated feed update will fail, and the other feed update will succeed. The end state in the database will never be inconsistent. It will be equivalent to as if just one of the updates was never started at all. This is a guarantee we get from Postgres's implementation of transactions.

For realtime feeds, where by default we update every 5 seconds, this happening very occasionally is totally fine in my opinion. We'll just retry the failing update, and given the new state of the database the retry will succeed. Finally, feed updates in Transiter are fast (~400ms for the 123456 NYC subway feed, which is the slowest one) so overlapping feed updates are not super common? But maybe the bus is worse.

  1. The onlyProcessFullEntities option is added for realtime feeds...

This is tricky! When thinking about adding vehicle support I never realized this would happen! If I understand correctly, the problem is that we may get a trip update that has isFullEntity: true, say, and then in a subsequent update for a different feed, get the same trip but with isFullEntity: false. With the current logic we would just blow away the first (real!) trip update because our current algorithm just blindly overwrites data, irrespective of the isFullEntity field. And so the idea of this field is to skip updates that have isFullEntity false.

Two thoughts.

Firstly, adding the new config seems like the simplest solution and so I'm totally happy to do that for the moment, especially given the scope of this work. In general though I don't love these kinds of configs because to use them correctly you have to learn something about the feed dynamics. The ideal case is that Transiter works out of the box and handles whatever is thrown at it. But again, it seems like a good thing to do for the moment.

Second, I wonder is the following a general solution to this issue. We could add the isFullEntity to the database in the trip and vehicle columns, and persist it (but not return it in the API). Then the updater logic would be:

If this sounded good we could do this in a follow-up.

cedarbaum commented 11 months ago

Thanks for the initial feedback! Adding some follow-ups below:

  1. Vehicles without an ID are ignored...

Given the scope of the feature, I think it's the right decision to ignore non-id vehicles for the moment. Adding support later will probably be fine. There is also the possibility we never will support these - it's not clear that these vehicles exist in the wild?

Sounds like we're aligned here!

  1. Instead of having a foreign key reference to a vehicle's associated trip (trip_pk), the trip_pk column is replaced with trip_id...

I'm personally a big fan of the data integrity guarantees that foreign keys give, so instinctively I don't love the idea of the trip_id column. However, if I understand correctly, is the problem that we may parse a GTFS realtime feed that has a VehiclePosition but the associated TripUpdate is in a different feed? And so a regular trip_pk foreign key won't work because there is no trip yet to foreign key into? This is indeed tricky, and is related to your point (3) so I'll discuss that more below.

Btw, just you mentioned insert/update errors and deadlocks so I just want to clarify what Postgres gives us here. If we have two concurrent updates that are inconsistent (say a vehicle is being inserted that foreign keys into a preexisting trip, but we're concurrently deleting that trip), Postgres will fail one of the transactions, the associated feed update will fail, and the other feed update will succeed. The end state in the database will never be inconsistent. It will be equivalent to as if just one of the updates was never started at all. This is a guarantee we get from Postgres's implementation of transactions.

Your understanding is correct here. While running updates using a strict foreign key requirement, there would occasionally be transaction failures and deadlocks at the DB level. However, I unfortunately don't have a good sense for how often this occurs with the MTA buses feed. My initial implementation was based on the initial schema design with a strict foreign key requirement, so I can try to rerun this and get a better sense for how often it happens. I am happy to go ahead with either design and I may have gotten too caught up in trying to get to 0 transaction conflicts 😅.

If you have time and would also like to experiment with the different implementations in practice, I have a branch for each:

Aside from how these implementations relate vehicles to trips, they should behave the same.

  1. The onlyProcessFullEntities option is added for realtime feeds...

This is tricky! When thinking about adding vehicle support I never realized this would happen! If I understand correctly, the problem is that we may get a trip update that has isFullEntity: true, say, and then in a subsequent update for a different feed, get the same trip but with isFullEntity: false. With the current logic we would just blow away the first (real!) trip update because our current algorithm just blindly overwrites data, irrespective of the isFullEntity field. And so the idea of this field is to skip updates that have isFullEntity false.

Definitely agree that this is not a great long-term solution. That being said, I think it is a sufficiently complex problem that a follow-up would be better (especially since this change is already a bit large). Please let me know if there's anything from my end needed for this currently (e.g., adding comments or creating another task/issue).

jamespfennell commented 11 months ago

That's interesting about seeing the transaction failures on the bus system! Do you happen to have the logs that you see in that case? Would definitely be good to look out for when doing future iterations of this work.

Overall, everything here sounds good. I left a few comments so just waiting for those small changes I guess!

cedarbaum commented 11 months ago

I ran the https://github.com/cedarbaum/transiter/tree/vehicles_trip_pk branch for ~2.5 hours and it looks like it encountered deadlocks 9 times. Below are the transiter logs as well as an example deadlock error description from the DB.

So it's not too bad, but also not quite as rare as I initially hoped. At this point I have both implementations (trip_pk and trip_id) up to date with your initial comments, so we can go forward with either. Let me know if you have any thoughts given this new data.

Transiter logs (updating trip and vehicle entities every 5 seconds for us-ny-buses):

time=2023-08-02T15:33:59.448-04:00 level=INFO msg="starting system scheduler" system_id=us-ny-buses
time=2023-08-02T15:41:42.682-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 2.637281708s" system_id=us-ny-buses feed_id=trips update_id=462832cd-8784-47f6-a9f1-c8b5fd08f418
time=2023-08-02T15:42:07.952-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 2.90654975s" system_id=us-ny-buses feed_id=trips update_id=fcdad44d-e42a-495f-a077-b031f50773f8
time=2023-08-02T15:57:58.735-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 3.683029833s" system_id=us-ny-buses feed_id=trips update_id=38c9fa7d-9382-4461-86e9-b50565972480
time=2023-08-02T17:03:48.936-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 3.852355333s" system_id=us-ny-buses feed_id=trips update_id=ed666a60-f6dd-4387-9c43-e5126ae1be02
time=2023-08-02T17:22:57.561-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 2.481488291s" system_id=us-ny-buses feed_id=trips update_id=3bc7fc2f-c17f-4e02-bda9-cbd80582c4f8
time=2023-08-02T17:23:23.289-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 3.21523575s" system_id=us-ny-buses feed_id=trips update_id=6bec1aec-ecb2-4b46-848d-65259d97ad69
time=2023-08-02T17:23:48.135-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 3.059139458s" system_id=us-ny-buses feed_id=trips update_id=d25f3b35-1aee-4f30-a5e8-41ab60cb0c65
time=2023-08-02T17:48:22.671-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 2.595097792s" system_id=us-ny-buses feed_id=trips update_id=da18d765-9bad-4301-86a6-ddb176c34896
time=2023-08-02T17:53:48.700-04:00 level=ERROR msg="failed update with reason FAILED_UPDATE_ERROR and error ERROR: deadlock detected (SQLSTATE 40P01) in 3.630056416s" system_id=us-ny-buses feed_id=trips update_id=cf10db71-5f8c-4ae1-ad6f-c6073b3767b0
^Ctime=2023-08-02T18:10:55.601-04:00 level=INFO msg="recieved cancellation signal; starting server shutdown"

An example error from the DB logs:

2023-08-02 19:41:42.681 UTC [56722] ERROR:  deadlock detected
2023-08-02 15:41:42 2023-08-02 19:41:42.681 UTC [56722] DETAIL:  Process 56722 waits for ShareLock on transaction 301871; blocked by process 56719.
2023-08-02 15:41:42     Process 56719 waits for ShareLock on transaction 301870; blocked by process 56722.
2023-08-02 15:41:42     Process 56722: -- name: DeleteStaleTrips :many
2023-08-02 15:41:42     DELETE FROM trip
2023-08-02 15:41:42     WHERE
2023-08-02 15:41:42         trip.feed_pk = $1
2023-08-02 15:41:42         AND NOT trip.pk = ANY($2::bigint[])
2023-08-02 15:41:42     RETURNING trip.route_pk
2023-08-02 15:41:42 
2023-08-02 15:41:42     Process 56719: -- name: UpdateVehicle :exec
2023-08-02 15:41:42     UPDATE vehicle
2023-08-02 15:41:42     SET trip_pk = $1,
2023-08-02 15:41:42         label = $2,
2023-08-02 15:41:42         license_plate = $3,
2023-08-02 15:41:42         current_status = $4,
2023-08-02 15:41:42         latitude = $5,
2023-08-02 15:41:42         longitude = $6,
2023-08-02 15:41:42         bearing = $7,
2023-08-02 15:41:42         odometer = $8,
2023-08-02 15:41:42         speed = $9,
2023-08-02 15:41:42         congestion_level = $10,
2023-08-02 15:41:42         updated_at = $11,
2023-08-02 15:41:42         current_stop_pk = $12,
2023-08-02 15:41:42         current_stop_sequence = $13,
2023-08-02 15:41:42         occupancy_status = $14,
2023-08-02 15:41:42         feed_pk = $15,
2023-08-02 15:41:42         occupancy_percentage = $16
2023-08-02 15:41:42     WHERE vehicle.pk = $17
2023-08-02 15:41:42 
2023-08-02 15:41:42 2023-08-02 19:41:42.681 UTC [56722] HINT:  See server log for query details.
2023-08-02 15:41:42 2023-08-02 19:41:42.681 UTC [56722] CONTEXT:  while updating tuple (29,26) in relation "vehicle"
2023-08-02 15:41:42     SQL statement "UPDATE ONLY "public"."vehicle" SET "trip_pk" = NULL WHERE $1 OPERATOR(pg_catalog.=) "trip_pk""
2023-08-02 15:41:42 2023-08-02 19:41:42.681 UTC [56722] STATEMENT:  -- name: DeleteStaleTrips :many
2023-08-02 15:41:42     DELETE FROM trip
2023-08-02 15:41:42     WHERE
2023-08-02 15:41:42         trip.feed_pk = $1
2023-08-02 15:41:42         AND NOT trip.pk = ANY($2::bigint[])
2023-08-02 15:41:42     RETURNING trip.route_pk
2023-08-02 15:41:42 
2023-08-02 15:42:07 2023-08-02 19:42:07.943 UTC [56722] ERROR:  deadlock detected
2023-08-02 15:42:07 2023-08-02 19:42:07.943 UTC [56722] DETAIL:  Process 56722 waits for ShareLock on transaction 301893; blocked by process 56719.
2023-08-02 15:42:07     Process 56719 waits for ShareLock on transaction 301892; blocked by process 56722.
2023-08-02 15:42:07     Process 56722: -- name: DeleteStaleTrips :many
2023-08-02 15:42:07     DELETE FROM trip
2023-08-02 15:42:07     WHERE
2023-08-02 15:42:07         trip.feed_pk = $1
2023-08-02 15:42:07         AND NOT trip.pk = ANY($2::bigint[])
2023-08-02 15:42:07     RETURNING trip.route_pk
2023-08-02 15:42:07 
2023-08-02 15:42:07     Process 56719: -- name: UpdateVehicle :exec
2023-08-02 15:42:07     UPDATE vehicle
2023-08-02 15:42:07     SET trip_pk = $1,
2023-08-02 15:42:07         label = $2,
2023-08-02 15:42:07         license_plate = $3,
2023-08-02 15:42:07         current_status = $4,
2023-08-02 15:42:07         latitude = $5,
2023-08-02 15:42:07         longitude = $6,
2023-08-02 15:42:07         bearing = $7,
2023-08-02 15:42:07         odometer = $8,
2023-08-02 15:42:07         speed = $9,
2023-08-02 15:42:07         congestion_level = $10,
2023-08-02 15:42:07         updated_at = $11,
2023-08-02 15:42:07         current_stop_pk = $12,
2023-08-02 15:42:07         current_stop_sequence = $13,
2023-08-02 15:42:07         occupancy_status = $14,
2023-08-02 15:42:07         feed_pk = $15,
2023-08-02 15:42:07         occupancy_percentage = $16
2023-08-02 15:42:07     WHERE vehicle.pk = $17
2023-08-02 15:42:07 
2023-08-02 15:42:07 2023-08-02 19:42:07.943 UTC [56722] HINT:  See server log for query details.
2023-08-02 15:42:07 2023-08-02 19:42:07.943 UTC [56722] CONTEXT:  while updating tuple (182,41) in relation "vehicle"
2023-08-02 15:42:07     SQL statement "UPDATE ONLY "public"."vehicle" SET "trip_pk" = NULL WHERE $1 OPERATOR(pg_catalog.=) "trip_pk""
2023-08-02 15:42:07 2023-08-02 19:42:07.943 UTC [56722] STATEMENT:  -- name: DeleteStaleTrips :many
2023-08-02 15:42:07     DELETE FROM trip
2023-08-02 15:42:07     WHERE
2023-08-02 15:42:07         trip.feed_pk = $1
2023-08-02 15:42:07         AND NOT trip.pk = ANY($2::bigint[])
2023-08-02 15:42:07     RETURNING trip.route_pk
2023-08-02 15:42:07 
jamespfennell commented 11 months ago

Thanks for the logs! Super interesting.

I filed #114 as a follow-up for this issue. As I mention there, there is a correctness concern that is perhaps even worse than the deadlocks. If we have a trip, and then we update the vehicles feed that references the trip but with IsEntityInFeed: false, then I think we'll actually delete all of the trip's data. It will be replaced by the IsEntityInFeed: false trip that contains no data. In the API this will manifest as the stop times for the trip appearing (when the trips feed is updated) and then disappearing (when the vehicles feed is updated) over-and-over again. Does this sound right?

Because of this I definitely think we should be conservative here and tackle the full solution separately. For the moment your new setting onlyProcessFullEntities allows us to dodge the problem entirely.

cedarbaum commented 11 months ago

Just read through that issue again and I think we're in agreement for the necessity of onlyProcessFullEntities for now (at least for systems with both vehicle and trip feeds). I originally considered just always ignoring partial entities but, for systems like the NYC subway, it is harmless to allow partial vehicles to populate the vehicle table.

To clarify the above experiment, this was done with with the onlyProcessFullEntities option enabled for both feeds but using the original trip_pk foreign key. This causes periodic deadlocks but there shouldn't be any correctness concerns.

We then also have the design where trip_pk is replaced with trip_id, allowing us to avoid deadlocks (since vehicle and trip tables are fully decoupled in the update logic) but has less data integrity guarantees, since a trip_id may not exist in the trip table. Such vehicle entities with non-existent trips may be exposed to API consumers, but since they are joined with the trip table at fetch time, we at least shouldn't ever expose an invalid trip reference.

So I would argue each approach, in conjunction with the onlyProcessFullEntities option, is reasonably correct but the trade-off is more failed updates (trip_pk) vs a less data integrity guarantees (trip_id).

jamespfennell commented 11 months ago

To clarify the above experiment, this was done with with the onlyProcessFullEntities option enabled for both feeds but using the original trip_pk foreign key. This causes periodic deadlocks but there shouldn't be any correctness concerns.

Ah, thanks for the clarification! So looking more closely at the error, it seems that we're concurrently deleting the trip (because it has been removed from the trips feed) but updating a vehicle to point at the trip (because the trip is still referenced in the vehicles feed). And this just keep repeating until one feed update wins due to some timing luck.

To me, this is a bug that we should fix (but maybe in a follow-up because it seems pretty delicate?). We should be able to handle concurrent updates like this. If there is a conflict, the resolution should be one of the transactions failing, and the other succeeding, not both failing due to a deadlock. Or maybe both succeeding if we're clever enough.

Doing some searching, I think we may be able to solve it by using SELECT ... FOR UPDATE at the right point in time. For example, when retrieving the trips that may be associated to some vehicles, we could use SELECT ... FOR UPDATE instead of vanilla SELECT. This will result in the Postgres trip rows being locked for writing, in the transaction that is updating vehicles. Then, if a concurrent update for trip starts, the second trips transaction will wait for the first vehicles transaction to release the lock on the trips before continuing.

In terms of how to move forward on the PR, I really don't mind either option (trip_pk or trip_id). I think that we could iterate on the feature after this first initial PR is submitted. (Always supporting non-full entities and solving the deadlocks issue both sound interesting to me and I would be up for coding either or both). So overall my thought is that while this PR does implement most of the vehicles feature, we can keep working on it after submission and so this PR doesn't need to be the final gold-plated version at all.

cedarbaum commented 11 months ago

Make sense, thanks! Given the discussion so far, I went ahead and went with the trip_pk approach for this PR. I think it probably does still have some deadlock/perf issues, but these are hopefully addressable in the future and, like you pointed out initially, is schematically cleaner. Sorry about the back and forth on this; I do really appreciate the discussion!

I've sent a revision that uses the original trip FK design. The main changes from the original revision will be at the query layer and the updateVehicles function. I believe I've also addressed the other initial comments that you had.

jamespfennell commented 11 months ago

SGTM! Just left one last question about the API and then can submit :)

jamespfennell commented 11 months ago

Thank you very much for the PR Sam! Including the detailed information in the PR description and your patience going through multiple rounds of review! It's great to have this feature and was a decent chunk of work.

By the way, just a note that when I submit PRs I generally squash merge the branch. So writing a PR with multiple commits is totally fine, if that makes life easier for you!