Open jamespfennell opened 11 months ago
Updated in light of the submitted version of #110.
One additional complication for this work is that multiple feeds may try to insert the same entity concurrently, violating unique key constraints. For example, both the trip
and vehicle
feed start updates, both see that "vehicle 1" doesn't exist, and then both try to add it, guaranteeing one transaction fails and rolls back.
I think to fix this we would need some stronger consistency guarantees within the update transactions. The most granular mechanism I can think of would be a per-system write lock for the trip
and vehicle
tables (maybe using something like advisory locks).
Alternatively, updates could be scheduled in such a way that they don't overlap within a system.
Curious what your thoughts on above are!
Yeah that's definitely an issue. Also, I tried to write some code to implement this ticket I think it will be complex...
I've been thinking, should we just skip all entities that have IsEntityInMessage: false
? Basically what you did for the NYC buses with a configuration value, but just hardcoded in the code.
The original motivation for persisting IsEntityInMessage: false
vehicles was that I think I've observed for the NYC subway there are sometimes trip updates that have a vehicle descriptor attached but no vehicle position entity. In this case you get a vehicle with IsEntityInMessage: false
and my idea was that it would be still nice to persist that vehicle data. However I think the complexity is really high and on balance it's likely not worth it.
This sounds good to me, though I suppose we could also invert the option to be "opt in" to cover the subway vehicle case you mentioned (e.g., persistEntityReferences
or something similar).
Also, to avoid leaving the deprecated option in the proto definition, we could change Feed config parsing to allow unknown fields and simply remove it. The trade-off here is easier deprecation of short-lived features such as this vs. allowing typos to slip through potentially. Let me know your thoughts on this; I have a branch with the proposed change here: https://github.com/cedarbaum/transiter/commit/196758d2b254456ad24a9faa71916b1bf78cd237
Yeah the inverted option does sound good. It could come with a warning that vehicles and trips must be in the same feed, to avoid the races you mentioned. In this way the implementation would be simpler. For the moment I'm happy to just submit the branch I linked to, and if we want later we could add such an opt-in option.
Changing feed config parsing sounds great!! I didn't know we could do that!
Awesome, just sent a PR for the feed config parsing change: https://github.com/jamespfennell/transiter/pull/123/files
Also I think updateTrips
still needs to be modified to ignore partial entities by default. If we're OK to go ahead and just delete the old config option entirely, I can take care of fixing that as well.
Uhhhh right we do need to handle trips! That would be great.
Every GTFS realtime trip is optionally associated to a vehicle (and vice versa). Naively, this relationship would be represented by a nullable foreign key constraint, sayvehicle.trip_pk
. However in #110 when developing the first iteration of the vehicle feature, @cedarbaum discovered that this is non-trivial because of how the underlying GTFS library reports trips and vehicles.There is a problem in the case (like the NYC bus) when trips and vehicles are reported in different GTFS realtime feeds. The problem is that in the trips feed the GTFS library adds a vehicle "stub" (a small vehicle message tagged with
IsEntityInMessage: false
) and in the vehicles feed adds a similar trips stub. When parsing the trips feed Transiter tries to update the vehicle using the stub which is definitely wrong because the vehicle data is in the other feed. The stub only exists to communicate the trip-vehicle relationship.In #110, @cedarbaum solved this by adding a
only_process_full_entities
field to the feed configuration in which caseIsEntityInMessage: false
entities are skipped.I think though we can probably handle this case without explicit configuration. One idea I had for solving this is to persist the
IsEntityInMessage
field in the database for both trips and vehicles. Then in the GTFS realtime updater, we would skip updating a trip if all of the following conditions hold:IsEntityInMessage: false
IsEntityInMessage: true
One way to think about this is that when a real trip exists (i.e.
IsEntityInMessage: true
) there is a unique feed that owns that trip. Other feeds should not update that trip. We would do a similar thing for vehicles.