jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License
55 stars 6 forks source link

Automatically handle entities irrespective of `IsEntityInMessage` #114

Open jamespfennell opened 11 months ago

jamespfennell commented 11 months ago

Every GTFS realtime trip is optionally associated to a vehicle (and vice versa). Naively, this relationship would be represented by a nullable foreign key constraint, say vehicle.trip_pk. However in #110 when developing the first iteration of the vehicle feature, @cedarbaum discovered that this is non-trivial because of how the underlying GTFS library reports trips and vehicles.

There is a problem in the case (like the NYC bus) when trips and vehicles are reported in different GTFS realtime feeds. The problem is that in the trips feed the GTFS library adds a vehicle "stub" (a small vehicle message tagged with IsEntityInMessage: false) and in the vehicles feed adds a similar trips stub. When parsing the trips feed Transiter tries to update the vehicle using the stub which is definitely wrong because the vehicle data is in the other feed. The stub only exists to communicate the trip-vehicle relationship.

In #110, @cedarbaum solved this by adding a only_process_full_entities field to the feed configuration in which case IsEntityInMessage: false entities are skipped.

I think though we can probably handle this case without explicit configuration. One idea I had for solving this is to persist the IsEntityInMessage field in the database for both trips and vehicles. Then in the GTFS realtime updater, we would skip updating a trip if all of the following conditions hold:

One way to think about this is that when a real trip exists (i.e. IsEntityInMessage: true) there is a unique feed that owns that trip. Other feeds should not update that trip. We would do a similar thing for vehicles.

jamespfennell commented 11 months ago

Updated in light of the submitted version of #110.

cedarbaum commented 11 months ago

One additional complication for this work is that multiple feeds may try to insert the same entity concurrently, violating unique key constraints. For example, both the trip and vehicle feed start updates, both see that "vehicle 1" doesn't exist, and then both try to add it, guaranteeing one transaction fails and rolls back.

I think to fix this we would need some stronger consistency guarantees within the update transactions. The most granular mechanism I can think of would be a per-system write lock for the trip and vehicle tables (maybe using something like advisory locks).

Alternatively, updates could be scheduled in such a way that they don't overlap within a system.

Curious what your thoughts on above are!

jamespfennell commented 11 months ago

Yeah that's definitely an issue. Also, I tried to write some code to implement this ticket I think it will be complex...

I've been thinking, should we just skip all entities that have IsEntityInMessage: false? Basically what you did for the NYC buses with a configuration value, but just hardcoded in the code.

The original motivation for persisting IsEntityInMessage: false vehicles was that I think I've observed for the NYC subway there are sometimes trip updates that have a vehicle descriptor attached but no vehicle position entity. In this case you get a vehicle with IsEntityInMessage: false and my idea was that it would be still nice to persist that vehicle data. However I think the complexity is really high and on balance it's likely not worth it.

jamespfennell commented 11 months ago

Basically this: https://github.com/jamespfennell/transiter/compare/only-persist-full-entities?expand=1

cedarbaum commented 11 months ago

This sounds good to me, though I suppose we could also invert the option to be "opt in" to cover the subway vehicle case you mentioned (e.g., persistEntityReferences or something similar).

Also, to avoid leaving the deprecated option in the proto definition, we could change Feed config parsing to allow unknown fields and simply remove it. The trade-off here is easier deprecation of short-lived features such as this vs. allowing typos to slip through potentially. Let me know your thoughts on this; I have a branch with the proposed change here: https://github.com/cedarbaum/transiter/commit/196758d2b254456ad24a9faa71916b1bf78cd237

jamespfennell commented 11 months ago

Yeah the inverted option does sound good. It could come with a warning that vehicles and trips must be in the same feed, to avoid the races you mentioned. In this way the implementation would be simpler. For the moment I'm happy to just submit the branch I linked to, and if we want later we could add such an opt-in option.

Changing feed config parsing sounds great!! I didn't know we could do that!

cedarbaum commented 11 months ago

Awesome, just sent a PR for the feed config parsing change: https://github.com/jamespfennell/transiter/pull/123/files

Also I think updateTrips still needs to be modified to ignore partial entities by default. If we're OK to go ahead and just delete the old config option entirely, I can take care of fixing that as well.

jamespfennell commented 11 months ago

Uhhhh right we do need to handle trips! That would be great.