Realtime update snapshots: Documentation / architecture

abyrd commented 9 months ago

When discussing #5364 we realized that not everyone is familiar with how realtime updates are applied, how and when snapshots are made, and how and when they are cleaned up. Everyone would like to get these details down in diagrams and text rather than keeping them only in our heads.

@t2gran would like to have a diagram that shows:

How snapshots relate to transit routing model
How updaters interact with snapshots
Snapshot lifecycle

@leonardehrenfried mentioned the "transit layer" and how it differs from, or relates to, the snapshot. @t2gran says that when the Raptor transit data provider was implemented, the example transit data source came from R5, where it is called TransitLayer.

Let's accumulate open questions and observations about realtime updating and snapshots on this GitHub issue, and we'll eventually consolidate them into a document (Google doc or just markdown docs and diagrams within the OTP repo).

How updaters are interleaved or scheduled (single/multiple threads) needs to be clear.

Thomas mentioned the "4+1 view" which implies that the concurrency should not be explained in the same diagram as the flow of messages between components (or the class diagram / data model for that matter).

@leonardehrenfried points out some code that he wrote, where it wasn't clear why it was necessary. https://github.com/opentripplanner/OpenTripPlanner/blob/c7603c9b2b2fe7dfdefb32c767a269ca486df691/src/main/java/org/opentripplanner/updater/vehicle_position/VehiclePositionPatternMatcher.java#L325-L337

Realtime updated timetables contain only some of the information from scheduled timetables. This comes from circumstances during the original realtime implementation: only one OTP deployment had streaming realtime updates, so most of the time this feature was switched off. The updates were patches on top of the scheduled timetable.

@t2gran points out an existing document on refactoring the transit data model (see #4002). Work has started on this in a branch and some preparatory PRs are merged.
Much of the Entur team is comfortable with draft v31 (https://github.com/opentripplanner/OpenTripPlanner/issues/4002#issuecomment-1117042868) as the basis for the new transit model.

Enforcing a copy-on-write strategy by encapsulating all the fields that can be updated in the transit data will be quite verbose in Java, as it must be implemented field by field for every part of the transit data model. Years ago, manually enforcing the policies on how updaters work and when fields should be copied seemed no more error-prone than building this encapsulation. But for a long-term project with many developers, the investment in encapsulation and policy enforcement is now worth it. Also, at first the format of realtime messages and their associated functionality was totally new and subject to change. It is much more stable now, so the effort invested is less likely to be wasted.
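To make the "field by field" cost concrete, here is a minimal sketch of what enforced copy-on-write could look like for one small value type. The names `TripTimes`, `Builder`, `copyOf` and `withArrivalDelay` are hypothetical and chosen for illustration only; this is not the actual OTP transit model or the design from #4002.

```java
public class CopyOnWriteTripTimes {

    /** Hypothetical immutable value type; not OTP's actual class. */
    static final class TripTimes {
        private final String tripId;
        private final int[] arrivalDelaysSeconds;

        TripTimes(String tripId, int[] arrivalDelaysSeconds) {
            this.tripId = tripId;
            // Defensive copy so the caller cannot mutate our state afterwards.
            this.arrivalDelaysSeconds = arrivalDelaysSeconds.clone();
        }

        String tripId() {
            return tripId;
        }

        int arrivalDelay(int stopIndex) {
            return arrivalDelaysSeconds[stopIndex];
        }

        /** The only way to "change" an instance: copy its fields into a builder. */
        Builder copyOf() {
            return new Builder(this);
        }
    }

    /** Every updatable field needs its own copy step and its own setter here. */
    static final class Builder {
        private final String tripId;
        private final int[] arrivalDelaysSeconds;

        private Builder(TripTimes original) {
            this.tripId = original.tripId;
            this.arrivalDelaysSeconds = original.arrivalDelaysSeconds.clone();
        }

        Builder withArrivalDelay(int stopIndex, int delaySeconds) {
            arrivalDelaysSeconds[stopIndex] = delaySeconds;
            return this;
        }

        TripTimes build() {
            return new TripTimes(tripId, arrivalDelaysSeconds);
        }
    }

    public static void main(String[] args) {
        TripTimes scheduled = new TripTimes("trip-1", new int[] {0, 0, 0});
        TripTimes updated = scheduled.copyOf().withArrivalDelay(1, 120).build();
        System.out.println(scheduled.arrivalDelay(1)); // the original is untouched
        System.out.println(updated.arrivalDelay(1));   // only the copy carries the delay
    }
}
```

Because there are no setters on `TripTimes` itself, an updater cannot forget to copy before writing; the compiler enforces the policy that used to be enforced by convention.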

Some may be skeptical of the encapsulation, enforcement, and builders, but after talking through the reasons and advantages the case becomes convincing.

@vpaturet comments that in production one can see concurrent modification exceptions due to realtime updates. @t2gran points out that some things are not protected against concurrent access, and that people were not aware of this. For example, indexes were not being updated and were not protected in the same way as timetables. The index is not subject to snapshot management like timetables are.
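One way to picture the index problem: if the index is derived from the same data as the timetables and published together with them in a single immutable object, readers can never see one without the other. The sketch below assumes a single writer thread and uses hypothetical names (`Snapshot`, `routeByTrip`, `tripsByRoute`, `addTrip`); it illustrates the idea and does not describe how OTP currently manages snapshots.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotPublishing {

    /** Hypothetical immutable snapshot: primary data plus the index derived from it. */
    static final class Snapshot {
        final Map<String, String> routeByTrip;
        final Map<String, Set<String>> tripsByRoute;

        Snapshot(Map<String, String> routeByTrip) {
            this.routeByTrip = Map.copyOf(routeByTrip);
            // Build the index from the same data so the two can never disagree.
            Map<String, Set<String>> index = new HashMap<>();
            for (Map.Entry<String, String> e : this.routeByTrip.entrySet()) {
                index.computeIfAbsent(e.getValue(), k -> new HashSet<>()).add(e.getKey());
            }
            index.replaceAll((route, trips) -> Set.copyOf(trips));
            this.tripsByRoute = Map.copyOf(index);
        }
    }

    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot(Map.of()));

    /** Readers call this once per request and use only the returned object. */
    Snapshot snapshot() {
        return current.get();
    }

    /** Writer side. Assumes a single writer thread, so a plain set() is enough. */
    void addTrip(String tripId, String routeId) {
        Map<String, String> copy = new HashMap<>(current.get().routeByTrip);
        copy.put(tripId, routeId);
        current.set(new Snapshot(copy));
    }

    public static void main(String[] args) {
        SnapshotPublishing holder = new SnapshotPublishing();
        holder.addTrip("trip-1", "route-A");
        Snapshot seenByRequest = holder.snapshot();
        holder.addTrip("trip-2", "route-A"); // published after the request started
        System.out.println(seenByRequest.tripsByRoute.get("route-A").size());
        System.out.println(holder.snapshot().tripsByRoute.get("route-A").size());
    }
}
```

Copying the whole map on every update is wasteful and only done here to keep the sketch short. The point is that readers never iterate a collection that a writer is mutating, which is the situation that produces `ConcurrentModificationException`.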

Realtime snapshots are not guaranteed to continue to exist across API calls. Clients must be aware that IDs found in one request may not exist in a subsequent request. Supporting transactions across several API calls is conceivable, but there are no plans to support it (subsequent requests would need to go to the same router node), and we should be clear about the lack of support for this.

Data consistency: updating spatial indexes concurrently will not be solved by the new transit model. GBFS continuously updates spatial indexes with no concurrency management. It seems possible to implement memory-efficient multi-version concurrency control on the spatial indexes, with the relevant add/remove methods encapsulating the logic so updaters don't need to know how it works.
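As a rough illustration of the multi-version idea, the sketch below tags each entry with the version in which it was added and removed, and hides that bookkeeping behind `add`, `remove` and `commit`. Everything here is an assumption made for illustration: the class and method names are hypothetical, a single writer thread is assumed, and a linear scan over a queue stands in for a real spatial structure such as a grid or tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

public class VersionedPointIndex {

    private static final class Entry {
        final String id;
        final double lat;
        final double lon;
        final long addedIn;
        volatile long removedIn = Long.MAX_VALUE;

        Entry(String id, double lat, double lon, long addedIn) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
            this.addedIn = addedIn;
        }
    }

    // Weakly consistent iteration: readers never get ConcurrentModificationException.
    private final Queue<Entry> entries = new ConcurrentLinkedQueue<>();
    private final AtomicLong committed = new AtomicLong(0);

    /** Readers capture this once per request and pass it to every query. */
    public long currentVersion() {
        return committed.get();
    }

    /** Writer only: the new entry becomes visible after the next commit(). */
    public void add(String id, double lat, double lon) {
        entries.add(new Entry(id, lat, lon, committed.get() + 1));
    }

    /** Writer only: the entry stays visible to readers of older versions. */
    public void remove(String id) {
        long next = committed.get() + 1;
        for (Entry e : entries) {
            if (e.id.equals(id) && e.removedIn == Long.MAX_VALUE) {
                e.removedIn = next;
            }
        }
    }

    /** Writer only: makes all changes since the last commit visible at once. */
    public void commit() {
        committed.incrementAndGet();
    }

    /** Writer only: drops entries that no reader at or after this version can see. */
    public void prune(long oldestReaderVersion) {
        entries.removeIf(e -> e.removedIn <= oldestReaderVersion);
    }

    /** Returns ids inside the bounding box as they existed at the given version. */
    public List<String> query(long version, double minLat, double maxLat,
                              double minLon, double maxLon) {
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            boolean visible = e.addedIn <= version && version < e.removedIn;
            boolean inside = e.lat >= minLat && e.lat <= maxLat
                && e.lon >= minLon && e.lon <= maxLon;
            if (visible && inside) {
                result.add(e.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        VersionedPointIndex index = new VersionedPointIndex();
        index.add("bike-1", 59.91, 10.75);
        index.commit();
        long requestVersion = index.currentVersion();

        // The updater moves the bike while the request is still running.
        index.remove("bike-1");
        index.add("bike-1", 60.39, 5.32);
        index.commit();

        System.out.println(index.query(requestVersion, 59.0, 60.0, 10.0, 11.0));
        System.out.println(index.query(index.currentVersion(), 59.0, 60.0, 10.0, 11.0));
    }
}
```

Only changed entries cost extra memory, and updaters call the same three methods regardless of how visibility is tracked. Knowing which version is safe to pass to `prune` requires tracking active readers, which this sketch leaves out.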

Do we really need consistency of the spatial index over a whole request or transaction? Probably not: we don't mind getting some outdated data as long as it doesn't cause exceptions. Copy-on-write or readers-writer locks, or what? COW uses more memory but is less prone to deadlocks, slow-joiner problems, and misunderstandings; in general we want to use the simplest synchronization approach possible. COW of a list containing e.g. 500 bicycles is still "wasting" less memory than many things we do during normal routing. And transit schedule updates contain a lot more data and are applied with a similar copy-on-write strategy. @vpaturet points out that we see far more transit data concurrency errors than GBFS / spatial index updater concurrency errors.
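For a small list like the bicycles example, the JDK's `java.util.concurrent.CopyOnWriteArrayList` already gives the "slightly outdated but never an exception" behaviour. The sketch below provokes the failure deterministically by writing to the list in the middle of an iteration, which stands in for an updater racing a reader; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowVehicleList {

    /** Iterates the list while adding to it, as an updater racing a reader might. */
    static int iterateWhileUpdating(List<String> bikes) {
        int seen = 0;
        for (String bike : bikes) {
            bikes.add("copy-of-" + bike); // stands in for a concurrent updater write
            seen++;
        }
        return seen;
    }

    public static void main(String[] args) {
        // The iterator works on the array as it was when iteration started.
        List<String> cow = new CopyOnWriteArrayList<>(List.of("bike-1", "bike-2"));
        System.out.println(iterateWhileUpdating(cow) + " seen, " + cow.size() + " stored");

        // A plain ArrayList detects the modification and fails fast.
        List<String> plain = new ArrayList<>(List.of("bike-1", "bike-2"));
        try {
            iterateWhileUpdating(plain);
        } catch (ConcurrentModificationException e) {
            System.out.println("ArrayList: ConcurrentModificationException");
        }
    }
}
```

The reader of the copy-on-write list sees the two bikes that existed when it started and finishes normally, while each write pays for a full copy of the backing array. For a few hundred entries that cost is small, which is the trade-off described above.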

Paging via re-issuing requests with new time windows is not really adversely affected by using different snapshots on subsequent requests.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days

opentripplanner / OpenTripPlanner

Realtime update snapshots: Documentation / architecture #5365