microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License

FRS: Data loss scenario: operations deleted from storage and not persisted in summaries are lost forever #5650

Closed hedasilv closed 1 year ago

hedasilv commented 3 years ago

In #4732, we introduced a TTL for ops stored in MongoDB. However, we did not update the summarization logic accordingly. As a result, when summaries do not happen, or do not complete successfully, for a period longer than the TTL, we may purge operations from MongoDB that were never incorporated into a summary. For example, we observed the following behavior:

  1. Clients connected to the document initially generate successful summaries.
  2. The ops TTL is set to 20 hours.
  3. At some point, all client summaries start to fail.
  4. Clients keep submitting other operations while the summaries fail.
  5. The summary failures continue for more than 20 hours. After the 20-hour mark, operations start being deleted from MongoDB.
  6. All clients leave the document, which initiates a service-side summary. The summary is considered successful, but its .protocol/attributes section indicates that the summary's sequence number is still stuck at the sequence number of the last successful client summary (from more than 20 hours earlier).
  7. When clients try to start a session and connect to the document, they go into a bad state because the document is corrupted.
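
The corruption in step 7 comes down to a gap between the summary's sequence number and the earliest op still in storage. Below is a minimal sketch of how such a gap could be detected. The `StoredOp` and `MissingRange` types and the `findMissingOpRange` function are hypothetical names made up for illustration; they are not part of Fluid Framework or FRS.

```typescript
// Hypothetical shapes for illustration only; not the Fluid Framework or FRS types.
interface StoredOp {
    sequenceNumber: number;
}

interface MissingRange {
    from: number; // first missing sequence number (inclusive)
    to: number; // last missing sequence number (inclusive)
}

/**
 * Returns the first range of sequence numbers covered neither by the summary
 * nor by the ops still in storage, or undefined when there is no gap.
 */
function findMissingOpRange(
    summarySequenceNumber: number,
    storedOps: StoredOp[],
    latestSequenceNumber: number,
): MissingRange | undefined {
    // Only ops newer than the summary matter; sort them by sequence number.
    const retained = storedOps
        .map((op) => op.sequenceNumber)
        .filter((seq) => seq > summarySequenceNumber)
        .sort((a, b) => a - b);

    let expected = summarySequenceNumber + 1;
    for (const seq of retained) {
        if (seq > expected) {
            // Ops between `expected` and `seq - 1` were purged and never summarized.
            return { from: expected, to: seq - 1 };
        }
        if (seq === expected) {
            expected = seq + 1;
        }
        // seq < expected means a duplicate entry; ignore it.
    }

    // Any remaining tail up to the latest known sequence number is also missing.
    return expected <= latestSequenceNumber
        ? { from: expected, to: latestSequenceNumber }
        : undefined;
}
```

For example, with a summary stuck at sequence number 100 and only ops 151 through 200 left in storage, the function reports ops 101 through 150 as missing. A client loading the document in that state cannot replay them.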

To Reproduce

Steps to reproduce the behavior: It is somewhat complicated to reproduce the situation locally, but we would need:

  1. A document with operations being generated for longer than the ops TTL.
  2. Summary failures lasting longer than the ops TTL (or no summaries happening for any other reason).

We could also try a very low ops TTL to increase the likelihood of the problem occurring.
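
The repro conditions can be approximated without MongoDB by using an in-memory model with a simulated clock and a low TTL. The sketch below is only a model of the scenario, not a test against the real service. `InMemoryOpStore`, `simulate`, and all parameters are hypothetical names, and the purge is a simplification of how a TTL expires documents.

```typescript
// Hypothetical in-memory model of an op store with a TTL; all names are illustrative.
interface TimedOp {
    sequenceNumber: number;
    createdAtMs: number;
}

class InMemoryOpStore {
    private ops: TimedOp[] = [];
    private readonly ttlMs: number;

    constructor(ttlMs: number) {
        this.ttlMs = ttlMs;
    }

    insert(op: TimedOp): void {
        this.ops.push(op);
    }

    // Mimics TTL expiry: drops every op at least as old as the TTL,
    // regardless of whether a summary already covers it.
    purgeExpired(nowMs: number): void {
        this.ops = this.ops.filter((op) => nowMs - op.createdAtMs < this.ttlMs);
    }

    earliestSequenceNumber(): number | undefined {
        if (this.ops.length === 0) {
            return undefined;
        }
        return Math.min(...this.ops.map((op) => op.sequenceNumber));
    }
}

interface SimulationResult {
    earliestRetained: number | undefined;
    opsLost: boolean;
}

// Generates ops at a fixed interval while no summary succeeds after
// `lastSummarySequenceNumber`, then checks whether unsummarized ops were purged.
function simulate(
    ttlMs: number,
    opIntervalMs: number,
    totalOps: number,
    lastSummarySequenceNumber: number,
): SimulationResult {
    const store = new InMemoryOpStore(ttlMs);
    let nowMs = 0;
    for (let seq = 1; seq <= totalOps; seq++) {
        store.insert({ sequenceNumber: seq, createdAtMs: nowMs });
        nowMs += opIntervalMs;
        store.purgeExpired(nowMs);
    }
    const earliestRetained = store.earliestSequenceNumber();
    const opsLost =
        earliestRetained === undefined
            ? totalOps > lastSummarySequenceNumber
            : earliestRetained > lastSummarySequenceNumber + 1;
    return { earliestRetained, opsLost };
}
```

Calling `simulate(1000, 100, 50, 10)` models a 1-second TTL with an op every 100 ms and a last successful summary at sequence number 10. Ops older than the TTL are purged even though no later summary covers them, so `opsLost` is true.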

Still, this type of unexpected behavior has already been observed by multiple partners using our service.

Expected behavior

Logs

[Screenshot: Screen Shot 2021-03-26 at 10 24 59 AM]

microsoft-github-policy-service[bot] commented 1 year ago

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!