microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License

Proposal: DDS summary representation should be changed to support retention policies #9545

Closed vladsud closed 2 years ago

vladsud commented 2 years ago

Applications built on Fluid Framework quite often need to adhere to various retention policies for content that is no longer in the document from the user's point of view. An example would be a paragraph deleted by a user from a document, where the application (or the ecosystem the app is part of) promises that all deleted content is gone from storage and no longer retrievable.

Any solution in this space will likely require the ability to summarize a file at rest to get rid of trailing ops. That could assume the collab window has been collapsed to empty, so that the summary does not include previous edits. But that is likely not sufficient, because:

  1. It is preferable for such a process to stay inexpensive and do only the minimum work, including not re-summarizing parts of the document that did not change since the previous summary.
  2. Solutions in this space likely want to optimize by not invoking such a process if the trailing ops contain only system ops (like join, summary, leave ops), since no document state changed since the last summary. Many apps may be designed to ensure a final summary, or at least increase the chance of one, before exit, thus reducing the need for the storage layer to run the summarizer process as a service.
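
The check in point 2 could be sketched roughly as follows. This is a minimal illustration, not Fluid Framework code: the `TrailingOp` shape, the `SYSTEM_OP_TYPES` set, and the op type strings are all assumptions made for the example.

```typescript
// Hypothetical op shape; only the `type` field matters for this check.
interface TrailingOp {
    sequenceNumber: number;
    type: string;
}

// Assumed labels for system ops that do not change document state.
// The real set of op types in a given service may differ.
const SYSTEM_OP_TYPES: ReadonlySet<string> = new Set(["join", "leave", "summarize"]);

// Returns true only if at least one trailing op could have changed
// document state, meaning an at-rest summary is worth running.
function needsAtRestSummary(trailingOps: readonly TrailingOp[]): boolean {
    return trailingOps.some((op) => !SYSTEM_OP_TYPES.has(op.type));
}
```

With a check like this, a storage-side job would skip documents whose tail consists only of join, leave, and summary traffic.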

As a result, the summarization format of DDSs needs to be structured so that:

  1. Some blobs track the latest state (at the moment of summary) and no history.
  2. Some blobs contain the extra info needed to reason about ops coming later in time, within the collab window (at the time of summary).

Basically, this is the reverse of what Sequence does today, where #1 is the state at the MSN (minimum sequence number) and #2 tracks changes from #1 into the future. Instead, the latter should look back in history.
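
To make the two buckets concrete, here is a toy sketch for a string-like DDS. Everything in it is hypothetical: `DdsSummary`, `RemovedRange`, `summarize`, and the blob names `body` and `removed` are invented for illustration and do not reflect the actual Sequence summary format.

```typescript
// Hypothetical two-bucket summary layout for a single DDS.
interface DdsSummary {
    // Bucket 1: latest state at the moment of summary, no history.
    state: Record<string, string>;
    // Bucket 2: extra info needed only while ops within the collab window
    // may still arrive. It looks back from the latest state.
    collabWindow: Record<string, string>;
}

// A range removed from the text, with the sequence number of the removal.
interface RemovedRange {
    position: number;
    content: string;
    removedSeq: number;
}

// Builds a summary where deleted content never appears in the state bucket.
// Removals at or below `minSeq` are outside the collab window and are dropped.
function summarize(text: string, removed: readonly RemovedRange[], minSeq: number): DdsSummary {
    const inWindow = removed.filter((r) => r.removedSeq > minSeq);
    return {
        state: { body: text },
        collabWindow: inWindow.length > 0 ? { removed: JSON.stringify(inWindow) } : {},
    };
}
```

The point of the split is that dropping `collabWindow` wholesale leaves a valid summary containing only what the user currently sees.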

This kind of design will allow us to mark blobs in the second bucket for deletion by the service when certain rules are met, such as:

  1. no more trailing ops other than system ops
  2. collab session is closed (no more ordering session for this document, thus collab window is empty).
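
A service-side rule of this kind might look like the sketch below. It assumes that both conditions must hold before purging, which the list above does not state explicitly. The `DocumentStatus` and `TwoBucketSummary` types, the `PURGE_SAFE_OP_TYPES` set, and both function names are hypothetical.

```typescript
// Hypothetical view of a document as seen by the storage service.
interface DocumentStatus {
    trailingOpTypes: readonly string[];
    sessionClosed: boolean;
}

// Hypothetical summary with a state bucket and a collab-window bucket.
interface TwoBucketSummary {
    state: Record<string, string>;
    collabWindow: Record<string, string>;
}

// Assumed labels for system ops; the real set may differ.
const PURGE_SAFE_OP_TYPES: ReadonlySet<string> = new Set(["join", "leave", "summarize"]);

// Assumption: rule 1 (only system ops trail the summary) and rule 2
// (the collab session is closed) must both be true.
function canPurgeCollabWindow(status: DocumentStatus): boolean {
    const onlySystemOps = status.trailingOpTypes.every((t) => PURGE_SAFE_OP_TYPES.has(t));
    return onlySystemOps && status.sessionClosed;
}

// Returns a summary without second-bucket blobs when purging is allowed,
// otherwise returns the summary unchanged.
function purgeIfAllowed(summary: TwoBucketSummary, status: DocumentStatus): TwoBucketSummary {
    return canPurgeCollabWindow(status)
        ? { state: summary.state, collabWindow: {} }
        : summary;
}
```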

vladsud commented 2 years ago

@anthony-murphy, @DLehenbauer, @taylorsw04 - FYI.

anthony-murphy commented 2 years ago

I think we can solve the sequence issue around catch-up ops. We have another snapshot format that we haven't rolled out, which stores deleted content inline and rehydrates the deleted segments rather than keeping the ops. We could actually change this to just store the length, since we don't actually need the deleted content. Search would need to know how to parse the new format, and it needs some stabilization before it would be safe to roll out. This won't solve the problem for other DDSes (if any exist with the problem), and it won't solve the issue of trailing ops outside the summary, i.e. ops with a sequence number greater than the summary's reference sequence number.
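
As a rough illustration of the "just store the length" idea, the sketch below serializes removed segments as length-only placeholders. The `Segment` and `SnapshotSegment` types and the `toSnapshot` function are hypothetical and are not the unreleased snapshot format mentioned above.

```typescript
// Hypothetical in-memory segment: removed segments still hold their text.
type Segment =
    | { kind: "live"; text: string }
    | { kind: "removed"; text: string; removedSeq: number };

// Hypothetical snapshot segment: removed segments keep only their length,
// which is enough to resolve positions but not to recover the content.
type SnapshotSegment =
    | { kind: "live"; text: string }
    | { kind: "removed"; length: number; removedSeq: number };

// Converts in-memory segments to their snapshot form, dropping deleted text.
function toSnapshot(segments: readonly Segment[]): SnapshotSegment[] {
    return segments.map((s): SnapshotSegment =>
        s.kind === "live"
            ? { kind: "live", text: s.text }
            : { kind: "removed", length: s.text.length, removedSeq: s.removedSeq },
    );
}
```

A loader could rehydrate each `removed` entry as a placeholder of the recorded length, so position math still works without the deleted text ever reaching storage.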

ghost commented 2 years ago

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!