microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License

Surface Azure Clients APIs supporting recovery process #9651

Closed ssimic2 closed 2 years ago

ssimic2 commented 2 years ago

### Overview

As discussed in #9442, the core goal of this API is to provide the ability to extract document data from a corrupted container. For V1, this API will not address what the user does with the recovered data, nor the ability to roll back existing documents.

### Features / APIs

We will expose two new APIs on the Azure client as part of the v1.0 effort.

1. **Ability to retrieve the history/versions of a particular document.** We can already retrieve document versions through the document service. The gap is surfacing that versioning data through the Azure client.

       interface ContainerVersion {
           id: string;
           date: string;
       }

       public async getContainerVersions(id: string): Promise<ContainerVersion[]>

2. **Ability to recreate a container from a particular version of another document.**

       public async reCreateContainer(
           containerSchema: ContainerSchema,
           documentId: string,
           version: string,
       ): Promise<{ container: IFluidContainer; services: AzureContainerServices }>



### Addressing the Needs

- By allowing the user to query multiple versions/snapshots of the document, we increase the likelihood that the document data can be recovered. This is especially valuable when the latest snapshot is the source of the corruption. In that case, the user can query older versions of the document.

- Within the Azure API we could attempt to cycle through multiple versions/snapshots of the document while trying to converge on the "valid" one. However, successfully loading a snapshot does not imply a valid document, because DDSes are loaded only when accessed. Instead, we let the user complete the load and decide whether all relevant data was extracted or further attempts are needed.
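
The user-driven version walk described above could look roughly like the sketch below. It assumes the two proposed methods keep the shapes given in this issue; `RecoveryClient`, `recoverFromVersions`, and the `extract` callback are hypothetical names, generics stand in for `ContainerSchema` and the `{ container, services }` result, and `date` is assumed to be parseable by `Date.parse`.

```typescript
// Hypothetical sketch: shapes follow the proposal in this issue, not a shipped API.
interface ContainerVersion {
    id: string;
    date: string; // assumed parseable by Date.parse (e.g. ISO 8601)
}

// Minimal stand-in for the proposed Azure client surface. TContainer stands in
// for the { container, services } result of reCreateContainer.
interface RecoveryClient<TSchema, TContainer> {
    getContainerVersions(id: string): Promise<ContainerVersion[]>;
    reCreateContainer(schema: TSchema, documentId: string, version: string): Promise<TContainer>;
}

// Walk versions from newest to oldest. `extract` is supplied by the app: it should
// access every DDS it cares about (DDSes load lazily) and throw if the data is unusable.
async function recoverFromVersions<TSchema, TContainer, TData>(
    client: RecoveryClient<TSchema, TContainer>,
    schema: TSchema,
    documentId: string,
    extract: (container: TContainer) => Promise<TData>,
): Promise<{ version: ContainerVersion; data: TData } | undefined> {
    const versions = await client.getContainerVersions(documentId);
    const newestFirst = [...versions].sort((a, b) => Date.parse(b.date) - Date.parse(a.date));
    for (const version of newestFirst) {
        try {
            const container = await client.reCreateContainer(schema, documentId, version.id);
            return { version, data: await extract(container) };
        } catch {
            // This snapshot could not be loaded or validated; fall back to an older one.
        }
    }
    return undefined; // no version yielded usable data
}
```

Keeping validation in the caller-supplied `extract` mirrors the point above: only the app knows which DDSes matter, so only the app can say whether a given version was good enough.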

### Data Recovery Scenarios

We are not assuming an exact usage of this API, but here are some general thoughts.

(1) A "corrupted" container can be a transient problem, so not every corrupted document needs recovery. Simply restarting the session may solve the problem.

(2) Given that every client should reach the same conclusion about the "recovered" state, recovery could be attempted by all clients, with the first one to finish winning. This solution implies that multiple "throw away" documents are created, but since recovery is not a frequent operation, this should not have a significant impact on cost.

(3) Option (2) could be improved if one client were able to effectively end the collaboration session for all clients and kick off recovery. We have separate efforts around this feature. However, given that we are dealing with a corrupted document, we will likely have limitations here.
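
Scenario (2) could be sketched as follows: each client recovers into its own new document, then tries to publish that document's id through a set-if-absent step. The first publish wins, and every other client adopts the winner's id and abandons its throw-away document. This issue does not say where the winning id would be stored, so `InMemoryRecoveryRegistry`, `claim`, and `recoverFirstWins` are hypothetical, and the in-memory map only stands in for whatever shared store an app would actually use.

```typescript
// Hypothetical sketch of "all clients attempt recovery, first one wins".
// Stand-in for an app-owned shared store with an atomic set-if-absent operation.
class InMemoryRecoveryRegistry {
    private readonly winners = new Map<string, string>();

    // Records recoveredId for corruptedId unless a winner already exists.
    // Always returns the winning id.
    claim(corruptedId: string, recoveredId: string): string {
        const existing = this.winners.get(corruptedId);
        if (existing !== undefined) {
            return existing;
        }
        this.winners.set(corruptedId, recoveredId);
        return recoveredId;
    }
}

// `recover` creates a new document from the corrupted one and resolves to its id
// (for example by recreating a container from an older version).
async function recoverFirstWins(
    registry: InMemoryRecoveryRegistry,
    corruptedId: string,
    recover: () => Promise<string>,
): Promise<{ winnerId: string; discardedId?: string }> {
    const myId = await recover();
    const winnerId = registry.claim(corruptedId, myId);
    // A losing client reports its own document as the throw-away one.
    return winnerId === myId ? { winnerId } : { winnerId, discardedId: myId };
}
```

A real deployment would need the claim step to be atomic across clients, which this single-process stand-in does not demonstrate.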

(More to follow through examples)

### Extended Goals

We will use the new APIs for recovery purposes, but down the line we can also consider them for:
- Arbitrary rollbacks.
- General copying of documents.

Use-cases deck:
https://microsoft-my.sharepoint-df.com/:p:/p/sashasimic/EXppEEFMYP5IuM9bDya5-rcBh68ICLwaS_O2FY0NRur9NA
ssimic2 commented 2 years ago

@skylerjokiel @ChumpChief @heliocliu would love to hear your feedback on this ticket.

vladsud commented 2 years ago

This flow requires storage to support creation of files from a summary with an arbitrary sequence number. Today, SPO supports only creation of files with sequence # = 0. I'm not sure about FRS, but I'd think it's the same.

Similarly, all the DDSes today assume that detached file creation starts with zero. There might be other differences in state. I'm sure complex DDSes like Sequence will not be happy with this flow. It would be great to understand the full design here.

ssimic2 commented 2 years ago

> This flow requires storage to support creation of files from a summary with an arbitrary sequence number. Today, SPO supports only creation of files with sequence # = 0. I'm not sure about FRS, but I'd think it's the same.
>
> Similarly, all the DDSes today assume that detached file creation starts with zero. There might be other differences in state. I'm sure complex DDSes like Sequence will not be happy with this flow. It would be great to understand the full design here.

I have moved the design considerations to https://github.com/microsoft/FluidFramework/issues/9442, which covers these two topics as well. On the call today, I'd like us to compare options (1) and (2) against (3) as listed there, and discuss any general constraints and principles we don't want to violate here. It may not be a big leap for DDSes to support a non-zero sequence number. I'll share more on Sequence in particular today, but some of these observations will apply to other DDSes.