microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License
4.73k stars 532 forks source link

For summarizing, fetch snapshot fails when the snapshot is too large #8962

Closed andre4i closed 2 years ago

andre4i commented 2 years ago

https://portal.microsofticm.com/imp/v3/incidents/details/286391295/home

There is a behavior captured in the above incident which seems to indicate that what is specified here does not hold and the size limit is actually taken into consideration further down in the code when this.hostPolicy.summarizerClient is enabled.

We need to confirm that the limit does not apply to the summarizer (preferably adding some tests for the specific scenario). And if it does not apply, we need to follow how the summarizerClient flag is managed.

Update

Snapshot limits should be disabled when requests are coming from a summarizer. See: https://github.com/microsoft/FluidFramework/blob/51d2d7b5541e2ad344d00509ff127f58fd792ea9/packages/drivers/odsp-driver/src/odspDocumentStorageManager.ts#L585

The property is set by the OdspDocumentService which it turn reads it from the odspResolvedUrl.summarizer property: https://github.com/microsoft/FluidFramework/blob/51d2d7b5541e2ad344d00509ff127f58fd792ea9/packages/drivers/odsp-driver/src/odspDocumentService.ts#L149

However, the resolved url is not set properly.

The summarizer does try to set it https://github.com/microsoft/FluidFramework/blob/51d2d7b5541e2ad344d00509ff127f58fd792ea9/packages/runtime/container-runtime/src/summarizer.ts#L126 but then the relative loader will just reuse the container's url: https://github.com/microsoft/FluidFramework/blob/51d2d7b5541e2ad344d00509ff127f58fd792ea9/packages/loader/container-loader/src/loader.ts#L77

This needs to be consistent and fixed.

As an implementation hint, it would probably make more sense to rely on the client type (summarizer vs interactive) and have the relative loader propagate the client type (for all requests). Then, the OdspDocumentService can adjust policy accordingly to the client type.

andre4i commented 2 years ago

@vladsud A good question was raised today during our sync-up about this snapshot size behavior difference between summarizers and non-summarizers and I think you'd be able to shed some light on this.

Assuming we are able to send a large snapshot and this bug is fixed. Only the client who summarizes is able to download it, right? Is this a resilience mechanism to hope for the garbage collector to purge the snapshot which would afterwards (hopefully) allow all other clients to download it? In other words, why are we allowing one client to download the snapshot when nobody else can?

vladsud commented 2 years ago

Teams client wanted it to better control bandwidth consumption. Their take - if files are large, they want to render link in chat and let users decide if they want to follow link or not (and thus pay the cost or not). And yes, other clients may open it just fine.

Summarizer has to be excluded, such that if file grew 1 byte over the limit in this window (but was opened in Teams client before limit was increased), we are not leaving file unsummarized.

vladsud commented 2 years ago

@chensixx , any chance you can take a look next week? Sorry to dump on you the last week of February, but that issue is rather important, and pretty well understood. It was raised by Teams team through ICM, and somehow priority of this item (or rather the fact that it's well understood) escaped me.

Please take a look. If you have any questions, I, Andrei or Jatin can help with it.

The core of the issue is that we lose property tracking that it's a summarizer (as it's passed from loader headers to resolved URI to driver), and as result driver does not reset mds property that blocks large summaries.

vladsud commented 2 years ago

@chensixx, @andre4i, @jatgarg , can we prioritize this? This is rather bit reliability issue for Teams, I believe.

chensixx commented 2 years ago

question: when @andre4i said we should let relative loader propagate client type, doesn't request already have the clientType in the headers? am I reading this correctly?

chensixx commented 2 years ago

fix is in, still need to write unit test

chensixx commented 2 years ago

currently manually testing loader is passing the correct type to odsp driver, and for summarizer client, the mds limit and timeout are undefined.

chensixx commented 2 years ago

manual test completed, hostpolicy is set correctly