Summarizer may not be able to summarize

vladsud commented 4 years ago

I hit this with the Chapter 2 document in the presence of https://github.com/microsoft/FluidFramework/issues/4068, but the issue is likely broader than that, i.e. it may apply to any summary.

The problem is that if the summarizer loses its socket connection, it starts over. This might work OK in normal cases, but if the loss of connectivity is caused by the summarizer's own work, then a retry has no hope of succeeding.

Reasons why the summarizing process may take a long time:

It takes too long for the summarizer to rehydrate all data stores / DDSs, download and process all the blobs and ops, and generate the new payload.
Uploading the summary (both the network transfer and SPO processing / ack) may take a substantial amount of time, especially on slow networks.

The actual reasons for the socket disconnect can be:

While summarizing, the summarizer may not yield often enough to let the socket process ping/pong messages, so the socket is closed (the timeout is currently 30 seconds). (Related: #3969)
The summarizer client does not issue any ops during this time. Storage disconnects clients that stay silent / do not move the collab window forward. I believe that is currently 3 minutes on SPO.
The connection hits an idle timeout. I'm not sure "idle" is the actual signal; I believe SPO disconnects every socket every 15 minutes.
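
To illustrate the first disconnect reason, here is a minimal sketch of how long-running work could yield to the event loop periodically so that socket ping/pong handlers get a chance to run. `yieldToEventLoop` and `processInSlices` are hypothetical names made up for this example, not Fluid Framework APIs, and the 50 ms slice budget is an arbitrary assumption.

```typescript
// Hypothetical helper (not a Fluid Framework API): resolves on a later
// macrotask, which gives pending socket I/O callbacks a chance to run.
const yieldToEventLoop = (): Promise<void> =>
    new Promise<void>((resolve) => setTimeout(resolve, 0));

// Processes items one by one, yielding whenever the current slice has been
// running for at least maxSliceMs. Returns how many times it yielded.
// The clock is injectable so the behavior can be checked deterministically.
async function processInSlices<T>(
    items: readonly T[],
    processItem: (item: T) => void,
    maxSliceMs: number = 50, // assumed budget, not a measured value
    now: () => number = () => Date.now(),
): Promise<number> {
    let yields = 0;
    let sliceStart = now();
    for (const item of items) {
        processItem(item);
        if (now() - sliceStart >= maxSliceMs) {
            await yieldToEventLoop();
            yields++;
            sliceStart = now();
        }
    }
    return yields;
}
```

This only helps if each individual `processItem` call is short; a single blocking step longer than the ping timeout would still starve the socket.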

At scale, it's just a matter of time before we hit a combination of reasons from these two lists. Our summarization process should somehow be resilient to these issues.

I think there has to be a mode where clients realize that we are losing the file, and then summarizer clients move to a mode where they keep reconnecting (possibly with many summarizer clients on the wire) but still attempt to push the summary through. I do not know what the implications are here regarding the reference sequence number, as a reconnected summarizer is likely outside of the collab window and can't issue a summaryOp without some more work / thinking.
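
A rough sketch of the control flow for such a "keep trying" mode might look like the following. Everything here is an assumption for illustration: `pushSummaryWithRetry` and the `SummaryAttempt` callback are hypothetical, the attempt count and backoff values are arbitrary, and the sketch does not address the reference sequence number question raised above.

```typescript
// Hypothetical callback: reconnects if needed, tries to submit a summary,
// and resolves to true once the summary has been acked.
type SummaryAttempt = () => Promise<boolean>;

// Retries the attempt with exponential backoff instead of giving up after
// the first disconnect. The sleep function is injectable for testing.
async function pushSummaryWithRetry(
    attempt: SummaryAttempt,
    maxAttempts: number = 5,    // assumed limit
    baseDelayMs: number = 1000, // assumed initial backoff
    sleep: (ms: number) => Promise<void> =
        (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<{ succeeded: boolean; attempts: number }> {
    for (let i = 1; i <= maxAttempts; i++) {
        let acked = false;
        try {
            acked = await attempt();
        } catch {
            // Treat a thrown error (e.g. a lost connection) as a failed attempt.
            acked = false;
        }
        if (acked) {
            return { succeeded: true, attempts: i };
        }
        if (i < maxAttempts) {
            await sleep(baseDelayMs * 2 ** (i - 1)); // 1x, 2x, 4x, ...
        }
    }
    return { succeeded: false, attempts: maxAttempts };
}
```

A plain retry like this only helps if a later attempt can finish faster than the disconnect timers fire, which is exactly the case the issue says cannot be relied on today.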

vladsud commented 4 years ago

@anthony-murphy - FYI (can't add you as assignee, the GitHub picker sucks)

vladsud commented 4 years ago

In the case of the Chapter 2 document, the problems went away once the summary payload became smaller. I did not measure the size, but it was likely 100 MB+ (due to cascading safe summaries, where each successive summary carried a bigger and bigger payload of ops).
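
For a sense of scale, a back-of-the-envelope calculation shows how a payload of that size can outlast the roughly 3-minute silence window mentioned earlier. The bandwidth figures are illustrative assumptions, not measurements, and `uploadSeconds` / `exceedsSilenceWindow` are hypothetical helpers.

```typescript
// Time to upload a payload at a given sustained bandwidth,
// ignoring protocol overhead and server-side processing.
function uploadSeconds(payloadBytes: number, megabitsPerSecond: number): number {
    return (payloadBytes * 8) / (megabitsPerSecond * 1_000_000);
}

// 3 minutes, per the "I believe" estimate in the issue; treat as an assumption.
const SILENCE_WINDOW_SECONDS = 180;

function exceedsSilenceWindow(payloadBytes: number, megabitsPerSecond: number): boolean {
    return uploadSeconds(payloadBytes, megabitsPerSecond) > SILENCE_WINDOW_SECONDS;
}

const payloadBytes = 100 * 1_000_000; // 100 MB (decimal megabytes)

// At an assumed 4 Mbps uplink the upload alone takes 200 s, longer than 180 s.
console.log(uploadSeconds(payloadBytes, 4), exceedsSilenceWindow(payloadBytes, 4));
// At an assumed 50 Mbps uplink it takes 16 s, well within the window.
console.log(uploadSeconds(payloadBytes, 50), exceedsSilenceWindow(payloadBytes, 50));
```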

Summarizer may not be able to summarize #4069