Closed vladsud closed 3 years ago
@anthony-murphy - FYI (can't put you as assignee, GitHub picker sucks)
In the case of the Chapter 2 document, the problems went away once the summary payload size became smaller. I did not measure the size, but it was likely 100 MB+ (due to cascading safe summaries, where each subsequent summary carried a bigger and bigger payload of ops).
I experienced this with the Chapter 2 document in the presence of https://github.com/microsoft/FluidFramework/issues/4068. But the issue is likely broader than that, i.e. it may apply to any summary.
The problem is that if the summarizer loses its socket connection, it starts over from scratch. This might work OK in normal cases, but if the loss of connectivity is caused by the summarizer's own work, then there is no hope that a retry will succeed.
The reasons why the summarization process may be long:
The actual reasons for a socket disconnect can be:
At scale, it's just a matter of time before we hit a combination of reasons from these two lists. Our summarization process should be resilient to these issues.
I think there has to be a mode where clients realize that we are losing the file, and summarizer clients then switch to a mode where they keep reconnecting (possibly with many summarizer clients on the wire) and attempt to push the summary through. I do not know the implications here re: the ref seq number, as a reconnected summarizer is likely outside the collab window and can't issue a summaryOp without some more work / thinking here.
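To make the idea concrete, here is a minimal sketch of the "keep reconnecting, retry the same payload" behavior. This is not the actual FluidFramework summarizer API — `Connection`, `connect`, and `submitSummary` are hypothetical names for illustration; the key point is that on disconnect we retry submitting the *already-generated* summary instead of restarting summarization from scratch:

```typescript
// Hypothetical interfaces for illustration only (not FluidFramework APIs).
interface Connection {
  // Resolves true if the service accepted the summary, false if rejected.
  submitSummary(payload: string): Promise<boolean>;
}

type Connect = () => Promise<Connection>;

// Reconnect and retry pushing the same summary payload, rather than
// regenerating it on every disconnect. Returns true once accepted.
async function pushSummaryWithRetry(
  connect: Connect,
  payload: string,
  maxAttempts = 5,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const conn = await connect();
      if (await conn.submitSummary(payload)) {
        return true; // summary accepted by the service
      }
    } catch {
      // Socket dropped while connecting or mid-submit; loop and reconnect.
    }
  }
  return false; // give up after maxAttempts reconnects
}
```

The real design would still need to resolve the ref seq number question noted above: a reconnected summarizer may be outside the collab window, so simply resubmitting the old payload may not be acceptable to the service without protocol changes.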