| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:Summarizer:Running:FailToSummarize | | 443 |
| fluid:telemetry:Summarizer:Running:SummaryAckWaitTimeout | | 432 |
| fluid:telemetry:SummaryManager:ElectedClientNotSummarizing | | 293 |
| fluid:telemetry:SummaryStatus:Behind | | 97 |
| fluid:telemetry:Summarizer:HandleSummaryAckError | 0x1a6 | 29 |
| fluid:telemetry:DeltaManager:GetDeltas_Exception | Failed to retrieve ops from storage: too many retries | 3 |
| fluid:telemetry:Container:ContainerClose | Failed to retrieve ops from storage: too many retries | 3 |
| | MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 removeMember listeners added to [QuorumProxy]. Use emitter.setMaxListeners() to increase limit | 152 |
Some analysis. First, the likely less interesting issues - we need to look into them all, but my intuition / knowledge tells me they are not blocking scalability testing and are not (by themselves) the issues that are manifesting to users:
Which leaves us with these more important things to look into:
Looking at how long it takes to hear an ack in successful cases - I believe this is captured in the timeWaiting property of the fluid:telemetry:Summarizer:Running:Summarize_end event. There are 8283 events; the 99th percentile is at 70s. There are 124 cases over a minute, and it cuts off at 2 minutes (due to the timeout). If we assume equal probabilities of any duration in a fixed range and no other reason for SummaryAckWaitTimeout, then 432 timeouts (roughly 3.5x the 124 cases seen in the 1-2 minute bucket) would mean some summaries take up to 3.5 minutes to hear an ack. It is hard (without looking deeper / having access to the resulting files) to know if an ack was ever issued or not, but given we do not see other failures (including using the wrong parent summary), I assume these timeouts actually never result in an ack.
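For reference, a minimal sketch of the kind of query behind these numbers, assuming timeWaiting is recorded in milliseconds under customDimensions.timeWaiting:

```
// Sketch: distribution of time waiting for an ack on successful summaries.
// Assumes timeWaiting is recorded in milliseconds under customDimensions.timeWaiting.
customEvents
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "Summarizer:Running:Summarize_end"
| extend timeWaiting = toint(customDimensions.timeWaiting)
| summarize count(), percentile(timeWaiting, 99), countif(timeWaiting > 60000)
```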
@arinwt - FYI, please take a look. Would be great if you can double check my logic. Some feedback to you:
Trying to better understand SummaryAckWaitTimeout.
First, I've looked at whether SPO calls around summarization are slow in general, i.e. the client side of posting summaries - there are no hits above 5 seconds:
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "uploadSummary_end"
| extend duration = toint(customDimensions.duration)
| where duration > 5000
| summarize count()
```
Picking one file that saw 5 timeouts (the max per file) - all 5 happened in a row, i.e. we did not hear anything from storage for 10 minutes total, and the next summary after that actually summarized successfully, meaning the timed-out summaries were never actually acked under the covers! Three more files are in the same bucket, but one saw errors and successes mixed.
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "uploadSummary_end" or eventName contains "Timeout"
| where customDimensions.docId == "1kZ9k%2FOYKxEKShYZ7OSxLi8%2Bi3hUGUZt8DopSkNIA5M%3D"
| project timestamp, eventName, toint(customDimensions.summarySequenceNumber), tostring(customDimensions.sprequestguid)
| order by timestamp asc
```
Given that we do not log sprequestguid
for successful calls, and we can't log SPO's itemID, I can't provide any identification that SPO can use other than PUSH ids (clientId, socket document ID).
And here is how such failures look in time across all files (1m granularity):
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "Timeout"
| summarize count() by bin(timestamp, 1m)
| render timechart
```
Compare that with total number of summaries going through the system in this run:
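A sketch of the kind of query behind that comparison chart, counting completed summaries (Summarize_end events) per minute:

```
// Sketch: completed summaries per minute across all files in this run.
customEvents
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "Summarize_end"
| summarize count() by bin(timestamp, 1m)
| render timechart
```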
Based on the graphs, something happened around 5:12-5:14, when we had the highest load of summaries; they kept failing from that moment till the end of the run (though the spike was concentrated in a smaller window around where it started).
Bigger run: testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
Latest issue identified: https://github.com/microsoft/FluidFramework/issues/6596
Lack of acks in telemetry (with recent changes) makes it really hard to see the progression of events in a file. I.e. when we get an ack timeout, I want to be able to see if an ack was observed later for that summary (by that client or other clients). I do not see any events with "ack" in them other than nacks & timeouts in the latest runs.
Acks are now "Summarize_end" events with message "summaryAck". The message is actually redundant, since all other code paths will result in "Summarize_cancel".
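A minimal sketch of a query to check that for a given document, assuming the "summaryAck" marker surfaces as customDimensions.message (testUid/docId are placeholders to fill in):

```
// Sketch: line up ack-wait timeouts and eventual acks for one document.
// Assumes the "summaryAck" marker is emitted as customDimensions.message.
customEvents
| where customDimensions.testUid == "<testUid>"   // fill in the run id
| where customDimensions.docId == "<docId>"       // fill in the document id
| extend eventName = tostring(customDimensions.eventName),
         message = tostring(customDimensions.message)
| where eventName contains "SummaryAckWaitTimeout"
    or (eventName contains "Summarize_end" and message == "summaryAck")
| project timestamp, eventName, message, toint(customDimensions.summarySequenceNumber)
| order by timestamp asc
```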
Do we log the number of ops we were summarizing? In a situation where there are some summary failures (which is any long stress run), the primary metric for giving (or not giving) a green light is the max unsummarized op tail at any moment in time. Leaving the end state of the system aside for now, the max unsummarized op count will tell us that. Do we log it?
We do sort of log this in "GenerateSummary" events with the property "opsSinceLastSummary" - this is the count of ops since the last summary ack, referring to the corresponding summarize op's reference sequence number, so we actually are counting the ops being summarized. There is also "opsSinceLastAttempt", which counts ops since we last tried to summarize.
Note that the "GenerateSummary" event should fire even if the summary process fails.
I do think there's a need for correlating summarize attempts with their retry attempts and eventual failure. It's possible now (summaryGenTag increments by 1 with each retry, the container id remains the same, and the reason changes from X to "retry1" to "retry2"), but it is not feasible for aggregation.
I think the solution could be as simple as not incrementing summarizeCount (summaryGenTag) on retries. This works well for "Summarize" events, but for any other downstream events that leverage the summaryLogger, it might be better to have a 2-part schema for summaryGenTag: "3.2" for the second retry (annoying to query), or we could just add another property, "retry", that increments on the logger, etc. Either way, the solution will probably involve adding another property to the logger.
Also, I can confirm that your analysis looks correct to me. I think the main point about "SummaryAckWaitTimeout" leading to "FailToSummarize" is almost definitely right, and it is probably related to server downtime as indicated in the other issue.
I do think there are probably missing acks in many of these cases as well, but in at least some cases the client can recover on retry. This is not clearly indicated in the telemetry you linked, but since documents are recovering, I think it still holds true.
Latest TAP40 run: 980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630
Key errors (counts above 500), grouped into related groups:
| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:ElectedClientNotSummarizing | | 658,485 |
| fluid:telemetry:SummaryStatus:Behind | | 40,387 |
| fluid:telemetry:Container:ContainerWarning | Nack: Submit a summary before inserting additional operations | 33,611 |
| fluid:telemetry:Summarizer:Running:SummaryAckWaitTimeout | | 4,970 |
| fluid:telemetry:Summarizer:Running:FailToSummarize | | 842 |
| fluid:telemetry:Container:storage_uploadSummaryWithContext | Error 404 (Not Found) | 504 |

| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:DeltaManager:GetDeltas_Exception | Failed to retrieve ops from storage: too many retries | 10,175 |

| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:SummaryManager:CreateSummarizerError | Container was closed while load() | 4,352 |
| UnhandledPromiseRejection | 0x0d0 | 4,346 |
| RunnerFailed | disposed | 2,482 |
| fluid:telemetry:Container:ContainerClose | 0x07b | 517 |
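These tables come from the same shape of error-summary query used elsewhere in this thread, scoped to this run and filtered to counts above 500 - a sketch:

```
// Sketch: error summary for the TAP40 run, keeping only counts above 500.
customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| where customDimensions.category == "error"
| extend eventName = tostring(customDimensions.eventName), error = tostring(customDimensions.error)
| summarize count() by eventName, error
| where count_ > 500
| order by count_ desc
```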
TAP30: c79d0f4c-1019-4b29-9a2c-275475d1d0e8. It looks much cleaner - other than fluid:telemetry:ElectedClientNotSummarizing & fluid:telemetry:SummaryManager:CreateSummarizerError (same as above), it does not have any sizable summarization errors.
One more thing to add - fluid:telemetry:Summarizer:Running:Summarize_cancel escaped my queries because it does not have category = "error".
Here are the top errors:

| Event | Error | Category | Count |
| --- | --- | --- | --- |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | Failed to summarize the document. | generic | 54,727 |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | disconnected | generic | 18,665 |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | | generic | 8,297 |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | Error 404 (Not Found) | generic | 5 |
fluid:telemetry:Summarizer:Running:GenerateSummary is similar - it logs successes & failures into one bucket:
| Event | Error | Category | Count |
| --- | --- | --- | --- |
| fluid:telemetry:Summarizer:Running:GenerateSummary | | generic | 535,120 |
| fluid:telemetry:Summarizer:Running:GenerateSummary | disconnected | generic | 18,665 |
| fluid:telemetry:Summarizer:Running:GenerateSummary | {"message":"Error 404 (Not Found)","code":""} | generic | 504 |
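A sketch of a broader query that drops the category == "error" filter so these events are not missed (testUid is a placeholder for the run id):

```
// Sketch: count Summarize_cancel and GenerateSummary outcomes without requiring category == "error".
customEvents
| where customDimensions.testUid == "<testUid>"   // fill in the run id
| extend eventName = tostring(customDimensions.eventName),
         error = tostring(customDimensions.error),
         category = tostring(customDimensions.category)
| where eventName contains "Summarize_cancel" or eventName contains "GenerateSummary"
| summarize count() by eventName, error, category
| order by count_ desc
```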
It looks like all the issues (including "too many retries") are connected to PUSH being throttled. And we are hitting 10K unsummarized ops all the time, so the system (as it stands) is beyond capacity. Throttling limits are being evaluated, but we also need a more reliable mechanism to deal with situations like that. One of the biggest problems today is that there is no visibility on the client side when PUSH is throttled while flushing ops.
One thing to consider:
If the client had the ability to fetch ops from Redis (indirectly, possibly through fetching ops from SPO), then (1) would have a recovery path. It would also allow PUSH not to flush ops so frequently (on client connection), and thus reduce the chances of being throttled and reduce the impact on summary failures. It should substantially reduce pressure.
That might also be a stepping stone for single-phase summary commit, i.e. SPO pulling ops from PUSH on summary upload. This might eliminate almost all the paths where PUSH flushes ops ("no clients in session" would be the primary remaining reason for flushes; flushing on a timer would be rare, as most of the time a client would summarize and cause a flush that way, also receiving throttling feedback from SPO that way).
We should also consider an alternative in the form of push (i.e. the client asking PUSH to flush ops). With proper feedback (success/failure reporting from PUSH), the client would be able to leverage it in the same workflows (gap in ops, summarization). It is a bit harder to stage and a bit more complicated protocol-wise, but possibly simpler on the server side, with fewer dependencies of SPO on PUSH.
A look at overall activity in the TAP40 run (resolution: 1 minute).
Non-error & error rates (blue - not errors, orange - errors):
Container loading rate (blue - regular client, orange - summarizer clients):
Disconnect rates that drive summarizer reboots (blue - regular clients, orange - summarizer clients):
And "Failed to summarize document" errors from PUSH Note that these errors start around the time when we experience wave of disconnects for summarizer, but they keep going till the end of the run.
Here are all summarization attempts - successful (blue) and not (orange, most of them are the "Failed to summarize document" errors from above):
There were 33K non-summarizing containers, and 56K summarizing containers!
Here are all the reasons for disconnects:

| Reason | Count |
| --- | --- |
| socket.io:server_disconnect | 124,987 |
| Nack: Readonly client | 90,457 |
| Disconnect: ping timeout | 35,494 |
| Nack: Submit a summary before inserting additional operations | 33,612 |
| Container closed | 30,624 |
| Disconnect: transport close | 19,492 |
| Failed to retrieve ops from storage: too many retries | 10,090 |
Same, but only looking at summarizing containers:

| Reason | Count |
| --- | --- |
| socket.io:server_disconnect | 42,125 |
| Container closed | 8,572 |
| Disconnect: ping timeout | 3,349 |
| Disconnect: transport close | 1,939 |
| Failed to retrieve ops from storage: too many retries | 398 |
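For reproducibility, a rough sketch of the kind of query behind these two tables; the disconnect event name and the "reason" / "clientType" dimensions are assumptions about the telemetry schema and may need adjusting:

```
// Rough sketch only: the disconnect event name and the reason / clientType
// dimensions are assumptions about the schema, not confirmed property names.
customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| extend eventName = tostring(customDimensions.eventName),
         reason = tostring(customDimensions.reason),
         clientType = tostring(customDimensions.clientType)
| where eventName contains "Disconnect"
| where clientType contains "summarizer"   // drop this line for the all-clients view
| summarize count() by reason
| order by count_ desc
```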
The bits from main moved logging of the reason for server disconnects into the socketError property, but I see nothing there. That said, from previous runs it's pretty clear it's token expiry:
Disconnect: socket.io: server_disconnect: TokenExpired
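A minimal sketch of a query to check whether anything lands in socketError for this run (assuming it surfaces under customDimensions):

```
// Sketch: look for any non-empty socketError values in this run.
customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| extend eventName = tostring(customDimensions.eventName),
         socketError = tostring(customDimensions.socketError)
| where isnotempty(socketError)
| summarize count() by eventName, socketError
```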
Disconnects and their impact on stress tests are tracked in https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true. This seems like a test config issue, given such a huge spike of expirations at exactly the same time - the intention is to spread out token expirations to better reflect real usage. The data suggests the test infrastructure is not doing what it is supposed to do.
I'm closing this issue as it has served its purpose. All the issues identified here are logged as their respective GitHub issues. The epic issue will continue to track overall progress and any new investigations on an as-needed basis.
Opening a tracking issue to investigate the results of the latest ODSP stress test run:
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| where customDimensions.category == "error"
| extend eventName = tostring(customDimensions.eventName)
| extend error = tostring(customDimensions.error)
| summarize count() by eventName, error
```
Logs