| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:Summarizer:Running:FailToSummarize | | 443 |
| fluid:telemetry:Summarizer:Running:SummaryAckWaitTimeout | | 432 |
| fluid:telemetry:SummaryManager:ElectedClientNotSummarizing | | 293 |
| fluid:telemetry:SummaryStatus:Behind | | 97 |
| fluid:telemetry:Summarizer:HandleSummaryAckError | 0x1a6 | 29 |
| fluid:telemetry:DeltaManager:GetDeltas_Exception | Failed to retrieve ops from storage: too many retries | 3 |
| fluid:telemetry:Container:ContainerClose | Failed to retrieve ops from storage: too many retries | 3 |
| | MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 removeMember listeners added to [QuorumProxy]. Use emitter.setMaxListeners() to increase limit | 152 |
Some analysis. First, the likely less interesting issues - we need to look into them all, but my intuition / knowledge tells me they are not blocking scalability testing and are not (by themselves) the issues that are manifesting to users:
Which leaves us with these more important things to look into:
Looking at how long it takes to hear an ack in successful cases - I believe this is captured in the timeWaiting property of the fluid:telemetry:Summarizer:Running:Summarize_end event. There are 8283 events; the 99th percentile is at 70s. There are 124 cases over a minute, and it cuts off at 2 minutes (due to the timeout). If we assume equal probabilities of any duration in a fixed range and no other reason for SummaryAckWaitTimeout, then 432 timeouts (roughly 3.5x the 124 cases seen in the 1-2 minute bucket) would mean some summaries take up to 3.5 minutes to hear an ack. It is hard (without looking deeper / having access to the resulting files) to know if an ack was ever issued or not, but given we do not see other failures (including using the wrong parent summary), I assume these timeouts actually never result in an ack.
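For reference, a minimal sketch of the kind of query behind these numbers, assuming timeWaiting is recorded in milliseconds under customDimensions.timeWaiting:

```
// Sketch: distribution of time waiting for an ack on successful summaries.
// Assumes timeWaiting is recorded in milliseconds under customDimensions.timeWaiting.
customEvents
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "Summarizer:Running:Summarize_end"
| extend timeWaiting = toint(customDimensions.timeWaiting)
| summarize count(), percentile(timeWaiting, 99), countif(timeWaiting > 60000)
```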
@arinwt - FYI, please take a look. Would be great if you can double check my logic. Some feedback to you:
Trying to better understand SummaryAckWaitTimeout.
First, I've looked at whether SPO calls around summarization are slow in general, i.e. the client side of posting summaries - there are no hits above 5 seconds:
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "uploadSummary_end"
| extend duration = toint(customDimensions.duration)
| where duration > 5000
| summarize count()
```
Picking one file that saw 5 timeouts (the max per file) - all 5 happened in a row, i.e. we did not hear anything from storage for 10 minutes total, and the next summary after that actually summarized successfully, meaning the timed-out summaries were never actually acked under the covers! Three more files are in the same bucket, but one saw errors and successes mixed.
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "uploadSummary_end" or eventName contains "Timeout"
| where customDimensions.docId == "1kZ9k%2FOYKxEKShYZ7OSxLi8%2Bi3hUGUZt8DopSkNIA5M%3D"
| project timestamp, eventName, toint(customDimensions.summarySequenceNumber), tostring(customDimensions.sprequestguid)
| order by timestamp asc
```
Given that we do not log sprequestguid
for successful calls, and we can't log SPO's itemID, I can't provide any identification that SPO can use other than PUSH ids (clientId, socket document ID).
And here is how such failures look in time across all files (1m granularity):
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "Timeout"
| summarize count() by bin(timestamp, 1m)
| render timechart
```
Compare that with total number of summaries going through the system in this run:
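A sketch of the kind of query behind that comparison chart, counting completed summaries (Summarize_end events) per minute:

```
// Sketch: completed summaries per minute across all files in this run.
customEvents
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| extend eventName = tostring(customDimensions.eventName)
| where eventName contains "Summarize_end"
| summarize count() by bin(timestamp, 1m)
| render timechart
```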
Based on the graphs, something happened around 5:12-5:14, when we had the highest load of summaries; they kept failing from that moment till the end of the run (though the spike was concentrated in a smaller window around where it started).
Bigger run: testUid == "2ff41d63-688e-4283-9b57-95a8c56cd2e9"
Latest issue identified: https://github.com/microsoft/FluidFramework/issues/6596
Lack of acks in telemetry (with recent changes) makes it really hard to see the progression of events in a file. I.e. when we get an ack timeout, I want to be able to see if an ack was observed later for that summary (by that client or other clients). I do not see any events with "ack" in them other than nacks & timeouts in the latest runs.
Acks are now "Summarize_end" events with message "summaryAck". The message is actually redundant, since all other code paths will result in "Summarize_cancel".
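A minimal sketch of a query to check that for a given document, assuming the "summaryAck" marker surfaces as customDimensions.message (testUid/docId are placeholders to fill in):

```
// Sketch: line up ack-wait timeouts and eventual acks for one document.
// Assumes the "summaryAck" marker is emitted as customDimensions.message.
customEvents
| where customDimensions.testUid == "<testUid>"   // fill in the run id
| where customDimensions.docId == "<docId>"       // fill in the document id
| extend eventName = tostring(customDimensions.eventName),
         message = tostring(customDimensions.message)
| where eventName contains "SummaryAckWaitTimeout"
    or (eventName contains "Summarize_end" and message == "summaryAck")
| project timestamp, eventName, message, toint(customDimensions.summarySequenceNumber)
| order by timestamp asc
```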
Do we log the number of ops we were summarizing? In a situation where there are some summary failures (which is any long stress run), the primary metric for giving (or not giving) a green light is the max unsummarized op tail at any moment in time. Leaving the end state of the system aside for now, the max unsummarized op count will tell us that. Do we log it?
We do sort of log this in "GenerateSummary" events with the property "opsSinceLastSummary" - this is the count of ops since the last summary ack, referring to the corresponding summarize op's reference sequence number, so we actually are counting the ops being summarized. There is also "opsSinceLastAttempt", which counts ops since we last tried to summarize.
Note that the "GenerateSummary" event should fire even if the summary process fails.
I do think there's a need for correlating summarize attempts with their retry attempts and eventual failure. It's possible now (summaryGenTag increments by 1 with each retry, the container id remains the same, and the reason changes from X to "retry1" to "retry2"), but it is not feasible for aggregation.
I think the solution could be as simple as not incrementing summarizeCount (summaryGenTag) on retries. This works well for "Summarize" events, but for any other downstream events that leverage the summaryLogger, it might be better to have a 2-part schema for summaryGenTag: "3.2" for the second retry (annoying to query), or we could just add another property, "retry", that increments on the logger, etc. Either way, the solution will probably involve adding another property to the logger.
Also, I can confirm that your analysis looks correct to me. I think the main point about "SummaryAckWaitTimeout" leading to "FailToSummarize" is almost definitely right, and it is probably related to server downtime as indicated in the other issue.
I do think there are probably missing acks in many of these cases as well, but in at least some cases the client can recover on retry. This is not clearly indicated in the telemetry you linked, but since documents are recovering, I think it still holds true.
Latest TAP40 run: 980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630
Key errors (counts above 500), grouped into related groups:
| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:ElectedClientNotSummarizing | | 658,485 |
| fluid:telemetry:SummaryStatus:Behind | | 40,387 |
| fluid:telemetry:Container:ContainerWarning | Nack: Submit a summary before inserting additional operations | 33,611 |
| fluid:telemetry:Summarizer:Running:SummaryAckWaitTimeout | | 4,970 |
| fluid:telemetry:Summarizer:Running:FailToSummarize | | 842 |
| fluid:telemetry:Container:storage_uploadSummaryWithContext | Error 404 (Not Found) | 504 |

| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:DeltaManager:GetDeltas_Exception | Failed to retrieve ops from storage: too many retries | 10,175 |

| Event | Error | Count |
| --- | --- | --- |
| fluid:telemetry:SummaryManager:CreateSummarizerError | Container was closed while load() | 4,352 |
| UnhandledPromiseRejection | 0x0d0 | 4,346 |
| RunnerFailed | disposed | 2,482 |
| fluid:telemetry:Container:ContainerClose | 0x07b | 517 |
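These tables come from the same shape of error-summary query used elsewhere in this thread, scoped to this run and filtered to counts above 500 - a sketch:

```
// Sketch: error summary for the TAP40 run, keeping only counts above 500.
customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| where customDimensions.category == "error"
| extend eventName = tostring(customDimensions.eventName), error = tostring(customDimensions.error)
| summarize count() by eventName, error
| where count_ > 500
| order by count_ desc
```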
TAP30: c79d0f4c-1019-4b29-9a2c-275475d1d0e8. It looks much cleaner - other than fluid:telemetry:ElectedClientNotSummarizing & fluid:telemetry:SummaryManager:CreateSummarizerError (same as above), it does not have any sizable summarization errors.
One more thing to add - fluid:telemetry:Summarizer:Running:Summarize_cancel escaped my queries because it does not have category = "error".
Here are the top errors:

| Event | Error | Category | Count |
| --- | --- | --- | --- |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | Failed to summarize the document. | generic | 54,727 |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | disconnected | generic | 18,665 |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | | generic | 8,297 |
| fluid:telemetry:Summarizer:Running:Summarize_cancel | Error 404 (Not Found) | generic | 5 |
fluid:telemetry:Summarizer:Running:GenerateSummary is similar - it logs successes & failures into one bucket:
| Event | Error | Category | Count |
| --- | --- | --- | --- |
| fluid:telemetry:Summarizer:Running:GenerateSummary | | generic | 535,120 |
| fluid:telemetry:Summarizer:Running:GenerateSummary | disconnected | generic | 18,665 |
| fluid:telemetry:Summarizer:Running:GenerateSummary | {"message":"Error 404 (Not Found)","code":""} | generic | 504 |
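A sketch of a broader query that drops the category == "error" filter so these events are not missed (testUid is a placeholder for the run id):

```
// Sketch: count Summarize_cancel and GenerateSummary outcomes without requiring category == "error".
customEvents
| where customDimensions.testUid == "<testUid>"   // fill in the run id
| extend eventName = tostring(customDimensions.eventName),
         error = tostring(customDimensions.error),
         category = tostring(customDimensions.category)
| where eventName contains "Summarize_cancel" or eventName contains "GenerateSummary"
| summarize count() by eventName, error, category
| order by count_ desc
```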
It looks like all the issues (including "too many retries") are connected to PUSH being throttled. And we are hitting 10K unsummarized ops all the time, so the system (as it stands) is beyond capacity. Throttling limits are being evaluated, but we also need a more reliable mechanism to deal with situations like that. One of the biggest problems today is that there is no visibility on the client side when PUSH is throttled while flushing ops.
One thing to consider:
If the client had the ability to fetch ops from Redis (indirectly, possibly through fetching ops from SPO), then (1) would have a recovery path. It would also allow PUSH not to flush ops so frequently (on client connection), and thus reduce the chances of being throttled and reduce the impact on summary failures. It should substantially reduce pressure.
That might also be a stepping stone for single-phase summary commit, i.e. SPO pulling ops from PUSH on summary upload. This might eliminate almost all the paths where PUSH flushes ops ("no clients in session" would be the primary remaining reason for flushes; flushing on a timer would be rare, as most of the time a client would summarize and cause a flush that way, also receiving throttling feedback from SPO that way).
We should also consider an alternative in the form of push (i.e. the client asking PUSH to flush ops). With proper feedback (success/failure reporting from PUSH), the client would be able to leverage it in the same workflows (gap in ops, summarization). It is a bit harder to stage and a bit more complicated protocol-wise, but possibly simpler on the server side, with fewer dependencies of SPO on PUSH.
A look at overall activity in the TAP40 run (resolution: 1 minute).
Non-error & error rates (blue - not errors, orange - errors):
Container loading rate (blue - regular client, orange - summarizer clients):
Disconnect rates that drive summarizer reboots (blue - regular clients, orange - summarizer clients):
And "Failed to summarize document" errors from PUSH Note that these errors start around the time when we experience wave of disconnects for summarizer, but they keep going till the end of the run.
Here are all summarization attempts - successful (blue) and not (orange, most of them are the "Failed to summarize document" errors from above):
There were 33K non-summarizing containers, and 56K summarizing containers!
Here are all the reasons for disconnects:

| Reason | Count |
| --- | --- |
| socket.io:server_disconnect | 124,987 |
| Nack: Readonly client | 90,457 |
| Disconnect: ping timeout | 35,494 |
| Nack: Submit a summary before inserting additional operations | 33,612 |
| Container closed | 30,624 |
| Disconnect: transport close | 19,492 |
| Failed to retrieve ops from storage: too many retries | 10,090 |
Same, but only looking at summarizing containers:

| Reason | Count |
| --- | --- |
| socket.io:server_disconnect | 42,125 |
| Container closed | 8,572 |
| Disconnect: ping timeout | 3,349 |
| Disconnect: transport close | 1,939 |
| Failed to retrieve ops from storage: too many retries | 398 |
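For reproducibility, a rough sketch of the kind of query behind these two tables; the disconnect event name and the "reason" / "clientType" dimensions are assumptions about the telemetry schema and may need adjusting:

```
// Rough sketch only: the disconnect event name and the reason / clientType
// dimensions are assumptions about the schema, not confirmed property names.
customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| extend eventName = tostring(customDimensions.eventName),
         reason = tostring(customDimensions.reason),
         clientType = tostring(customDimensions.clientType)
| where eventName contains "Disconnect"
| where clientType contains "summarizer"   // drop this line for the all-clients view
| summarize count() by reason
| order by count_ desc
```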
The bits from main moved logging of the reason for server disconnects into the socketError property, but I see nothing there. That said, from previous runs it's pretty clear it's token expiry:
Disconnect: socket.io: server_disconnect: TokenExpired
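A minimal sketch of a query to check whether anything lands in socketError for this run (assuming it surfaces under customDimensions):

```
// Sketch: look for any non-empty socketError values in this run.
customEvents
| where customDimensions.testUid == "980e2dc0-f8ef-4661-9d5d-5d1dc8ed4630"
| extend eventName = tostring(customDimensions.eventName),
         socketError = tostring(customDimensions.socketError)
| where isnotempty(socketError)
| summarize count() by eventName, socketError
```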
Disconnects and their impact on stress tests are tracked in https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true. This seems like a test config issue, given such a huge spike of expirations at exactly the same time - the intention is to spread out token expirations to better reflect real usage. The data suggests the test infrastructure is not doing what it is supposed to do.
I'm closing this issue as it has served its purpose. All the issues identified here are logged as their respective GitHub issues. The epic issue will continue to track overall progress and any new investigations on an as-needed basis.
Opening a tracking issue to investigate the results of the latest ODSP stress test run:
```
customEvents
| where timestamp between(datetime(2021-06-23T10:30:00+530) .. datetime(2021-06-23T12:30+530))
| where customDimensions.testUid == "aaea96a8-a370-47b3-b9b5-287a5240ef55"
| where customDimensions.category == "error"
| extend eventName = tostring(customDimensions.eventName)
| extend error = tostring(customDimensions.error)
| summarize count() by eventName, error
```
Logs