Closed vladsud closed 4 years ago
PR #1919 - main (correct) PR requiring changed PUSH side. Hot fix we may take if main fix take long: PR #1918
@GaryWilber, @viz-msft, can we get some update here? Main PR is still not merged. What is the status on PUSH side?
PR #1919 is now merged.
Validated nonces are flowing in both directions
I do not think the problem is fully solved. This query still shows we hit duplicate client Ids.
union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_Performance | where Data_eventName == "fluid:telemetry:Container:ConnectionStateChange_Connected" | where Data_clientType == "interactive" | where Event_Time > ago(5d) | project Event_Time, Data_clientId, Data_clientType, Data_docId | join ( union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_Performance | where Data_eventName == "fluid:telemetry:Container:ConnectionStateChange_Connected" | where Data_clientType == "noninteractive/summarizer" | project Summarizer_Time = Event_Time, Data_clientId, Data_clientType, Data_docId ) on Data_docId, Data_clientId | summarize count() by bin(Event_Time, 1h) | render timechart
I think it's fine to close it if we link to a bug tracked somewhere else, to allow people to continue tracking status.
Errors are still in the system, but very early data from 0.16 suggest the problem is gone. I'd love to wait for these bits to hit production (today) and some saturation (tomorrow) and reassess telemetry before closing. Alternative PR (solving this problem through client-only hack) is closed.
Latest here - Fluid is on 0.16.1 and never picked up client side of the fix. So we are waiting for 0.17 and it eventually going to production, which is still a week away.
I believe 3 cases of errors mentioned in Issue #2139 are result of this bug:
Incoming remote op sequence# <= local collabWindow's currentSequence# Incoming remote op sequence# <= local collabWindow's currentSequence# Incoming local client seq# > generated by this client
Looking at the client code, it appears the fix is in 0.16.x: https://github.com/microsoft/FluidFramework/blob/release/0.16.x/packages/drivers/odsp-socket-storage/src/odspDocumentDeltaConnection.ts Bohemia took the lastest, 0.16.x this week. so i think they should have the fix now, but not sure how far that got deployed
16.5 was pushed to production yesterday, I believe.
Here is distribution of performance events (as a proxy of what builds are out there), over last 24 hours:
0.16.1 - 41,784 0.16.5 - 24,431 0.15.25266 - 3,238 0.15.24931 - 217 0.15.24351 - 108 0.15.23736 - 86 0.15.24728 - 28 0.15.23641 - 21
There are 152 hits of duplicate clientId over last 24 hours, but they are all in 0.16.1. So I think the data indicates that it indeed is finally addressed, but we have long pole of old builds in circulation that it will take time to flush from the system to stop seeing this problem.
BTW, @arinwt, the data also suggests it was regression of 0.16. Given that original bug was there all along, I believe we started to hit it due to summarizer booting faster / next to main container. That by itself is Ok (and duplicate clientId bug needs to be addressed anyway), but from responsiveness POV, do we want to boot summarizing container right away next to boot of main container? Or is this only at play on reconnects? Later should also not happen as summarizer should not reconnect. Something to take a closer look.
@vladsud, I believe this regression is because of reconnect scenario. The change allowed for summarizer to not automatically close itself before connecting if parent container disconnects (i.e. so summarizer can live a little longer). When parent reconnects while summarizer connects/reconnects, this is when the clientIds frequently get stolen/confused.
There is already a 5 second delay before spawning the summarizer, unless there are 4k+ outstanding ops.
Also, because we had a lot of issues causing summarization to fail, leading several docs to enter the >5k outstanding ops scenario. Once in this scenario, we bypass the delay, and enter short reconnect cycles. Also like you said, summarizer container shouldn't be reconnecting, but that was happening too.
It looks like we are no longer seeing this error on any new documents on 16.5 The ones still emitting errors look like they were corrupted in 16.1
Query: Office_Fluid_FluidRuntime_Error | where Data_error has "Incoming remote op sequence# <= local collabWindow's currentSequence#" and Data_loaderVersion startswith "0.16" | summarize make_set(Session_Id, 999999)
let sessionids = dynamic([ "7e89c480-e108-4c4a-a699-b4c5f28e195d", "41454d8c-2b39-4cd2-ab7e-e48f404d9095", "242e0ff2-bd0e-46e8-a15c-df15e7621d58", "8743a70b-3e68-40d7-b1fd-6bae6aee51cd", "bd55c8e9-bc5b-4309-a721-fc6c70ddedd6", "839e343a-ffb4-4965-8e63-72eb33a4b7a7", "8013d4cb-b4be-4ec0-9ca8-09258a5d4c46", "f6410115-c38f-4ba9-9254-4ec496c0defa", "53d9714d-0a33-4663-b0fa-de349fb1541b", "ae7b3576-c000-4c7a-919a-5b120e1bf8a2", "1a3e4ebf-a82a-4988-bc2f-c2904eeebdf5", "ba5b7c88-9617-4e24-b75f-915bb72907c2", "09719b9c-032e-430f-b378-ed4f1d132e91", "5bab16ac-3e8c-4662-96f5-b9410c4fb3a3", "9d0ad4dd-f768-409f-9f9e-144fb1da678f", "41f083bb-b41d-4bdf-a24e-c5800cee6438", "7b494ab2-d6f8-4943-b95b-33cc294ee405", "1f4346a6-691d-4786-88cb-8bc67f8f7d3d", "3f957730-2e0a-4bad-8d44-5f19fca9d282", "4a8f4312-eb1f-4a33-8c9d-d3583ba609f7", "58687dcd-1328-493c-9893-10f5991d1f9e", "f070aa10-6bdb-47e7-96e8-84bf034c6ced", "afec94ab-0983-4d76-a562-c3857cdce7ea", "9724eef0-9ee9-4c68-a9ed-66b230926971", "a9cae135-312d-477f-a2ec-d34a4382df42", "4bdf8fce-b578-441d-b110-cc98367d1138", "7ad6c321-4255-4a96-bff9-cb4838f5b800", "1d62cf37-1504-4a97-8d07-8fc19b4f16ce", "e5098284-5208-4b6e-ac90-3115c57f522b", "534031c4-81e1-482a-b617-095f90287b41", "979a9bf1-3cda-42aa-ab4a-1054aa985351", "22410b07-b8a5-4d03-994a-8a379146d486", "14ece152-226e-4c2a-ae36-5143142cd06b", "2c1d222f-eba9-4937-8829-707a0d167d4b", "c591c10a-9b36-416b-8bbc-063f3701e577", "d741bdd1-b398-4abf-a44c-fdffbd54222b", "b1bc78c9-9636-4f26-9728-4178ec81920b", "66316d1d-dccc-4775-8849-4284a50fc9bb", "3aa25723-3722-4fdf-ad94-b8e0743c7b02", "6a7051a3-6b77-4581-bf99-a555abccf6ca", "c34c1df8-edfe-4731-bdfd-eb556678cd49", "58164959-cd43-46a7-8fae-d58f84fcbc39", "32e7a79a-f82c-4c46-b2e0-4de4b23029dc", "7544c9d4-2a61-4df7-b16d-ce4f2935721b", "f3dcff00-ad33-43be-918e-7b097fb33aae", "3c0d4a64-bf1f-4605-8893-1d77de0702dd", "b77b6f3e-79bc-4772-b7e8-403fa18e82d4", "c8454503-aa34-42d5-b111-008b98c9de20", "b8b70844-57c2-4e0a-b112-32d2fb2dddcc", "4a527959-4f89-4529-b1b2-72403902bbbe", "1dca7335-356d-4036-af10-cd87ac1dace0", "b8caf296-c67f-4e71-a249-673158809312", "c7c2c643-8375-49de-94b1-eebcf3da4a30" ]); Office_Fluid_FluidRuntime_Error | where Session_Id in (sessionids) | summarize count(), makeset(Data_loaderVersion) by Data_docId
Due to a race condition, summarizer & main client can get assigned same clientId. That completely screws clients.