microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License
4.73k stars 532 forks source link

Duplicate clientsIds problem #1923

Closed vladsud closed 4 years ago

vladsud commented 4 years ago

Due to a race condition, summarizer & main client can get assigned same clientId. That completely screws clients.

vladsud commented 4 years ago

PR #1919 - main (correct) PR requiring changed PUSH side. Hot fix we may take if main fix take long: PR #1918

vladsud commented 4 years ago

@GaryWilber, @viz-msft, can we get some update here? Main PR is still not merged. What is the status on PUSH side?

vladsud commented 4 years ago

Teams thread

agarwal-navin commented 4 years ago

PR #1919 is now merged.

anthony-murphy commented 4 years ago

Validated nonces are flowing in both directions

vladsud commented 4 years ago

I do not think the problem is fully solved. This query still shows we hit duplicate client Ids.

union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_Performance | where Data_eventName == "fluid:telemetry:Container:ConnectionStateChange_Connected" | where Data_clientType == "interactive" | where Event_Time > ago(5d) | project Event_Time, Data_clientId, Data_clientType, Data_docId | join ( union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_Performance | where Data_eventName == "fluid:telemetry:Container:ConnectionStateChange_Connected" | where Data_clientType == "noninteractive/summarizer" | project Summarizer_Time = Event_Time, Data_clientId, Data_clientType, Data_docId ) on Data_docId, Data_clientId | summarize count() by bin(Event_Time, 1h) | render timechart

I think it's fine to close it if we link to a bug tracked somewhere else, to allow people to continue tracking status.

vladsud commented 4 years ago

Errors are still in the system, but very early data from 0.16 suggest the problem is gone. I'd love to wait for these bits to hit production (today) and some saturation (tomorrow) and reassess telemetry before closing. Alternative PR (solving this problem through client-only hack) is closed.

vladsud commented 4 years ago

Latest here - Fluid is on 0.16.1 and never picked up client side of the fix. So we are waiting for 0.17 and it eventually going to production, which is still a week away.

vladsud commented 4 years ago

I believe 3 cases of errors mentioned in Issue #2139 are result of this bug:

Incoming remote op sequence# <= local collabWindow's currentSequence# Incoming remote op sequence# <= local collabWindow's currentSequence# Incoming local client seq# > generated by this client

anthony-murphy commented 4 years ago

Looking at the client code, it appears the fix is in 0.16.x: https://github.com/microsoft/FluidFramework/blob/release/0.16.x/packages/drivers/odsp-socket-storage/src/odspDocumentDeltaConnection.ts Bohemia took the lastest, 0.16.x this week. so i think they should have the fix now, but not sure how far that got deployed

vladsud commented 4 years ago

16.5 was pushed to production yesterday, I believe.

Here is distribution of performance events (as a proxy of what builds are out there), over last 24 hours:

0.16.1 - 41,784 0.16.5 - 24,431 0.15.25266 - 3,238 0.15.24931 - 217 0.15.24351 - 108 0.15.23736 - 86 0.15.24728 - 28 0.15.23641 - 21

There are 152 hits of duplicate clientId over last 24 hours, but they are all in 0.16.1. So I think the data indicates that it indeed is finally addressed, but we have long pole of old builds in circulation that it will take time to flush from the system to stop seeing this problem.

BTW, @arinwt, the data also suggests it was regression of 0.16. Given that original bug was there all along, I believe we started to hit it due to summarizer booting faster / next to main container. That by itself is Ok (and duplicate clientId bug needs to be addressed anyway), but from responsiveness POV, do we want to boot summarizing container right away next to boot of main container? Or is this only at play on reconnects? Later should also not happen as summarizer should not reconnect. Something to take a closer look.

arinwt commented 4 years ago

@vladsud, I believe this regression is because of reconnect scenario. The change allowed for summarizer to not automatically close itself before connecting if parent container disconnects (i.e. so summarizer can live a little longer). When parent reconnects while summarizer connects/reconnects, this is when the clientIds frequently get stolen/confused.

There is already a 5 second delay before spawning the summarizer, unless there are 4k+ outstanding ops.

Also, because we had a lot of issues causing summarization to fail, leading several docs to enter the >5k outstanding ops scenario. Once in this scenario, we bypass the delay, and enter short reconnect cycles. Also like you said, summarizer container shouldn't be reconnecting, but that was happening too.

sumedhb1995 commented 4 years ago

It looks like we are no longer seeing this error on any new documents on 16.5 The ones still emitting errors look like they were corrupted in 16.1 MicrosoftTeams-image (6)

Query: Office_Fluid_FluidRuntime_Error | where Data_error has "Incoming remote op sequence# <= local collabWindow's currentSequence#" and Data_loaderVersion startswith "0.16" | summarize make_set(Session_Id, 999999)

let sessionids = dynamic([   "7e89c480-e108-4c4a-a699-b4c5f28e195d",   "41454d8c-2b39-4cd2-ab7e-e48f404d9095",   "242e0ff2-bd0e-46e8-a15c-df15e7621d58",   "8743a70b-3e68-40d7-b1fd-6bae6aee51cd",   "bd55c8e9-bc5b-4309-a721-fc6c70ddedd6",   "839e343a-ffb4-4965-8e63-72eb33a4b7a7",   "8013d4cb-b4be-4ec0-9ca8-09258a5d4c46",   "f6410115-c38f-4ba9-9254-4ec496c0defa",   "53d9714d-0a33-4663-b0fa-de349fb1541b",   "ae7b3576-c000-4c7a-919a-5b120e1bf8a2",   "1a3e4ebf-a82a-4988-bc2f-c2904eeebdf5",   "ba5b7c88-9617-4e24-b75f-915bb72907c2",   "09719b9c-032e-430f-b378-ed4f1d132e91",   "5bab16ac-3e8c-4662-96f5-b9410c4fb3a3",   "9d0ad4dd-f768-409f-9f9e-144fb1da678f",   "41f083bb-b41d-4bdf-a24e-c5800cee6438",   "7b494ab2-d6f8-4943-b95b-33cc294ee405",   "1f4346a6-691d-4786-88cb-8bc67f8f7d3d",   "3f957730-2e0a-4bad-8d44-5f19fca9d282",   "4a8f4312-eb1f-4a33-8c9d-d3583ba609f7",   "58687dcd-1328-493c-9893-10f5991d1f9e",   "f070aa10-6bdb-47e7-96e8-84bf034c6ced",   "afec94ab-0983-4d76-a562-c3857cdce7ea",   "9724eef0-9ee9-4c68-a9ed-66b230926971",   "a9cae135-312d-477f-a2ec-d34a4382df42",   "4bdf8fce-b578-441d-b110-cc98367d1138",   "7ad6c321-4255-4a96-bff9-cb4838f5b800",   "1d62cf37-1504-4a97-8d07-8fc19b4f16ce",   "e5098284-5208-4b6e-ac90-3115c57f522b",   "534031c4-81e1-482a-b617-095f90287b41",   "979a9bf1-3cda-42aa-ab4a-1054aa985351",   "22410b07-b8a5-4d03-994a-8a379146d486",   "14ece152-226e-4c2a-ae36-5143142cd06b",   "2c1d222f-eba9-4937-8829-707a0d167d4b",   "c591c10a-9b36-416b-8bbc-063f3701e577",   "d741bdd1-b398-4abf-a44c-fdffbd54222b",   "b1bc78c9-9636-4f26-9728-4178ec81920b",   "66316d1d-dccc-4775-8849-4284a50fc9bb",   "3aa25723-3722-4fdf-ad94-b8e0743c7b02",   "6a7051a3-6b77-4581-bf99-a555abccf6ca",   "c34c1df8-edfe-4731-bdfd-eb556678cd49",   "58164959-cd43-46a7-8fae-d58f84fcbc39",   "32e7a79a-f82c-4c46-b2e0-4de4b23029dc",   "7544c9d4-2a61-4df7-b16d-ce4f2935721b",   "f3dcff00-ad33-43be-918e-7b097fb33aae",   "3c0d4a64-bf1f-4605-8893-1d77de0702dd",   "b77b6f3e-79bc-4772-b7e8-403fa18e82d4",   "c8454503-aa34-42d5-b111-008b98c9de20",   "b8b70844-57c2-4e0a-b112-32d2fb2dddcc",   "4a527959-4f89-4529-b1b2-72403902bbbe",   "1dca7335-356d-4036-af10-cd87ac1dace0",   "b8caf296-c67f-4e71-a249-673158809312",   "c7c2c643-8375-49de-94b1-eebcf3da4a30" ]); Office_Fluid_FluidRuntime_Error | where Session_Id in (sessionids) | summarize count(), makeset(Data_loaderVersion) by Data_docId