microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License
4.71k stars 532 forks source link

"SHA-1" error with no stack #1922

Closed arinwt closed 4 years ago

arinwt commented 4 years ago

Occurs during summarization.

arinwt commented 4 years ago

From Vlad's email

SHA-1 has interesting time-distribution (below). It comes and goes.

There a lot of docs affected, but over last day it’s mostly this one: JYIb4ycQBLFoejodz4rs9JPqKeNvGydrGQrr7%2BlfVb8%3D

This query looks into a session that generated all of the events for this doc:

union kind=outer database("Office Fluid").*Runtime*_*
| where Session_Id == "65c705a6-373f-4de7-a6f0-b32fd569b922"
| project Event_Time, Event_Sequence, Data_eventName, Data_error, Data_reason, Data_clientType

It hits SHA-1 in rapid succession, but then summarizer hits other problems (asserts, nacks from PUSH), so I doubt this doc ever summarized. What looks weird (feels like a bug, even though code should work just fine) is that summarizer reconnects as read-only.

If I had to guess, I’d think it has something to do with us using crypto in (see hashFile()). @arinwt, does safe summary path avoids dedup code in driver? If not, should there be some indication to a driver to invoke safest path if we are choosing recovery path on client side? Feels like no matter what’s the root cause of it, we should be able to recover.

arinwt commented 4 years ago

@vladsud - right now there is no indication to the driver that it should use safest path possible.

vladsud commented 4 years ago

Maybe there should be one? :) Or maybe there should not be, but the driver should self-adjust, i.e. if it is hitting an exception in blob reuse code path, it should ignore and move forward? Clearly we do not know where the error happens, but I think we can guess with good probability and maybe get us out of it (logging error to know it happened, but I think counts would be low so likely we will not look into details). What do you think?

arinwt commented 4 years ago

The cause for this issue is that Edge 18.x versions don't support SHA-1 (Edge Chromium seems to work). I found this blog post referencing it: https://docs.microsoft.com/en-us/archive/blogs/yurikasensei/sha1-users-guide (located by this SO question: https://stackoverflow.com/questions/49018153/subtlecrypto-digest-fails-for-sha-1-in-microsoft-edge)

I ran the following query with the following results:

Office_Fluid_FluidRuntime_Error
| where Data_eventName == "fluid:telemetry:Summarizer:Summarizing_cancel"
| where Data_error == "SHA-1"
| summarize count() by Data_tempBrowserInfo
| sort by Data_tempBrowserInfo;
Data_tempBrowserInfo count_
Edge 18.19603 639
Edge 18.19587 46
Edge 18.18363 22609
Edge 18.18362 1484
Edge 18.17763 516

If we support these versions of Edge(? I don't know) @heliocliu - maybe you can take over this? At least if we can get the stack on these error messages, that would be helpful in the future.

heliocliu commented 4 years ago

@curtisman Do you know the timeline for edge classic support? In these cases do we normally do a browser runtime check? Also that sha1 deprecation on windows should have just been for cryptographic usage and should still be ok for regular hashing, surprised they actually removed support for it (ran into this before when working on the Office updater)

vladsud commented 4 years ago

Can we do something cheap here? Like maybe catch exception and run through old path as a fallback? If we can avoid (adding) Node's buffer dependency, that would be awesome. The other potential solution - catch it at layer below and not do blob de-dupping SPO driver. Not ideal, but better then not summarizing at all.

@heliocliu , could you please help here and get something safe into release/0.15 ? It would be great to have something on Monday. Arin has some number of other critical issues to follow up.

heliocliu commented 4 years ago

Yeah I'm looking into it

heliocliu commented 4 years ago

@vladsud release/0.15 shows it's consuming version 0.14 of common-utils (and release/0.16 shows it's consuming 0.16). Does that mean I need to fix this in 0.14 and 0.16 instead of 0.15?

arinwt commented 4 years ago

If we don't actually support Edge classic, then maybe we can just ignore these errors and put the fix in master?

heliocliu commented 4 years ago

https://microsoft.sharepoint-df.com/:w:/t/OfficeOnlinePrague/EfSFFvhOOC9Lm_HdYOOdEfMBP2Crx5hTfdtrukEyAXzbFw?e=psSS82 says we don't support Edge Classic™ and it should eventually show a browser unsupported screen. If this is the case, I don't think we should make fixes to help support something we don't support

vladsud commented 4 years ago

This has some impact on Fluid Framework, but I'm not sure it means we do nothing. I.e. I think it has to be handled in some way. That maybe not supporting (and blocking at some level) Edge in Fluid Framework itself. Or handle this case in some reasonable way (fallback). If the fix is cheap, I'd rather fix it.

anthony-murphy commented 4 years ago

@curtisman i don't think this meets the hotpatch criteria, as i don't think the impact or severity is high for this issue.