microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License
4.71k stars 532 forks source link

Stricter modeling of Socket.io errors based on ODSP and FRS error contracts #7816

Closed markfields closed 1 year ago

markfields commented 2 years ago

Socket.io errors need to be better modeled and converted to Fluid Errors with meaningful error codes and other metadata extracted from the variable data that appears in message.

Originally posted by @markfields in https://github.com/microsoft/FluidFramework/pull/7788/files#r727371909

Other spots/notes:

markfields commented 2 years ago

Related: #7670 (started JSON.stringify websocket errors into the message property, which I'm not sure is best) and #8265

markfields commented 2 years ago

Moving to Jan to make progress on the design. Lots of churn here which is making aggregate analysis in telemetry difficult, on top of an already difficult area to analyze.

vladsud commented 2 years ago

@markfields, this item looks lower priority to me that other telemetry & connectivity issues. I'd suggest pushing it to backlog. Maybe other way to say it - it's nice to have, can you formulate what would happen if we (let's say) never address it?

markfields commented 2 years ago

@vladsud - The goal for this month is to remove design-required label, and maybe during that design process I will realize we are ok now. I know you have been changing this instrumentation a lot in recent releases, and I'm not fully caught up.

But historically these errors have defied aggregation but are very often prominent in the errors list, which makes errors analysis difficult. I'd like to see us be a bit more rigid in terms of how we model expected values coming in, and then transform those into a well-known. well-formed set of errors on the other side. Rather than using message as a free-form place to dump values directly coming over the wire.

Full design here is blocked by design for #8260, specifically around how to use errorType/message/fluidErrorCode.

markfields commented 2 years ago

With the removal of fluidErrorCode (#9249), I started an informal practice of prefixing an inner error message such as we get from Sockets with an aggregation-friendly message and a :, which lets us aggregate in telemetry reasonably well.

This meets the main pain point I opened this issue to address, but I think we swing to over-aggregating. I believe we would do well to establish some reasonable error contract with ODSP and FRS beyond just status code and message (that's what we have today) so we can use more-specific aggregation-friendly error message to cover known error cases.

But I'll move this to future milestone, and won't feel bad if we close if it sits inactive for some time.

microsoft-github-policy-service[bot] commented 1 year ago

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

markfields commented 1 year ago

@RishhiB - FYI on this old issue (which I'm closing since it's stale). It's relevant to the investigation you're doing about errors that affect the connection.