w3c / network-error-logging

Network Error Logging
https://w3c.github.io/network-error-logging/

"Happy Eyeballs" failure reporting #175

Open enygren opened 1 month ago

enygren commented 1 month ago

RFC 8305 defines a "Happy Eyeballs" behavior that allows clients to try IPv6 first but fail over to IPv4 when IPv6 does not connect in a timely manner. There is a proposal to extend this further to handle the increasing number of endpoint candidates (e.g., QUIC, multiple SVCB records, etc.) which clients might try (see https://datatracker.ietf.org/doc/html/draft-pauly-v6ops-happy-eyeballs-v3-01).
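To make the fallback concrete, here is a minimal Python sketch. It is deliberately simplified: candidates are tried one after another, whereas RFC 8305 describes staggered, overlapping attempts. The function names, the `try_connect` callable, and the documentation-range addresses are all assumptions for illustration and are not taken from any browser implementation.

```python
import ipaddress
from itertools import zip_longest


def interleave_by_family(addresses):
    """Order candidates IPv6-first, alternating address families."""
    v6 = [a for a in addresses if ipaddress.ip_address(a).version == 6]
    v4 = [a for a in addresses if ipaddress.ip_address(a).version == 4]
    ordered = []
    for six, four in zip_longest(v6, v4):
        if six is not None:
            ordered.append(six)
        if four is not None:
            ordered.append(four)
    return ordered


def connect_with_fallback(addresses, try_connect):
    """Try candidates in order and return (winner, failed_attempts).

    try_connect is a hypothetical callable that returns True on success.
    """
    failed = []
    for addr in interleave_by_family(addresses):
        if try_connect(addr):
            return addr, failed
        failed.append(addr)
    return None, failed


def ipv6_is_broken(addr):
    """Simulated network in which only IPv4 connections succeed."""
    return ipaddress.ip_address(addr).version == 4


winner, failed = connect_with_fallback(
    ["192.0.2.1", "2001:db8::1", "2001:db8::2"], ipv6_is_broken
)
print("connected to:", winner)
print("failed attempts the server never sees:", failed)
```

The request succeeds over IPv4, so the only party that knows about the failed IPv6 attempt is the client.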

An ongoing operational concern is that while this Happy Eyeballs behavior greatly improves user experience (e.g., a user has multiple protocols and IP versions and IP addresses to try), it can hide partial failures in the network. For example, if a content provider (or ISP) has broken IPv6, they may not notice, as dual-stack users will fall back to IPv4.

It would be highly valuable if NEL could report these sorts of issues where a client had multiple endpoints it could connect to and some were unreachable.

Some considerations:

simon-friedberger commented 1 month ago

I think this is an interesting idea. The original problem for NEL was "a client cannot reach me and I want to know about this event" and then it includes information that might help with debugging issues. One concept being "If the client can reach me, I don't need NEL." Even though it is - AFAIR - not correct anymore, the spec still states:

To prevent information leakage, NEL reports about a request do not contain any information that is not visible to the server when processing the request.

With IPv4/IPv6 and h2/h3 and HTTPS upgrades there might be useful information which clients are not making available today, like "I tried IPv6 but it didn't work." Although you could argue that in this case maybe you should have gotten a report for your IPv6 endpoint.
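As a rough illustration of "a report for your IPv6 endpoint", the sketch below builds a report in the general shape NEL already uses for connection failures. The helper name and all values are hypothetical, and the field names follow the NEL draft's report body as I recall it. Whether a user agent should send such a report when the overall request succeeded over IPv4 is exactly the open question here.

```python
import json


def failed_attempt_report(url, server_ip, error_type, elapsed_ms):
    """Build a NEL-style report dict for one failed connection attempt.

    Hypothetical helper: values are illustrative, not normative.
    """
    return {
        "age": 0,
        "type": "network-error",
        "url": url,
        "body": {
            "sampling_fraction": 1.0,
            "server_ip": server_ip,
            "protocol": "",
            "method": "GET",
            "status_code": 0,  # no HTTP response was received
            "elapsed_time": elapsed_ms,
            "phase": "connection",
            "type": error_type,
        },
    }


report = failed_attempt_report(
    "https://www.example.com/", "2001:db8::1", "tcp.timed_out", 250
)
print(json.dumps(report, indent=2))
```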

The hard part for this and #176 will be to balance utility and privacy. IMHO it's harder to judge than for the original NEL, because "the user wants to connect but cannot" is some motivation to make users participate in network debugging. But if the user does connect with IPv4, why should they provide any kind of privacy-sensitive information so somebody else can debug their IPv6 network? And looking at the utility, if you see only IPv4 connections from a certain area, can't you already deduce that IPv6 might not be working?

LPardue commented 1 month ago

Speaking for the server side: it feels like there's a lot of nuance in the client's decision making. It's not necessarily that IP version X was outright blocked or failed, or that HTTP version Y was blocked or failed, but that the user agent rolled a set of dice for enough rounds to find a combo result that was good enough. The servers can infer the final dice roll leading to a successful connection, but get no insight into the lead-up. In actuality, ignoring such failures could prevent identifying systematic issues, which hurt both client and server.

pmeenan commented 1 month ago

If I recall correctly, NEL is meant to help site owners identify infrastructure issues for the parts of the infrastructure that are under their control that happen at a point in time when they have no visibility into the connection attempt.

It feels like ISP IPv6 issues fall outside of the scope of NEL. Same goes for any nuance in the client's decision making that isn't tied to something under the control of the site (or the CDN they are using), like the contents of the HTTPS DNS records.

clelland commented 1 month ago

Let's discuss this at the next WG call; I think that there are things that we can do here without trying to turn NEL into a general-purpose network troubleshooting tool.

LPardue commented 1 month ago

From my perspective, site owners delegate these details to their IaaS provider. Then they might review request logs or other telemetry and ask "why was this used over that", "is my website config actually being used", etc. Fallbacks, by their nature, mask problems. This can be compounded by multi-CDN setups where configs might differ.

Not providing network error information when there was a network error seems counterintuitive to me :)