LPardue opened this issue 1 year ago
It looks like NEL is on the WebPerfWG agenda this week, should we discuss this issue there?
SGTM let me know if it would help to prepare anything
@LPardue - a short presentation or simply walking folks through the issue and use cases would probably be useful. Thanks! :)
For the record, we would +1 this proposal. We support both h2 and h3 for www.bbc.co.uk & www.bbc.com, and the current set of h2/h3.protocol.error (and h2.ping_failed) doesn't provide us with anything actionable. We do see all of those h2/h3 event reports, so something is going wrong somewhere, but we have no way of knowing how to tackle them.
Discussed on the Web Perf WG call last week, and there seems to be some interest in doing this.
Part of my hope for standardizing the NEL error codes is that we can reach shared agreement about the kind of case each one represents: expected cases (where the error is really no error at all, if that's even worth a report), or more unexpected cases where, for example, the server detects the client making a specific protocol violation and closes the connection with that code.
Today, it's probably an exercise in spelunking Chromium source code and trying to reverse engineer the conditions. It would be good to have codes that are more expressive of a general problem area, which could motivate client or server operators to do some targeted analysis.
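To make "more expressive" concrete, here is a purely illustrative, non-standard sketch of how HTTP/2 error codes (RFC 9113, section 7) could map to finer-grained NEL types than the single bucket reported today; none of these type names exist in any spec or implementation:

```python
# Hypothetical, non-standard sketch: mapping HTTP/2 error codes (RFC 9113,
# section 7) to finer-grained NEL report types. Today most of these surface
# only as the coarse "h2.protocol.error" bucket.
H2_ERROR_TO_NEL_TYPE = {
    0x0: "h2.no_error",              # graceful GOAWAY; arguably not worth a report
    0x1: "h2.protocol_error",
    0x2: "h2.internal_error",
    0x3: "h2.flow_control_error",
    0x5: "h2.stream_closed",
    0x7: "h2.refused_stream",
    0x8: "h2.cancel",
    0xB: "h2.enhance_your_calm",
    0xD: "h2.http_1_1_required",
}

def nel_type_for_h2_error(code: int) -> str:
    """Return a hypothetical fine-grained NEL type for an HTTP/2 error code."""
    return H2_ERROR_TO_NEL_TYPE.get(code, "h2.protocol.error")
```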
@neilstuartcraig can you share any data about how many instances of these errors you see?
@LPardue yep, no problem. Hopefully this'll make sense:
We have a sample rate (`failure_fraction`) set at 5% and have h2 enabled globally; h3 is enabled on our CDN, which serves everywhere outside the UK, as BAU.
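For reference, a policy along those lines is delivered via the `NEL` and `Report-To` response headers; a minimal sketch, assuming a placeholder collector URL:

```python
import json

# Minimal sketch of the response headers behind a 5% failure_fraction NEL
# policy (per the NEL and Reporting API specs). The collector URL below is a
# placeholder, not a real endpoint.
report_to = {
    "group": "nel",
    "max_age": 86400,
    "endpoints": [{"url": "https://nel-collector.example.com/reports"}],
}
nel_policy = {
    "report_to": "nel",
    "max_age": 86400,
    "failure_fraction": 0.05,  # sample 5% of failed requests
    "success_fraction": 0.0,
}

headers = {
    "Report-To": json.dumps(report_to),
    "NEL": json.dumps(nel_policy),
}
print(headers)
```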
Typically, on a normal day, we'll serve in the region of 300-350M web pages on www.bbc.co.uk & www.bbc.com.
Looking at our NEL data for www.bbc.co.uk and www.bbc.com, we see roughly (per day):

- h2.ping_failed
- h2.protocol.error
- h3.protocol.error

I wonder if the difference in h2 and h3 protocol error reports is down to the level of implementation maturity and/or complexity of those protocols. It definitely highlights to me that we'd like to know more about what's going on so we can discuss it with our CDN vendor.
Let me know if you need anything more, we have good data which is easy to access and I'm keen for this so happy to contribute.
Thanks Neil. I took a very brief look at our data and I can't share numbers, but what I do seem to observe is that:

- h2.ping_failed is reported in the `connection` phase, whereas I would expect it to happen in the application phase.
- h3.protocol.error is also reported as a `connection` phase issue. It's not clear if that means an error establishing a QUIC session, or an error in the HTTP/3 layer.

Based on the brief analysis, h2.ping_failed is potentially actionable (depending on who you are, it could highlight a network problem that could be addressed by contacting some NOC). However, if it is just a different format for articulating TCP connectivity issues, then it's duplicative and a distraction. Maybe someone on the client side can speak to what it means when this error happens.
Drilling into the actual h2 or h3 errors seems useful in order to separate benign issues from real problems. We should also probably consider breaking out QUIC transport errors separately, so that HTTP/3 errors are clearly errors in that layer and not anything else.
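As a hypothetical (non-standard) sketch of that separation, the QUIC transport error space (RFC 9000) and the HTTP/3 application error space (RFC 9114) could feed distinct NEL type namespaces; the type names below are invented for illustration:

```python
# Hypothetical, non-standard sketch: keep QUIC transport errors (RFC 9000,
# section 20.1) and HTTP/3 application errors (RFC 9114, section 8.1) in
# clearly separate NEL namespaces instead of one "h3.protocol.error" bucket.
QUIC_TRANSPORT_ERRORS = {
    0x01: "quic.internal_error",
    0x03: "quic.flow_control_error",
    0x06: "quic.final_size_error",
    0x0A: "quic.protocol_violation",
}
H3_APPLICATION_ERRORS = {
    0x0101: "h3.general_protocol_error",
    0x0105: "h3.frame_unexpected",
    0x010B: "h3.request_rejected",
    0x010C: "h3.request_cancelled",
}

def nel_type_for_h3_close(error_code: int, is_transport_close: bool) -> str:
    """Derive a hypothetical NEL type from a QUIC CONNECTION_CLOSE.

    QUIC signals transport-layer and application-layer closes with different
    CONNECTION_CLOSE frame types, so the two error code spaces never need to
    be conflated in a report.
    """
    if is_transport_close:
        return QUIC_TRANSPORT_ERRORS.get(error_code, "quic.transport.error")
    return H3_APPLICATION_ERRORS.get(error_code, "h3.application.error")
```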
This might fit well into the Happy Eyeballs v3 discussions starting up in the IETF (e.g. https://datatracker.ietf.org/doc/html/draft-pauly-v6ops-happy-eyeballs-v3-01). That may be a broader issue of which this is one part.
In the past I tried to improve HTTP/2 and HTTP/3 errors in Chromium (https://chromium-review.googlesource.com/c/chromium/src/+/5400899), and now I'm trying to implement HappyEyeballs v3 in Chromium.
I agree that this could be a part of HEv3 reporting discussion (#175, #176).
HTTP/2 and HTTP/3 have many cases where, after a successful handshake, a request can fail with a variety of error codes. Request streams can be cancelled or abruptly terminated either before a status code is returned by the server, or after a status code is returned but before the response is fully delivered. These are stream errors rather than connection errors, which are subtly different, although it can be useful to observe both.
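As an illustration of that distinction, here are two hypothetical NEL-style report bodies; the `scope` field and the fine-grained `type` values are invented for this sketch and are not part of the current NEL spec:

```python
# Hypothetical sketch: two NEL-style report bodies distinguishing a stream
# error from a connection error. The "scope" field and the fine-grained
# "type" values are invented for illustration only.
stream_error_report = {
    "type": "network-error",
    "url": "https://example.com/asset.js",
    "body": {
        "phase": "application",
        "type": "h3.request_cancelled",  # hypothetical fine-grained type
        "scope": "stream",               # hypothetical: only this request died
        "status_code": 200,              # headers arrived, body never completed
        "protocol": "h3",
    },
}
connection_error_report = {
    "type": "network-error",
    "url": "https://example.com/page",
    "body": {
        "phase": "application",
        "type": "h3.frame_unexpected",   # hypothetical fine-grained type
        "scope": "connection",           # hypothetical: whole connection torn down
        "status_code": 0,
        "protocol": "h3",
    },
}
```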
Conventional access logs or HTTP request logs really struggle to capture these types of failure, meaning website operators do not get good insight into what is happening.
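One way operators can claw back some of that insight today is to aggregate the NEL reports they already receive by phase and type; a minimal collector sketch, assuming the Reporting API's POST of a JSON report array (the port and path are placeholders):

```python
# Minimal sketch of a NEL report collector that tallies reports by
# (phase, type), assuming the browser POSTs a JSON array of reports with
# Content-Type application/reports+json (per the Reporting API).
import json
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

counts: Counter = Counter()

class NelCollector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        reports = json.loads(self.rfile.read(length) or b"[]")
        for report in reports:
            if report.get("type") == "network-error":
                body = report.get("body", {})
                counts[(body.get("phase"), body.get("type"))] += 1
        self.send_response(204)  # report uploads need no response body
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), NelCollector).serve_forever()
```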
Cloudflare's public stats at https://radar.cloudflare.com/adoption-and-usage show that HTTP/2 and HTTP/3 comprise over 90% of all requests to the Cloudflare edge (for "likely human" traffic). The scale of this NEL gap is pretty tremendous by request volume.
NEL provides an opportunity to improve the situation for various stakeholders. Better visibility may help to spot difficult-to-reproduce problems that occur at a low percentage rate. For example, at a meeting during IETF 116 we discussed a recently discovered implementation bug that would be detected as a connection error with code FINAL_SIZE_ERROR. This occurred at an aggregate rate of about 0.001%, but occurred much more often in networks with higher rates of loss. Once found, the root cause of the problem was identified and quickly fixed, but it had been that way for several years.
Related to #119 here and Chrome bug https://bugs.chromium.org/p/chromium/issues/detail?id=1121658