w3c / network-error-logging

Network Error Logging
https://w3c.github.io/network-error-logging/

Proposal: Control delivery of network error types on client side #133

Open rfmoz opened 1 year ago

rfmoz commented 1 year ago

NEL has a custom set of network error types, but no way to filter them.

For example, it would be nice to receive the http.* ones at one URL and the others at a different one.

Let me clarify with an example:

nel: {"report_to":"default","max_age":3600,"filter":"*.*"}
nel: {"report_to":"dns","max_age":3600,"filter":"dns.*"}
nel: {"report_to":"web","max_age":3600,"filter":"http.*"}
report-to: {"group":"default","max_age":3600,"endpoints": [{"url":"https://example.com/default"}]}
report-to: {"group":"dns","max_age":3600,"endpoints": [{"url":"https://example.com/dns"}]}
report-to: {"group":"web","max_age":3600,"endpoints": [{"url":"https://example.com/web"}]}
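The proposal does not say how a client would match an error type against a `filter` value. As a minimal sketch, assuming shell-style glob matching and assuming every matching policy receives the report (both are assumptions, not part of the proposal), the three policies above would route reports like this. `POLICIES` and `matching_groups` are hypothetical names used only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table mirroring the example headers:
# (report_to group, filter pattern).
POLICIES = [
    ("default", "*.*"),
    ("dns", "dns.*"),
    ("web", "http.*"),
]


def matching_groups(error_type: str) -> list:
    """Return every reporting group whose filter pattern matches the error type."""
    return [group for group, pattern in POLICIES if fnmatchcase(error_type, pattern)]


print(matching_groups("dns.name_not_resolved"))  # ['default', 'dns']
print(matching_groups("http.error"))             # ['default', 'web']
print(matching_groups("abandoned"))              # []
```

Under glob semantics, types without a dot such as `abandoned` and `unknown` would not match `*.*`, so a real design would need to say how those are selected.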

Thanks,

Sora2455 commented 1 year ago

Seconded: I stopped collecting NEL errors due to the volume of HTTP error codes that were generated by normal usage (404s etc) and the fact that I almost never got non-web error codes. If I could collect non-web errors (which I can catch with server-side logs) and only non-web errors, I'd very happily use this feature again.

neilstuartcraig commented 1 year ago

Thirded, we'd definitely use this type of configuration granularity. The other useful part of being able to split the config per event type is that we could then have different failure_fractions - right now we run at 5% globally because we'd otherwise get swamped (and it'd be expensive) by noise from the http.* stuff (due to spammy requests from scanners and chancers) but we'd really like to collect all the DNS, TCP and TLS reports.

clelland commented 1 year ago

We should discuss this in a WG call, I think. @neilstuartcraig, would you be interested in attending to present the issues you're encountering, for some context?

neilstuartcraig commented 1 year ago

We should discuss this in a WG call, I think. @neilstuartcraig, would you be interested in attending to present the issues you're encountering, for some context?

Yep, definitely. I can probably gather a bit of useful supporting info if I know enough in advance.

clelland commented 1 year ago

I think that NEL is actually on the agenda this week -- would that give you enough time?

neilstuartcraig commented 1 year ago

Yep, I should be able to pull some info together before then - assuming there are no major events of course! Looking forward to it.

neilstuartcraig commented 1 year ago

We discussed this on last week's Web Perf call (thanks for that, it was great to talk it through) and there were a few questions/comments/thoughts which I'll do my best to note below.

For completeness, the use case I put forward is in our (BBC) current live instance of NEL - essentially we have to manage costs as there's a direct per-report cost impact - so we use sampling via failure_fraction to do this at 5% across all our sites on www.bbc.co.uk & www.bbc.com, but some sites are a lot busier than others and we also have busier audience countries. This, coupled with us seeing ~90% of NEL reports being abandoned or unknown (which we do want to know about but are less specific and thus harder to directly action than other event types), means that we end up receiving far fewer reports for some definite incidents than we'd like - and we can't simply crank up the failure_fraction to account for this as our costs would spiral. So we'd like to downsample abandoned and unknown to better balance the event types and get better value and coverage across sites/audience countries from NEL.

Here's the questions (Q) / comments (C) / notes (N) I noted & a couple more I thought of afterwards:

C1: We (BBC) would definitely want to retain failure_fraction (& I guess for completeness also success_fraction - though we don't use that) in filtered NEL policy definitions.

N1: This proposal would not affect Reporting API policies, it only affects NEL

Q1: What should the behaviour be if a NEL policy is defined which only covers some NEL event classes/types? Should the other event types be ignored? (consider both existing NEL event classes/types and also any which may be added in future)

Q2: (I thought of this afterwards) Do we actually need to filter as specifically as <event class>.<event type>? I wonder if we could just filter on event_class (e.g. dns, tcp, tls...) which I presume would be less burdensome on implementers and I think would make filtered policies less brittle in that we're much less likely to see new event classes added than we are event types - for instance, there's a request to add more event types to the h2 and h3 event classes in #134 so anyone who defines a filtered NEL policy prior to those being added, assuming they are added, may unintentionally miss out on receiving those reports.

Also, @rfmoz - it would be useful for the discourse if you could note your use case for filtered NEL policies.

I hope that helps and covers everything - did I miss anything you can think of @yoavweiss? Cheers!

neilstuartcraig commented 1 year ago

...and to provide answers to the Qs from our PoV:

Q1: I would expect any NEL event classes/types which are not covered by defined NEL policies to be dropped/ignored (not reported) by the client. The obvious way around that could be to define specific filtered NEL policies alongside a general unfiltered policy.

Q2: We would only need to filter at event class level, we don't need anything as specific as per-event type filtering.
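To make the two answers concrete, here is a minimal sketch of class-level selection: the event class is taken as the text before the first dot, a class-specific policy wins if one exists, an optional general unfiltered policy is the fallback, and anything else is dropped. The names `event_class`, `failure_fraction_for` and `CLASS_POLICIES` are hypothetical, and the precedence order is an assumption rather than something agreed in this thread.

```python
from typing import Dict, Optional


def event_class(error_type: str) -> str:
    """'dns.name_not_resolved' -> 'dns'; undotted types such as 'abandoned' map to themselves."""
    return error_type.split(".", 1)[0]


def failure_fraction_for(
    error_type: str,
    class_policies: Dict[str, float],
    general: Optional[float] = None,
) -> Optional[float]:
    """Return the sampling fraction to apply, or None if the report is dropped."""
    cls = event_class(error_type)
    if cls in class_policies:
        return class_policies[cls]  # class-specific policy takes precedence (assumption)
    return general  # general unfiltered policy if defined, otherwise drop


# Hypothetical configuration: collect every dns, tcp and tls failure.
CLASS_POLICIES = {"dns": 1.0, "tcp": 1.0, "tls": 1.0}

print(failure_fraction_for("tls.cert.name_invalid", CLASS_POLICIES))     # 1.0
print(failure_fraction_for("http.error", CLASS_POLICIES))                # None
print(failure_fraction_for("http.error", CLASS_POLICIES, general=0.05))  # 0.05
```

With no general policy, http reports are never sent; adding a general policy at 5% keeps them at the existing global sampling rate while the listed classes are reported in full.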

nicjansma commented 1 year ago

W3C WebPerf April 27 2023 call minutes for reference: https://docs.google.com/document/d/1QyEQA-3LCuORl2Xy9aHbn7Ja4Uy3dDouINM_vpSoSOQ/edit?usp=sharing

rfmoz commented 1 year ago

@neilstuartcraig I agree with you, it would be enough to filter by event class.

The reference in the example tries to catch all the available cases, event class + type, but it's true that working only with the event class could cover most situations and simplify implementation.

In our specific case, we are using a third-party provider to process the reports. We pay by the number of reports, and around 99% are produced by the http class. That http information is already covered by other providers/integrations on the website, so it is useless to us; only about 1% of the NEL reports are valuable.

If we had the option to discard them, that would mean a huge decrease in volume and in the associated processing cost, and it would also give us the chance to raise the failure_fraction from "0.1" to "1".
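A rough worked example of that trade-off, using the 99% http share and the 0.1 and 1 fractions quoted above. The total of 1,000,000 failures is a made-up figure for illustration only, and `expected_reports` is a hypothetical helper.

```python
from fractions import Fraction


def expected_reports(failures: int, http_share: Fraction,
                     failure_fraction: Fraction, drop_http: bool):
    """Return (total expected reports, expected non-http reports) as integers."""
    non_http = failures * (1 - http_share)
    http = 0 if drop_http else failures * http_share
    total = (http + non_http) * failure_fraction
    return int(total), int(non_http * failure_fraction)


FAILURES = 1_000_000            # hypothetical number of failed requests
HTTP_SHARE = Fraction(99, 100)  # "around 99% are produced by class http"

current = expected_reports(FAILURES, HTTP_SHARE, Fraction(1, 10), drop_http=False)
proposed = expected_reports(FAILURES, HTTP_SHARE, Fraction(1), drop_http=True)

print(current)   # (100000, 1000)
print(proposed)  # (10000, 10000)
```

Under these assumptions, dropping the http class and sampling everything else at 1 sends a tenth of the reports while delivering ten times as many of the non-http ones.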

This is an updated example using only the event class:

nel: {"report_to":"default","max_age":3600,"filter":"*"}
nel: {"report_to":"web","max_age":3600,"filter":"http"}
report-to: {"group":"default","max_age":3600,"endpoints": [{"url":"https://example.com/default"}]}
report-to: {"group":"web","max_age":3600,"endpoints": [{"url":"null"}]}

Thanks

neilstuartcraig commented 1 year ago

Thanks for the info @rfmoz - that's great. A couple of follow-ups:

Can I just confirm that you don't have a use case for sending different event classes/types to different reporting endpoints?

Do you have an opinion on what to do if a page has a NEL configuration which includes only a subset of event classes in the filter? IMO it'd be logical for the client/browser to drop/not send any event classes/types which aren't defined - keen to know if you differ or agree on that.

Overall, it sounds very much like we have essentially the same use case - wanting to better balance the report split over event classes by reducing how many http and abandoned reports we receive.

My last wondering is around syntax - I wonder if rather than filter it might be clearer if the syntax used e.g. include or similar - filter might imply that the stated event classes are filtered out. I guess we could maybe have include and/or exclude.

Critically, I would also want to be able to include multiple event classes so we don't have to send loads of NEL headers to downsample http and abandoned.

rfmoz commented 1 year ago

Can I just confirm that you don't have a use case for sending different event classes/types to different reporting endpoints?

You're right, currently I don't have a use case for sending different event classes/types to different endpoints.

Do you have an opinion on what to do if a page has a NEL configuration which includes only a subset of event classes in the filter?

I agree with you: avoid sending any event classes/types which aren't defined.

Overall, it sounds very much like we have essentially the same use case - wanting to better balance the report split over event classes by reducing how many http and abandoned reports we receive.

We are on the same page: as the reports cover a wide range of scenarios, we would like to receive only the ones useful for our situation, controlling them at the origin.

My last wondering is around syntax - I wonder if rather than filter it might be clearer if the syntax used e.g. include or similar - filter might imply that the stated event classes are filtered out. I guess we could maybe have include and/or exclude.

Great, filter was just the first word I thought of. include looks like a better fit in this context.

Critically, I would also want to be able to include multiple event classes so we don't have to send loads of NEL headers to downsample http and abandoned.

If that fits well with the implementation, I completely agree too.

So, the updated code example, using the include keyword and listing a few classes in one header:

nel: {"report_to":"default","max_age":3600,"include":"dns,tcp,tls"}
report-to: {"group":"default","max_age":3600,"endpoints": [{"url":"https://example.com/default"}]}
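As a sketch of how a client might read this header, assuming `include` is a comma-separated list of event classes and that a policy without `include` keeps today's behaviour of reporting every type (both assumptions; `include` is only a proposed member, and `should_report` is a hypothetical helper). Sampling via failure_fraction is left out for brevity.

```python
import json


def should_report(nel_header_value: str, error_type: str) -> bool:
    """Decide whether an error type is eligible for reporting under the policy."""
    policy = json.loads(nel_header_value)
    include = policy.get("include")
    if include is None:
        return True  # no include member: every type stays eligible (assumption)
    classes = {name.strip() for name in include.split(",") if name.strip()}
    # The event class is the part of the type before the first dot.
    return error_type.split(".", 1)[0] in classes


header = '{"report_to":"default","max_age":3600,"include":"dns,tcp,tls"}'

print(should_report(header, "tcp.timed_out"))  # True
print(should_report(header, "http.error"))     # False
print(should_report(header, "abandoned"))      # False
```

Because matching is on the class only, new event types added to an included class later would be picked up without changing the header.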