Rate limiting - Githubissues

danmx commented 7 years ago

Endpoint should be able to set rate limiting for reporting, e.g. by setting a percentage 0-100. This would reduce the number of duplicated reports and the risk of DDoSing an endpoint when users of a high-traffic site start sending reports.

My proposal: before sending a report, the browser draws a random number from 1 to 100. If the number is less than or equal to the rate limit setting (0-100), it sends the report.
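
A minimal sketch of the draw described in this proposal, assuming an integer setting from 0 to 100. The function name `should_send_report` and the optional `rng` parameter are hypothetical and only there for illustration; nothing here is part of any spec.

```python
import random
from typing import Optional


def should_send_report(rate_limit_percent: int,
                       rng: Optional[random.Random] = None) -> bool:
    """Decide whether to send one report under a 0-100 percentage setting."""
    if not 0 <= rate_limit_percent <= 100:
        raise ValueError("rate limit must be between 0 and 100")
    rng = rng if rng is not None else random.Random()
    draw = rng.randint(1, 100)  # inclusive on both ends: 1..100
    # A setting of 0 never sends (the draw is at least 1);
    # a setting of 100 always sends (the draw is at most 100).
    return draw <= rate_limit_percent
```

With this rule a setting of 25 sends roughly one report in four, and the two extremes behave as "never" and "always".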

juliatuttle commented 7 years ago

I'd be tentatively in favor of this, but I'd prefer a float (e.g. 0.0 is no reports, 1.0 is all reports, 0.001 is 1 in 1000 reports).

On Fri, Aug 4, 2017, 10:14 Dan notifications@github.com wrote:

> Endpoint should be able to set rate limiting for reporting, e.g. by setting a percentage 0-100. This would reduce the number of duplicated reports and the risk of DDoSing an endpoint when users of a high-traffic site start sending reports.
>
> My proposal: before sending a report, the browser draws a random number from 1 to 100. If the number is less than or equal to the rate limit setting (0-100), it sends the report.

patrickkettner commented 7 years ago

@mikewest what is the upper end for the number of reports triggered in the CSP deployments you've seen? A lot of the conversation @RByers and I had around filtering and limiting is based on the idea that we are expecting a low double-digit number of reports. If that holds, thinking about rate limiting may be premature, at least for now. Since deprecation, intervention, and crash reports are unlikely to trigger more than a handful on a page, the only thing I am worried about here is CSP reporting. Thoughts?

mikewest commented 7 years ago

@scotthelme or @arturjanc's team may have public numbers?

I vaguely recall complaints inside Google when folks were first starting to roll out CSP that some apps were seeing reports in the same order of magnitude as their page views and had to start sampling manually (by leaving off report-uri directives). But I don't have those emails anymore, and I might be way off in my memory.

I think this seems like a reasonable thing to add in general, but I don't feel like it needs to be part of a v1.

patrickkettner commented 7 years ago

> I think this seems like a reasonable thing to add in general, but I don't feel like it needs to be part of a v1.

+1. Creating a new v2 tag to track these sorts of things.

ScottHelme commented 7 years ago

As the recipient of billions of reports per month on https://report-uri.com I'm always interested in ways users can control the flow rate of reports but I have my reservations about a mechanism built into the report API like this.

As @mikewest mentioned the best way to currently do this, and the way I recommend, is to inject the report-uri directive into your policy for a subset of responses. There are many better ways to do this but as always it's browser support that prevents them from being useful.
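
A rough sketch of that server-side approach, assuming the server builds the CSP header per response. The helper name `build_csp_header`, the `report_fraction` parameter, and the collector URL in the usage lines are all hypothetical.

```python
import random
from typing import Optional


def build_csp_header(policy: str, report_uri: str, report_fraction: float,
                     rng: Optional[random.Random] = None) -> str:
    """Return the CSP header value, adding report-uri for a subset of responses.

    The policy itself is always sent, so enforcement is unaffected; only the
    reporting directive is sampled.
    """
    if not 0.0 <= report_fraction <= 1.0:
        raise ValueError("report_fraction must be between 0.0 and 1.0")
    rng = rng if rng is not None else random.Random()
    # rng.random() is in [0.0, 1.0), so 0.0 never reports and 1.0 always does.
    if rng.random() < report_fraction:
        return f"{policy}; report-uri {report_uri}"
    return policy


# Hypothetical usage: about 10% of responses ask the browser to report.
header = build_csp_header("default-src 'self'",
                          "https://collector.example/csp", 0.1)
print("Content-Security-Policy:", header)
```

The cost of this approach is that the header is no longer a static value, which matters for sites that serve it from a CDN or a fixed configuration.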

The 'Backoff' header was one way I'd have liked the reporting endpoint to be able to control this. It could be returned alongside a 429 for dynamic control of reporting volumes but isn't supported.

Another way to do this would be to catch the SecurityPolicyViolation event and handle it manually, something I've been testing extensively. The problem again is that only Chrome implements the interface but you could quite easily place the logic to down-sample reports there too. The benefit of this is that rather than being a random down-sample you could actually filter on the client too, further increasing the value of reports sent by reducing noise.

Overall I'm not opposed to the idea suggested here or to the overall idea of a mechanism to rate limit reports, but unless it ends up widely supported by all browsers it won't be very effective. That's why omitting the report-uri directive in your policy is still the best way to achieve this: it's reliable and supported by all browsers.

dcreager commented 6 years ago

NEL includes a sampling rate like the one discussed here (as of this patch). Two sampling rates, in fact, so that you can separately limit the reports about failures and successes. This mechanism has worked really well in NEL's predecessor, which is why we wanted to make sure it was in the NEL spec from the beginning, and a required part of a conforming implementation.

In the WebPerf WG call today, someone asked whether NEL's sampling rate should be moved into the Reporting API, so other Reporting-dependent specs wouldn't have to reinvent the wheel.

dcreager commented 6 years ago

Opened up a related issue on the NEL side (nel#71). The fact that NEL lets you provide two sampling rates (one for successes, one for failures) would complicate adding sampling rates to Reporting.

My preference would be to keep Reporting simple, and have a single sampling rate for each endpoint group. That means on the NEL side, instead of providing separate sampling rates for successes vs failures, you'd specify different Reporting endpoint groups for each. That's the simplest separation of concerns, though it does mean that you'd have to duplicate a lot of information across those two endpoint groups. My initial hunch is that that's still the right tradeoff, but I'd love to hear other opinions.

dcreager commented 6 years ago

@juliatuttle expressed concern about the size of the Report-To header if you have to duplicate endpoint groups just to be able to set different sampling rates, especially if you follow the advice of having several endpoints in each group for failover purposes.

Right now, it sounds like there are three options on the table:

1. No change to Reporting. If a spec wants sampling of reports, it's up to you to define and implement that on a case-by-case basis. The current draft of NEL has some language that you could copy for this purpose.

2. Add sampling-fraction as an optional property of each endpoint group, with a default value of 1.0. If a spec needs separate sampling rates for different kinds of reports, you have to define separate endpoint groups, duplicating the endpoint URLs as a result. Your Report-To header value would look like:

```
[{"name": "nel-success-group",
 "sampling-fraction": 0.1,
 "endpoints": [
   { "url": "http://example.com/nel", "priority": 1 },
   { "url": "http://backup.com/nel", "priority": 2 }
 ]},
{"name": "nel-failure-group",
 "sampling-fraction": 1.0,
 "endpoints": [
   { "url": "http://example.com/nel", "priority": 1 },
   { "url": "http://backup.com/nel", "priority": 2 }
 ]}]
```

This is the simplest change to Reporting, but comes at a cost in response header size.

3. Build on PR #67 (which makes endpoint groups the top-level concept in the Report-To header values), and add a new option-sets field to it. That would let you define sampling rates (as well as any other future per-endpoint-group options) independently of the set of endpoint URLs in the group. Your Report-To header value would look like:

```
[{"name": "nel-group",
 "endpoints": [
   { "url": "http://example.com/nel", "priority": 1 },
   { "url": "http://backup.com/nel", "priority": 2 }
 ],
 "option-sets": [
   { "name": "successes", "sampling-fraction": 0.1 },
   { "name": "failures", "sampling-fraction": 1.0 }
 ]}]
```

We'd update NEL to specify a separate option set for successes and failures, which is where it would get the sampling rates from. This is a more invasive change, but would have the smallest effect on response header size.
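
To put a rough number on the header-size tradeoff between options 2 and 3, the two example values can be serialized compactly and compared. This is only a sketch: it reuses the two example endpoints from the snippets above, and the compact JSON encoding is an assumption about how a server might emit the header.

```python
import json

# The two example endpoints used in both header values above.
ENDPOINTS = [
    {"url": "http://example.com/nel", "priority": 1},
    {"url": "http://backup.com/nel", "priority": 2},
]

# Option 2: one endpoint group per sampling rate, endpoints duplicated.
option_2 = [
    {"name": "nel-success-group", "sampling-fraction": 0.1, "endpoints": ENDPOINTS},
    {"name": "nel-failure-group", "sampling-fraction": 1.0, "endpoints": ENDPOINTS},
]

# Option 3: a single group with named option sets.
option_3 = [
    {
        "name": "nel-group",
        "endpoints": ENDPOINTS,
        "option-sets": [
            {"name": "successes", "sampling-fraction": 0.1},
            {"name": "failures", "sampling-fraction": 1.0},
        ],
    },
]


def header_size(value) -> int:
    """Size in bytes of the value serialized as compact JSON."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


size_2 = header_size(option_2)
size_3 = header_size(option_3)
print(f"option 2: {size_2} bytes, option 3: {size_3} bytes")
```

The gap grows with the number of endpoints per group, since option 2 repeats the whole endpoint list once per sampling rate.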

Does this sound like an accurate summary? Are there any strong opinions for or against any of the options?

patrickkettner commented 6 years ago

I believe option 2 is the best, as various endpoints may have different tolerances for rate limiting.

igrigorik commented 6 years ago

Option 3 introduces strong coupling between Reporting and upstream specs: you have to know the name of the report type and relevant parameters that apply to it. In case of NEL you'll now need to provide NEL specific configuration to both the NEL and Report-To headers — this is confusing and something I'd love to avoid.

Option 2 is based on the premise that sampling rate should be a first-class feature in Reporting. I don't have operational experience in this regard, but this does seem like a reasonable feature in light of the experiences that Mike highlighted above, and since this is our chance to spec the behavior we expect from browsers, it might be worth it.

On the other hand, I can also see a desire to have different sampling rates for each type of report, which hints to me that this should actually be an option associated with the report generator, not the downstream reporting endpoint? Specifying different endpoints to control sampling seems like a backwards way to go about it. For example, for NEL we can expose config options in the header and have the NEL logic apply sampling before it calls out to the Reporting API. Ditto for other APIs: if CSP wants sampling, expose a CSP-specific config option for it?

dcreager commented 6 years ago

I might have hand-waved too much in my description of option 3. I was thinking of it purely as a syntactic refactoring of option 2, so that you don't have to repeat the content of the URL list so many times. It's not that NEL and friends would decide which keys are allowed in option-sets; that would still be entirely defined by Reporting, and (so far as of this draft) would only contain sampling-fraction. This would couple NEL and Reporting only in that if you wanted separate sampling rates for NEL successes and failures, you'd have to place two entries into option-sets, but it wouldn't change the syntactic definition of option-sets.

It's the same with option 2: if you wanted separate rates, you'd have to create two endpoint groups.

All of that said, I think I like option 1 the most the more I think of it. We'd add a fair bit of complexity to Reporting to factor out something that's not really that complex: configured rate is a double between 0 and 1, and for each report, roll a die to decide whether to report it or not.
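
The die roll described here is small enough to sketch in a few lines. The function name `keep_report` and the seeded generator in the usage lines are assumptions for illustration only.

```python
import random
from typing import Optional


def keep_report(sampling_fraction: float,
                rng: Optional[random.Random] = None) -> bool:
    """Roll a die for one report: 0.0 keeps none, 1.0 keeps all."""
    if not 0.0 <= sampling_fraction <= 1.0:
        raise ValueError("sampling fraction must be between 0.0 and 1.0")
    rng = rng if rng is not None else random.Random()
    # random() returns a float in [0.0, 1.0), so the comparison is strict.
    return rng.random() < sampling_fraction


# Rough check: a fraction of 0.001 should keep about 1 report in 1000.
rng = random.Random(42)
kept = sum(keep_report(0.001, rng) for _ in range(100_000))
print(f"kept {kept} of 100000 reports")
```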

(The only wrinkle is that if NEL has the sampling rate, and doesn't send reports to Reporting if the die roll fails, then the ReportingObserver would never have a chance to see those reports.)

dcreager commented 6 years ago

> (The only wrinkle is that if NEL has the sampling rate, and doesn't send reports to Reporting if the die roll fails, then the ReportingObserver would never have a chance to see those reports.)

The counter-argument to this being that if you're using a ReportingObserver, and want to see every report, then set your sampling rate to 1. And if needed, roll another die in your observer (or implement some more complex rate limiting) if you still want to limit what's uploaded to the collector.

dcreager commented 6 years ago

> I believe option 2 is the best, as various endpoints may have different tolerances for rate limiting.

If different endpoints have different limits on what load they can handle, I think weights (proposed in #39) are a better approach. Those are per-endpoint. For rate limiting, the same sampling rate would apply to all of the endpoints in the group (for both option 2 and option 3).

igrigorik commented 6 years ago

> The counter-argument to this being that if you're using a ReportingObserver, and want to see every report, then set your sampling rate to 1. And if needed, roll another die in your observer (or implement some more complex rate limiting) if you still want to limit what's uploaded to the collector.

It's also not entirely unreasonable to say that ReportingObserver always sees all reports that are eligible for JS delivery, and that those are not subject to sampling. For example, we can add a flag to the "queue the report" step that indicates whether it should be uploaded, and that can be used to implement sampling. On the other hand, we also need an inverse flag for whether the report should be made visible to ReportingObserver, so there's some symmetry here. WDYT?
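
One way to picture the two flags is a toy queue where delivery to observers and upload are decided independently. This is only an illustration of the idea, not the spec's algorithm; the class `ReportQueue` and the flag names `upload` and `observable` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Report = Dict[str, object]


@dataclass
class ReportQueue:
    """Toy model: observers and the upload cache are fed independently."""
    observers: List[Callable[[Report], None]] = field(default_factory=list)
    upload_cache: List[Report] = field(default_factory=list)

    def queue_report(self, report: Report, *, upload: bool = True,
                     observable: bool = True) -> None:
        # Observers are notified regardless of the upload decision.
        if observable:
            for observer in self.observers:
                observer(report)
        # Only reports that passed sampling are kept for upload.
        if upload:
            self.upload_cache.append(report)


seen: List[Report] = []
queue = ReportQueue(observers=[seen.append])

# A report that lost the sampling roll: visible to script, never uploaded.
queue.queue_report({"type": "csp-violation"}, upload=False)
print(len(seen), len(queue.upload_cache))
```

Under this model the caller rolls the die and passes the result in as `upload`, so sampling never hides a report from page script.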

nicjansma commented 5 years ago

I think it could be useful to have either option 1 (each spec defines a way to sample) or option 2 (sampling-fraction per endpoint group in the Reporting API).

We like that NEL already has sampling built in. It seems to make sense for NEL to allow you to specify the sampling rate in its header directly during "registration", since you can't later apply a sampling rate at the time of the report (as NEL means the visitor can't reach your site).

Should the other specs look towards implementing option 1? CSP has the potential for a large number of reports per page and for flooding (versus deprecations/interventions/crashes). Feature Policy Reporting also has some interesting things on the horizon with unoptimized-image related policies that could be noisy as well.

Taking CSP as an example, from a usability point of view, I think it's easier for the website (or CDN) to include a report-group-fraction: 0.5 clause directly into all CSP headers, rather than only including the CSP header (or that clause in the header) 50% of the time. For example, without sampling built into CSP, that means you'd be omitting the report-to: portion of the CSP header 50% of the time, as you can't omit the CSP header entirely 50% of the time (as it would turn CSP off). In other words, it's a lot easier for a website/CDN to send one static constant header 100% of the time than 2 (or more) different versions of the same header at a different rate just to enable sampling.

For Feature Policy Reporting, not having a sampling rate is a little awkward as well, since you're either applying Feature Policies via the Feature-Policy[-Report-Only] header or you aren't applying the policies at all. Based on today's spec, if you're applying the Feature Policies you're also committed to reporting on them to the Reporting API, with no way to sample them (without turning FP off entirely).

igrigorik commented 5 years ago

@nicjansma I like the idea of sampling per group (i.e. option 2 you outlined above), as it allows us to abstract this common functionality across all the upstream report generators and enables site owners to group "noisy" generators into the same policy buckets.

@dcreager @clelland wdyt?

clelland commented 5 years ago

Rate limiting seems like enough of a cross-cutting concern that building it into Reporting makes sense. As long as it doesn't stop other specs (such as NEL) from extending it if necessary, I'd support adding the syntax for option 2 here. Centralizing it also means that if we want to implement any further limits (maybe a hard cap on total reports sent to an endpoint per page load), we would have an obvious place to make that change too.

(I suspect that the sampling rates in NEL will just become a multiplicative factor in the final rate, if we go this route, but you could take advantage of that to say something like "success_fraction": 0.05, "error_fraction": 1.0 and have those be relative to the overall reporting sampling rate)
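
As a worked example of the multiplicative reading: the 0.05 and 1.0 fractions come from the comment above, while the overall group rate of 0.5 and the function name `effective_rate` are assumptions made up for the arithmetic.

```python
def effective_rate(group_rate: float, relative_fraction: float) -> float:
    """Final sampling rate when a spec-level fraction scales the group rate."""
    for value in (group_rate, relative_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must be between 0.0 and 1.0")
    return group_rate * relative_fraction


GROUP_RATE = 0.5  # assumed overall Reporting sampling rate for the group

success_rate = effective_rate(GROUP_RATE, 0.05)  # "success_fraction": 0.05
error_rate = effective_rate(GROUP_RATE, 1.0)     # "error_fraction": 1.0
print(f"successes: {success_rate:.3f}, errors: {error_rate:.3f}")
```

In this reading the NEL fractions can only lower the rate set on the endpoint group; they cannot raise it above the group's own value.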

As an aside, Feature policy should probably adopt something like the report-to directive; as @nicjansma says, using it at all when a Report-To header is present means that you will start triggering reports.

dcreager commented 5 years ago

I've come around to option 2 as well, since the comments I wrote way back upthread. I think there's a small tweak we can make that makes it easy to support NEL's use case without over-complicating the other downstream specs:

1. Update the queue a report algorithm (which is the hook that downstream specs use) to take in an optional sampling rate parameter.
2. Add sampling_rate to endpoint groups as we've discussed. Treat it as the default value for the new sampling rate parameter.

So the easy case is "add sampling_rate to your endpoint groups", and all downstream specs would get sampling for free. And the new per-report sampling rate parameter gives us what we need to allow NEL to customize sampling rates for successes and failures. (E.g., NEL would pass in the success or failure fraction from the NEL policy as the value of that parameter. CSP would ignore the parameter, and get the sampling rate from the sampling_rate field of the Reporting policy.)
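
A small sketch of how that default-plus-override might behave, assuming the per-report parameter replaces the group's value rather than multiplying it. The names `EndpointGroup`, `resolve_sampling_rate`, and `queue_report` are hypothetical stand-ins, not the spec's algorithm.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EndpointGroup:
    name: str
    sampling_rate: float = 1.0  # assumed default: report everything


def resolve_sampling_rate(group: EndpointGroup,
                          override: Optional[float] = None) -> float:
    """Use the per-report rate if given, else the group's sampling_rate."""
    return group.sampling_rate if override is None else override


def queue_report(group: EndpointGroup, report: Dict[str, object],
                 cache: List[Dict[str, object]],
                 sampling_rate: Optional[float] = None,
                 rng: Optional[random.Random] = None) -> bool:
    """Append the report to the cache if it passes the sampling roll."""
    rate = resolve_sampling_rate(group, sampling_rate)
    rng = rng if rng is not None else random.Random()
    if rng.random() < rate:
        cache.append(report)
        return True
    return False


cache: List[Dict[str, object]] = []
group = EndpointGroup("default", sampling_rate=0.0)

# A CSP-style caller passes nothing and gets the group's rate (0.0 here).
queue_report(group, {"type": "csp-violation"}, cache)
# A NEL-style caller passes the failure fraction from its own policy.
queue_report(group, {"type": "network-error"}, cache, sampling_rate=1.0)
print(cache)
```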

Thoughts?

clelland commented 5 years ago

I like the idea of making it a default, and allowing it to be overridden, as long as it's the developer overriding it, and not some less obvious spec behaviour. So, the default value would be 1.0, I'm presuming. And NEL would override that, rather than multiplying it.

Would that parameter affect step 7 (notify reporting observers) or just step 5 (append to the report cache)?

w3c / reporting

Rate limiting #45