In this case the aggregators no longer need to worry about min_batch_duration. The leader can store the last known Tm to filter out late arrivals if needed.
This seems similar to my suggestion in #218 of having a batch_filter to allow more flexible specification of subsets of reports.
Yes, batch_filter sounds like a more generic name that can incorporate both batch interval and batch id.
But I think allowing an advanced filter can open up privacy concerns. When slicing the batches, one not only has to make sure the sliced batch meets all privacy guarantees, but also that all the deltas with previously collected slices do. We may find a way to do this with time intervals, but adding other metadata like region will make it extremely hard. Such metadata will likely be related to client measurements, so I think it's better to encode that information in the measurement, or to design different tasks for it.
There are use cases where drill-down or fine-grained filtering is not needed; users simply want aggregate results with a privacy guarantee.
Agreed, previous discussion on this topic is mainly here: #195
I think this would be useful. Let me check if I understand the protocol changes that are required:
For batch-based tasks, an aggregate result corresponds to a unique set of exactly min_batch_size measurements. Reports are sorted into batches by the leader.
I think exact min_batch_size may be too strong; >= should be sufficient, so implementations can control how close to min_batch_size they want to be.
Per discussion in IETF 114, let's aim to support this in the next version of the draft, with a simpler (but equivalent) version of the existing query validation.
Looking at the implementation strategy for #297, I think there is a different approach which would be less disruptive to the existing design, provide better unification between the different query types, allow better precision w.r.t. the number of reports per batch in the new fixed-size case, add less complexity to the helper, and give the leader less power to choose how reports are mapped into batches. (I suggested this in a review comment on #297, but @simon-friedberger correctly pointed out that this discussion would be less confusing in a GH issue.)
The high-level idea is: the current approach in #297 for fixed-size tasks is to bind aggregation jobs to batches (identified by an opaque "batch ID") at the time of creation of the aggregation job. Instead, we could identify batches in fixed-size tasks by a batch interval (as is currently done with time-interval task types), dropping the concept of an opaque batch ID entirely. Another way of stating this idea is that, rather than introducing a new query type, we view this new feature as choosing how batch intervals are decided: time-interval generates batch intervals on a fixed cadence, fixed-size generates batch intervals dynamically based on the rate of incoming reports.
Much of the protocol could be shared between the fixed-size & time-interval task types. The largest difference would be in batch validation: time-interval batch intervals would be considered valid if the interval's start & duration are multiples of min_batch_duration (i.e. same as today), while fixed-size batch intervals would be considered valid if they contain exactly[1] min_batch_size reports. (This is one part of the suggested design that adds complexity to the helper -- both the leader & the helper would need to implement this batch validation logic; a rough sketch appears just below.)
It would need to be worked out whether a leader can generate aggregation jobs that are guaranteed to neither underflow min_batch_size nor overflow max_batch_size without having to stop generating aggregation jobs at batch boundaries until enough existing aggregation jobs complete to confirm that the batch won't be too small. (This doesn't change how a helper would operate; all the helper needs to do is verify that the chosen batches are of the correct size, as suggested in the previous point.)
The collector would need some way to learn which batch intervals to collect for fixed-size tasks, since the batch intervals are based on the rate of incoming reports, which the collector does not have insight into. In the interest of keeping the task types unified, this could also be implemented for time-interval task types. (All of this complexity falls onto the leader only.)
I touch on a few more points in the original PR comment, but this captures the big idea.
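To make the batch-validation difference concrete, here is a minimal sketch of the check described above; the type and field names are illustrative rather than taken from the draft:

```rust
// Illustrative sketch of the batch-validation rule discussed above.
// All type and field names are made up, not taken from the DAP draft.

struct TimeInterval {
    start: u64,    // seconds since epoch
    duration: u64, // seconds
}

enum QueryType {
    TimeInterval { min_batch_duration: u64 },
    FixedSize { min_batch_size: usize },
}

/// Returns true if the batch identified by `interval`, containing
/// `report_count` reports, is valid for the given query type.
fn batch_is_valid(query: &QueryType, interval: &TimeInterval, report_count: usize) -> bool {
    match query {
        // time-interval: start & duration must be multiples of
        // min_batch_duration (same as today).
        QueryType::TimeInterval { min_batch_duration } => {
            let d = *min_batch_duration;
            interval.duration > 0
                && interval.start % d == 0
                && interval.duration % d == 0
        }
        // fixed-size: the interval must contain exactly min_batch_size
        // reports (both leader and helper would run this check).
        QueryType::FixedSize { min_batch_size } => report_count == *min_batch_size,
    }
}

fn main() {
    let ti = QueryType::TimeInterval { min_batch_duration: 3600 };
    let fs = QueryType::FixedSize { min_batch_size: 100 };
    let interval = TimeInterval { start: 7200, duration: 3600 };
    assert!(batch_is_valid(&ti, &interval, 250)); // aligned interval, any report count
    assert!(batch_is_valid(&fs, &interval, 100)); // exactly min_batch_size reports
    assert!(!batch_is_valid(&fs, &interval, 99)); // too few reports
}
```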
[1] #297 includes min_batch_duration and max_batch_duration to provide the leader some leeway in producing acceptably-sized batches. I think this is no longer required, as the leader can always choose an interval to contain an exact number of reports (assuming we include a way to break timestamp ties, which is possible but probably a little messy).
[2] Somewhat off-topic from the issue at hand, but I think allowing collect requests to be made before a batch interval is complete is valuable for performance reasons, too. Tasks using a VDAF with a trivial (i.e. ()) aggregation parameter can incrementally aggregate reports as they arrive. This isn't currently possible for tasks with a nontrivial aggregation parameter, since the aggregation parameter to use is not communicated to the aggregators until the collect request is made. But if the collect request arrives before the batch is complete, we can start incrementally aggregating reports at that point. This, of course, assumes that the collector knows the aggregation parameter before the batch interval is complete -- seems like a fair assumption, but may not always be true?
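As an illustration of the incremental-aggregation point in [2], here is a sketch of how an aggregator might buffer report shares until the aggregation parameter is known and then switch to incremental aggregation; all types and functions here are hypothetical stand-ins, not a real VDAF API:

```rust
// Hypothetical sketch: buffer reports until the aggregation parameter is
// known, then fold them (and all later arrivals) into a running aggregate.

enum BatchState {
    // Aggregation parameter not yet known: buffer raw report shares.
    Buffering(Vec<Vec<u8>>),
    // Parameter known (e.g. the trivial () parameter): keep a running aggregate.
    Aggregating { agg_param: Vec<u8>, running_aggregate: u64 },
}

// Stand-in for running VDAF preparation/aggregation on a single report share.
fn aggregate_one(_agg_param: &[u8], report_share: &[u8]) -> u64 {
    report_share.len() as u64
}

fn on_report_arrival(state: &mut BatchState, report_share: Vec<u8>) {
    match state {
        BatchState::Buffering(buf) => buf.push(report_share),
        BatchState::Aggregating { agg_param, running_aggregate } => {
            // Parameter already known: fold the report in immediately.
            *running_aggregate += aggregate_one(agg_param.as_slice(), &report_share);
        }
    }
}

fn on_collect_request(state: &mut BatchState, agg_param: Vec<u8>) {
    // Once the collect request supplies the parameter, drain the buffer and
    // switch to incremental aggregation for everything that arrives later.
    if let BatchState::Buffering(buf) = state {
        let buffered = std::mem::take(buf);
        let running_aggregate: u64 = buffered
            .iter()
            .map(|share| aggregate_one(&agg_param, share))
            .sum();
        *state = BatchState::Aggregating { agg_param, running_aggregate };
    }
}

fn main() {
    let mut state = BatchState::Buffering(vec![vec![1, 2, 3], vec![4, 5]]);
    on_report_arrival(&mut state, vec![6]);     // still buffering: parameter unknown
    on_collect_request(&mut state, Vec::new()); // e.g. the trivial () parameter arrives
    on_report_arrival(&mut state, vec![7, 8]);  // now aggregated incrementally
}
```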
@branlwyd I think it's time to flesh out your idea in a PR. I would suggest cribbing off of #297, i.e., creating a new branch based on the current one.
My main requirements are as follows:
For the new query type (fixed-size), we want to make sure to avoid "throwing away" reports due to constraints imposed by batch intervals. I'm not exactly sure how to achieve this, but maybe it'll become clear once the PR is spelled out.
@branlwyd @cjpatton Choosing the batch interval dynamically is not ideal because:
1). As @junyechen1996 and others have mentioned, the coarse-grained timestamp would prevent aggregators from effectively defining intervals in the fixed-size query. For example, a task with a timestamp precision of 1 hour cannot have a dynamic interval shorter than 1 hour. If min_batch_size is reached in 10 minutes, then the use case cannot be supported, unless we throw away all reports after min_batch_size, which can be very wasteful.
2). In the fixed-size query, time and order don't matter, and aggregators generally don't care about report timestamps (other than ones in the distant future or past); any late arrivals can be added to the next batch if another collect request is received. This allows implementations more flexibility.
whether a leader can generate aggregation jobs that are guaranteed to neither underflow min_batch_size nor overflow max_batch_size without having to stop generating aggregation jobs at batch boundaries until enough existing aggregation jobs complete to confirm that the batch won't be too small.
I think this can be worked out by the leader itself. It is the leader's responsibility to create an AggregateShareReq only when min_batch_size reports have been verified successfully (I believe in the interval query the leader is responsible for creating the AggregateShareReq only after the current time has passed the end of the current interval). If not, then the leader should initialise another {agg-init} with a new AggregationJob containing the number of ReportShares missing from min_batch_size.
This is to say an AggregationJob is bound to a batch ID at {agg-init}, but the full list of AggregationJobs is only clear when {collect-flow} happens. Note the newly created AggregationJob may interleave reports with another batch ID, but that's OK as long as no time interval is involved.
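A rough leader-side sketch of this bookkeeping, with hypothetical placeholders (BatchState, start_aggregation_job, send_aggregate_share_req) standing in for whatever an implementation would actually use:

```rust
// Illustrative only: the names below are placeholders, not DAP messages.

struct BatchState {
    batch_id: [u8; 16],
    verified_reports: usize,    // reports that passed verification so far
    unaggregated: Vec<Vec<u8>>, // report shares not yet assigned to a job
}

fn send_aggregate_share_req(batch_id: &[u8; 16]) {
    println!("issuing AggregateShareReq for batch {:02x?}", batch_id);
}

fn start_aggregation_job(batch_id: &[u8; 16], report_shares: Vec<Vec<u8>>) {
    println!(
        "new AggregationJob for batch {:02x?} with {} report shares",
        batch_id,
        report_shares.len()
    );
}

/// Called when the leader wants to satisfy a collect request for this batch.
fn on_collect(batch: &mut BatchState, min_batch_size: usize) {
    if batch.verified_reports >= min_batch_size {
        // Enough verified reports: safe to ask the helper for its aggregate share.
        send_aggregate_share_req(&batch.batch_id);
    } else {
        // Top up the batch with (at least) the number of report shares still
        // missing from min_batch_size, via another {agg-init}.
        let missing = min_batch_size - batch.verified_reports;
        let take = missing.min(batch.unaggregated.len());
        if take > 0 {
            let shares: Vec<Vec<u8>> = batch.unaggregated.drain(..take).collect();
            start_aggregation_job(&batch.batch_id, shares);
        }
    }
}

fn main() {
    let mut batch = BatchState {
        batch_id: [0; 16],
        verified_reports: 90,
        unaggregated: vec![vec![0u8; 32]; 20],
    };
    on_collect(&mut batch, 100); // starts a top-up job with 10 report shares
}
```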
On a more general note, I'd prefer we support different collect requirements explicitly and separately at this stage; once we know more about their behaviour and the challenges in implementation, we can consider unifying if necessary. Otherwise we may find ourselves unable to support the use case of one type without breaking the other type.
@cjpatton we should define what exactly "(3.) a query determines a unique batch" means.
From the PR:
These difficulties beg the following essential question: How important is property (3.)? Without it, the Leader is capable of sorting reports into batches as it pleases. For example: The Leader can batch reports by User-Agent, location, and so on. This is still possible with (3.), but it would at least force the Leader to "throw away" reports from clients that are not in the target group but pertain to the query. The Leader can do a kind of weak stuffing attack. Stuffing attacks are still possible with (3.), but again, it would at least force the Leader to throw away any reports that pertain to the query.
This is a valid concern, but since with pure interval-based collection the leader can still do this, I don't see what addressing it improves. It is also worth pointing out that if the task is protected by differential privacy, and the client is sending information like User-Agent in the clear, then the task should accept that grouping batches by the same User-Agent is still privacy-preserving as long as min_batch_size is met.
Thanks for your feedback; I have sent #300 which hopefully makes the approach I am suggesting clear. To respond to a few of the concerns raised in this issue:
1). As @junyechen1996 and others have mentioned, the coarse-grained timestamp would prevent aggregators from effectively defining intervals in the fixed-size query. For example, a task with a timestamp precision of 1 hour cannot have a dynamic interval shorter than 1 hour. If min_batch_size is reached in 10 minutes, then the use case cannot be supported, unless we throw away all reports after min_batch_size, which can be very wasteful.
The randomness is used as a timestamp tie-breaker, so a fixed-size task can create many batches for the same timestamp value if necessary due to rounding. I think that, if the Leader waits to allow collection of a batch to account for clock skew + timestamp truncation, no reports will ever need to be dropped.
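A rough sketch of how a leader might use that total order (rounded timestamp plus per-report randomness as a tie-breaker) to cut exact-size batches; this is purely illustrative, not an algorithm from the draft:

```rust
// Illustrative: sorting on (rounded timestamp, report randomness) gives a
// total order, so the leader can cut batches of exactly min_batch_size even
// when many reports share one rounded timestamp.

#[derive(Clone)]
struct ReportMetadata {
    timestamp: u64, // already rounded to the task's time precision
    rand: [u8; 16], // per-report randomness, used only to break ties
}

fn cut_batches(
    mut reports: Vec<ReportMetadata>,
    min_batch_size: usize,
) -> Vec<Vec<ReportMetadata>> {
    reports.sort_by(|a, b| (a.timestamp, a.rand).cmp(&(b.timestamp, b.rand)));
    reports
        .chunks_exact(min_batch_size) // leftovers wait for the next batch
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let reports: Vec<ReportMetadata> = (0..5)
        .map(|i| ReportMetadata { timestamp: 3600, rand: [i as u8; 16] })
        .collect();
    // Five reports share one rounded timestamp; we still get two full batches of two.
    assert_eq!(cut_batches(reports, 2).len(), 2);
}
```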
It is the leader's responsibility to create an AggregateShareReq only when min_batch_size reports have been verified successfully (I believe in the interval query the leader is responsible for creating the AggregateShareReq only after the current time has passed the end of the current interval). If not, then the leader should initialise another {agg-init} with a new AggregationJob containing the number of ReportShares missing from min_batch_size.
This only works if the batch intervals are allowed to overlap, as you note later in your comment. Other than that, this is somewhat inefficient: a few failed report aggregations can delay a batch from being collectable until another aggregation job is run, and the Leader will only realize it needs to run another aggregation job once the prior aggregation jobs have completed. This "extra" aggregation job is also fairly likely to be quite small compared to the normal batch sizes in use.
On a more general note, I'd prefer we support different collect requirements explicitly and separately at this stage; once we know more about their behaviour and the challenges in implementation, we can consider unifying if necessary.
The current approach in #297 (IMO) effectively creates two similar-but-separate protocols embedded inside the top-level DAP protocol. The approach I suggest in #300 achieves the same overall goal (i.e. generate batches as soon as min_batch_size is hit) by adding a few small, mostly-orthogonal features to the existing protocol. In my experience, the latter approach leads to a more maintainable & extensible design.
This is a valid concern, but since with pure interval-based collection the leader can still do this, I don't see what addressing it improves.
The Leader can only do this if they throw away any reports that don't meet their criteria. Open issue #130 may end up with the client sending report shares directly to the leader & helper; if so, at that point the Helper will be able to detect if the Leader is dropping reports.
How large is the timestamp rounding (i.e. truncation) duration? If it has to be too large the delay would be problematic; I'm not sure what is generally considered large enough to protect privacy.
This is a valid concern, but since with pure interval-based collection the leader can still do this, I don't see what addressing it improves.
The difference is that with interval-based collection the leader is restricted to splitting into intervals running from batch_interval.start for batch_interval.duration (which must be divisible by min_batch_duration). With fixed-size batches the leader can just insert min_batch_size - 1 reports with the same timestamp as the victim and it works.
How large is the timestamp rounding (i.e. truncation) duration? If it has to be too large the delay would be problematic; I'm not sure what is generally considered large enough to protect privacy.
For our use case it'll have to be hours.
@simon-friedberger what is the attack in this case? If we are talking about a Sybil attack, then I don't see the difference between the two types: in the interval query, the leader can generate fake reports with a timestamp in (batch_interval.start, batch_interval.duration); in the fixed-size query, the leader can generate fake reports with any timestamp. How the collection happens doesn't affect the leader's ability to be malicious; am I missing something?
@branlwyd
The randomness is used as a timestamp tie-breaker, so a fixed-size task can create many batches for the same timestamp value if necessary due to rounding. I think that, if the Leader waits to allow collection of a batch to account for clock skew + timestamp truncation, no reports will ever need to be dropped.
One reason to support fixed-size collection is to deliver an aggregation result as soon as the batch size condition is met. If the leader has to wait then it might defeat this purpose. In my example above, if I know the task will usually aggregate min_batch_size reports in 10 minutes, but due to privacy constraints the most precise timestamp I can have is 1 hour, then I'll have to wait for 1 hour to get a batch that would have been available after 10 minutes. Whether to drop the extra reports is kind of a secondary concern.
This only works if the batch intervals are allowed to overlap, as you note later in your comment. Other than that, this is somewhat inefficient: a few failed report aggregations can delay a batch from being collectable until another aggregation job is run, and the Leader will only realize it needs to run another aggregation job once the prior aggregation jobs have completed. This "extra" aggregation job is also fairly likely to be quite small compared to the normal batch sizes in use.
True, but my point was that the interval doesn't matter, so there's no real "overlap". Besides, I think it's better for privacy if the chronological order of reports is not preserved inter- or intra-batch, when slicing on a time window is not required. We can work around the inefficiency above by allowing the leader to collect slightly more than min_batch_size but still less than max_batch_size, to minimise the need for "extra" aggregation jobs. After all, verification failures should be rare unless the system is under attack.
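A tiny sketch of the relaxed rule being suggested here (illustrative only; verified_count stands for whatever per-batch counter a leader implementation keeps):

```rust
// Sketch of the relaxed collectability rule: a batch is collectable once it
// holds at least min_batch_size verified reports, and the leader may let it
// grow somewhat past min_batch_size (but never past max_batch_size) to absorb
// occasional verification failures without an extra aggregation job.
fn batch_is_collectable(verified_count: usize, min_batch_size: usize, max_batch_size: usize) -> bool {
    verified_count >= min_batch_size && verified_count <= max_batch_size
}

fn main() {
    assert!(!batch_is_collectable(99, 100, 120));  // still short of the minimum
    assert!(batch_is_collectable(105, 100, 120));  // slightly over min, still fine
    assert!(!batch_is_collectable(121, 100, 120)); // overflows max_batch_size
}
```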
The current approach in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/297 (IMO) effectively creates two similar-but-separate protocols embedded inside the top-level DAP protocol.
That's indeed my preference; the advantage is that adopters who are only interested in one query type don't need to worry about the other one so much. I'm not against unifying implementation details, but I think it's better to have two distinct sub-protocols that support the two use cases well than one sub-protocol that has to compromise. At this point, I think I should go read your PR :)
I think that the need to round timestamps on the order of hours likely disallows the approach in #300, since that approach may delay batches for a period of time up to the timestamp rounding factor. I'm going to retract that PR & add a few comments on #297.
in the interval query, the leader can generate fake reports with a timestamp in (batch_interval.start, batch_interval.duration); in the fixed-size query, the leader can generate fake reports with any timestamp. How the collection happens doesn't affect the leader's ability to be malicious; am I missing something?
@wangshan I don't think you are. It's a minor difference in the power of the attacker. In the first case, other reports in the interval either have to be dropped or will add noise. In the second case, the attack can be executed more precisely. As stated on the mailing list, I don't think this has a big impact either.
One reason to support fixed-size collection is to deliver an aggregation result as soon as the batch size condition is met. If the leader has to wait then it might defeat this purpose. In my example above, if I know the task will usually aggregate min_batch_size reports in 10 minutes, but due to privacy constraints the most precise timestamp I can have is 1 hour, then I'll have to wait for 1 hour to get a batch that would have been available after 10 minutes. Whether to drop the extra reports is kind of a secondary concern.
Is this coherent? Even if the timestamp precision is 1 hour for privacy, the correct timestamp is probably within the last 10 minutes...
Is this coherent? Even if the timestamp precision is 1 hour for privacy, the correct timestamp is probably within the last 10 minutes...
At least one privacy violation I can think of requires knowing the clock skew between client & aggregator. Specifically, a malicious leader could bucket reports by apparent clock skew between client timestamp & server timestamp. Then if they learn a client device's clock skew, they can correlate the reports in the relevant buckets as likely having been generated by that device.
Rounding timestamps to a large enough degree protects against this attack, since all reports will fall into the same bucket. (Though I think if the malicious actor could control when a client submitted reports, they might be able to figure out bounds on the client's clock skew by triggering a report upload near a rounding boundary and seeing how the client rounds the timestamp.) And even if we know the report arrived at the server in the last 10 minutes, that doesn't tell us what the clock skew between client & aggregator is.
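For concreteness, the rounding being discussed is just truncation of the timestamp to a multiple of the task's precision; a minimal sketch, assuming a time_precision expressed in seconds:

```rust
// Truncate a report timestamp (seconds since epoch) down to the nearest
// multiple of the task's time precision, so all reports within one window
// carry the same timestamp and cannot be bucketed by apparent clock skew.
fn round_timestamp(timestamp: u64, time_precision: u64) -> u64 {
    timestamp - (timestamp % time_precision)
}

fn main() {
    // With a 1-hour precision, anything inside the hour collapses to its start.
    assert_eq!(round_timestamp(10_000, 3600), 7_200);
}
```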
That said, I'm curious if there are other privacy violations to consider here -- has the threat model around the client timestamp been discussed somewhere already?
The main concern about timestamps -- or any other report metadata -- is that it creates a fingerprinting surface that a Helper can use to control the reports that end up in the batch. Of course it should be noted that the Leader also has this capability, and more.
My primary concerns about #300 (as it's currently spelled out) are more operational:
The delay before a batch becomes collectable can be significant if time_precision is large. The amount of storage Aggregators need depends on this value; the larger the better.
@branlwyd I originally brought this up in this issue: https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/274
The threat I was referring to is the following: suppose there is a proxy between the client and the aggregator (like the one described in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/294), and the proxy is owned by the leader (e.g. the leader's edge server). Assuming we encrypt the report from the client to the proxy and on to the aggregator (so no one other than the leader sees the timestamp), if the timestamp is precise, then an attacker at the leader with access to the input of the proxy can look up the time packets arrived and figure out which client a report came from (maybe with the help of other info, like the size of the packet). Rounding the timestamp makes this a lot harder.
In the fixed-size query case, the main purpose of the timestamp is to allow aggregators to filter out very old (or far-future) reports.
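As a small illustration of that check (the tolerances max_age and max_future_skew are hypothetical knobs, not parameters defined by the protocol):

```rust
// Illustrative check: in a fixed-size task the timestamp is only used to
// reject reports that are implausibly old or claim to come from the future.
fn report_timestamp_acceptable(report_time: u64, now: u64, max_age: u64, max_future_skew: u64) -> bool {
    report_time + max_age >= now                // not too old
        && report_time <= now + max_future_skew // not too far in the future
}

fn main() {
    let now = 1_000_000;
    assert!(report_timestamp_acceptable(999_000, now, 86_400, 300));
    assert!(!report_timestamp_acceptable(now + 3_600, now, 86_400, 300)); // too far in the future
    assert!(!report_timestamp_acceptable(100, now, 86_400, 300));         // far too old
}
```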
If the collector is not interested in time and intervals, but simply wants to collect an aggregate over a batch that meets min_batch_size, can the protocol support a 'batch-based collection' instead of the current interval-based one?
Consider a Prio use case: when max_batch_lifetime == 1, the collector only needs to collect the aggregation so far, with a batch size B >= min_batch_size. This can be orchestrated by the leader, which can implement a counting service to track the batch size for a task; once it reaches min_batch_size, the leader sends an AggregateShareReq to collect the helper's aggregate_share and returns it to the collector.
This requires a new id to tie the agg-flow to the agg-share-flow. For example, in addition to agg_job_id, the leader can send a unique batch_id in every AggregateReq. At collect time, the leader uses the same batch_id to collect the output_share from the helper (the helper can still proactively aggregate output_shares into an aggregate_share; since there are no more batch windows, the helper can store the aggregate_share by agg_job_id, or accumulate all aggregation jobs' output_shares into one aggregate_share and store it with the batch_id). Illustrated as follows:
Here [T0, Tm] is the time it takes to aggregate min_batch_size reports; it has no relationship with min_batch_duration or the collector's interval.
As this issue pointed out: https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195, to avoid a privacy leak each batch window must contain at least min_batch_size reports, otherwise an attacker can find ways to slice intervals to break the privacy guarantee. But if the protocol does require that each batch window meets min_batch_size, then the collect interval itself is no longer useful, since the duration it takes to meet min_batch_size is the smallest duration that can be queried. Therefore, it seems to make sense to base collection entirely on batch size, not interval.
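Purely as an illustration of the counting-service idea above (BatchCounter, record_verified_report, and the 16-byte batch_id representation are all made-up names, not protocol elements), a leader-side sketch might look like this:

```rust
use std::collections::HashMap;

// Illustrative sketch of the leader-side "counting service": it tracks how
// many verified reports have accumulated under each batch_id and signals when
// a batch has reached min_batch_size and can be collected via AggregateShareReq.
struct BatchCounter {
    min_batch_size: usize,
    counts: HashMap<[u8; 16], usize>, // batch_id -> verified report count
}

impl BatchCounter {
    fn new(min_batch_size: usize) -> Self {
        Self { min_batch_size, counts: HashMap::new() }
    }

    /// Record one successfully verified report for `batch_id`; returns true
    /// once the batch has reached min_batch_size and is ready to collect.
    fn record_verified_report(&mut self, batch_id: [u8; 16]) -> bool {
        let count = self.counts.entry(batch_id).or_insert(0);
        *count += 1;
        *count >= self.min_batch_size
    }
}

fn main() {
    let mut counter = BatchCounter::new(3);
    let batch_id = [0u8; 16];
    assert!(!counter.record_verified_report(batch_id));
    assert!(!counter.record_verified_report(batch_id));
    assert!(counter.record_verified_report(batch_id)); // time to send AggregateShareReq
}
```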