In this case the aggregators no longer need to worry about min_batch_duration. The leader can store the last known Tm to filter out late arrivals if needed.
This seems similar to my suggestion in #218 of having a batch_filter to allow more flexible specification of subsets of reports.
Yes, batch_filter sounds like a more generic name that can incorporate both batch interval and batch id.
But I think allowing an advanced filter can open up privacy concerns. When slicing the batches, one not only has to make sure the sliced batch meets all privacy guarantees, but also that all the deltas with previously collected slices do. We may find a way to do this with time intervals, but adding other metadata like region will make it extremely hard. Such metadata will likely be related to client measurements, so I think it's better to encode that information in the measurement, or to design different tasks for it.
There are use cases where drill-down or fine-grained filtering is not needed; users simply want aggregate results with a privacy guarantee.
Agreed, previous discussion on this topic is mainly here: #195
I think this would be useful. Let me check if I understand the protocol changes that are required:
For batch-based tasks, an aggregate result corresponds to a unique set of exactly min_batch_size measurements. Reports are sorted into batches by the leader.
I think exact min_batch_size may be too strong; >= should be sufficient, so implementations can control how close to min_batch_size they want to be.
Per discussion in IETF 114, let's aim to support this in the next version of the draft, with a simpler (but equivalent) version of the existing query validation.
Looking at the implementation strategy for #297, I think there is a different approach which would be less disruptive to the existing design, provide better unification between the different query types, allow better precision w.r.t. the number of reports per batch in the new fixed-size case, add less complexity to the helper, and give the leader less power to choose how reports are mapped into batches. (I suggested this in a review comment on #297, but @simon-friedberger correctly pointed out that this discussion would be less confusing in a GH issue.)
The high-level idea is: the current approach in #297 for fixed-size tasks is to bind aggregation jobs to batches (identified by an opaque "batch ID") at the time of creation of the aggregation job. Instead, we could identify batches in fixed-size tasks by a batch interval (as is currently done with time-interval task types), dropping the concept of an opaque batch ID entirely. Another way of stating this idea is that, rather than introducing a new query type, we view this new feature as choosing how batch intervals are decided: time-interval generates batch intervals on a fixed cadence, fixed-size generates batch intervals dynamically based on the rate of incoming reports.
Much of the protocol could be shared between the fixed-size & time-interval task types. The largest difference would be in batch validation: time-interval batch intervals would be considered valid if the interval's start & duration are multiples of min_batch_duration (i.e. same as today), while fixed-size batch intervals would be considered valid if they contain exactly[1] min_batch_size reports. (This is one part of the suggested design that adds complexity to the helper -- both the leader & the helper would need to implement this batch validation logic; a rough sketch appears just below.)
It would need to be worked out whether a leader can generate aggregation jobs that are guaranteed to neither underflow min_batch_size nor overflow max_batch_size without having to stop generating aggregation jobs at batch boundaries until enough existing aggregation jobs complete to confirm that the batch won't be too small. (This doesn't change how a helper would operate; all the helper needs to do is verify that the chosen batches are of the correct size, as suggested in the previous point.)
The collector would need some way to learn which batch intervals to collect for fixed-size tasks, since the batch intervals are based on the rate of incoming reports, which the collector does not have insight into. In the interest of keeping the task types unified, this could also be implemented for time-interval task types. (All of this complexity falls onto the leader only.)
I touch on a few more points in the original PR comment, but this captures the big idea.
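To make the batch-validation difference concrete, here is a minimal sketch of the check described above; the type and field names are illustrative rather than taken from the draft:

```rust
// Illustrative sketch of the batch-validation rule discussed above.
// All type and field names are made up, not taken from the DAP draft.

struct TimeInterval {
    start: u64,    // seconds since epoch
    duration: u64, // seconds
}

enum QueryType {
    TimeInterval { min_batch_duration: u64 },
    FixedSize { min_batch_size: usize },
}

/// Returns true if the batch identified by `interval`, containing
/// `report_count` reports, is valid for the given query type.
fn batch_is_valid(query: &QueryType, interval: &TimeInterval, report_count: usize) -> bool {
    match query {
        // time-interval: start & duration must be multiples of
        // min_batch_duration (same as today).
        QueryType::TimeInterval { min_batch_duration } => {
            let d = *min_batch_duration;
            interval.duration > 0
                && interval.start % d == 0
                && interval.duration % d == 0
        }
        // fixed-size: the interval must contain exactly min_batch_size
        // reports (both leader and helper would run this check).
        QueryType::FixedSize { min_batch_size } => report_count == *min_batch_size,
    }
}

fn main() {
    let ti = QueryType::TimeInterval { min_batch_duration: 3600 };
    let fs = QueryType::FixedSize { min_batch_size: 100 };
    let interval = TimeInterval { start: 7200, duration: 3600 };
    assert!(batch_is_valid(&ti, &interval, 250)); // aligned interval, any report count
    assert!(batch_is_valid(&fs, &interval, 100)); // exactly min_batch_size reports
    assert!(!batch_is_valid(&fs, &interval, 99)); // too few reports
}
```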
[1] #297 includes min_batch_duration and max_batch_duration to provide the leader some leeway in producing acceptably-sized batches. I think this is no longer required, as the leader can always choose an interval to contain an exact number of reports (assuming we include a way to break timestamp ties, which is possible but probably a little messy).
[2] Somewhat off-topic from the issue at hand, but I think allowing collect requests to be made before a batch interval is complete is valuable for performance reasons, too. Tasks using a VDAF with a trivial (i.e. ()) aggregation parameter can incrementally aggregate reports as they arrive. This isn't currently possible for tasks with a nontrivial aggregation parameter, since the aggregation parameter to use is not communicated to the aggregators until the collect request is made. But if the collect request arrives before the batch is complete, we can start incrementally aggregating reports at that point. This, of course, assumes that the collector knows the aggregation parameter before the batch interval is complete -- seems like a fair assumption, but may not always be true?
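As an illustration of the incremental-aggregation point in [2], here is a sketch of how an aggregator might buffer report shares until the aggregation parameter is known and then switch to incremental aggregation; all types and functions here are hypothetical stand-ins, not a real VDAF API:

```rust
// Hypothetical sketch: buffer reports until the aggregation parameter is
// known, then fold them (and all later arrivals) into a running aggregate.

enum BatchState {
    // Aggregation parameter not yet known: buffer raw report shares.
    Buffering(Vec<Vec<u8>>),
    // Parameter known (e.g. the trivial () parameter): keep a running aggregate.
    Aggregating { agg_param: Vec<u8>, running_aggregate: u64 },
}

// Stand-in for running VDAF preparation/aggregation on a single report share.
fn aggregate_one(_agg_param: &[u8], report_share: &[u8]) -> u64 {
    report_share.len() as u64
}

fn on_report_arrival(state: &mut BatchState, report_share: Vec<u8>) {
    match state {
        BatchState::Buffering(buf) => buf.push(report_share),
        BatchState::Aggregating { agg_param, running_aggregate } => {
            // Parameter already known: fold the report in immediately.
            *running_aggregate += aggregate_one(agg_param.as_slice(), &report_share);
        }
    }
}

fn on_collect_request(state: &mut BatchState, agg_param: Vec<u8>) {
    // Once the collect request supplies the parameter, drain the buffer and
    // switch to incremental aggregation for everything that arrives later.
    if let BatchState::Buffering(buf) = state {
        let buffered = std::mem::take(buf);
        let running_aggregate: u64 = buffered
            .iter()
            .map(|share| aggregate_one(&agg_param, share))
            .sum();
        *state = BatchState::Aggregating { agg_param, running_aggregate };
    }
}

fn main() {
    let mut state = BatchState::Buffering(vec![vec![1, 2, 3], vec![4, 5]]);
    on_report_arrival(&mut state, vec![6]);     // still buffering: parameter unknown
    on_collect_request(&mut state, Vec::new()); // e.g. the trivial () parameter arrives
    on_report_arrival(&mut state, vec![7, 8]);  // now aggregated incrementally
}
```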
@branlwyd I think it's time to flesh out your idea in a PR. I would suggest cribbing off of #297, i.e., creating a new branch based on the current one.
My main requirements are as follows:
For the new query type (fixed-size), we want to make sure to avoid "throwing away" reports due to constraints imposed by batch intervals. I'm not exactly sure how to achieve this, but maybe it'll become clear once the PR is spelled out.
@branlwyd @cjpatton Choosing the batch interval dynamically is not ideal because:
1). As @junyechen1996 and others have mentioned, the coarse-grained timestamp would prevent aggregators from effectively defining intervals in the fixed-size query. For example, a task with a timestamp precision of 1 hour cannot have a dynamic interval shorter than 1 hour. If min_batch_size is reached in 10 minutes, then the use case cannot be supported, unless we throw away all reports after min_batch_size, which can be very wasteful.
2). In the fixed-size query, time and order don't matter, and aggregators generally don't care about report timestamps (other than ones in the distant future or past); any late arrivals can be added to the next batch if another collect request is received. This allows implementations more flexibility.
whether a leader can generate aggregation jobs that are guaranteed to neither underflow min_batch_size nor overflow max_batch_size without having to stop generating aggregation jobs at batch boundaries until enough existing aggregation jobs complete to confirm that the batch won't be too small.
I think this can be worked out by the leader itself. It is the leader's responsibility to create an AggregateShareReq only when min_batch_size reports have been verified successfully (I believe in the interval query the leader is responsible for creating the AggregateShareReq only after the current time has passed the end of the current interval). If not, then the leader should initialise another {agg-init} with a new AggregationJob containing the number of ReportShares missing from min_batch_size.
This is to say an AggregationJob is bound to a batch ID at {agg-init}, but the full list of AggregationJobs is only clear when {collect-flow} happens. Note the newly created AggregationJob may interleave reports with another batch ID, but that's OK as long as no time interval is involved.
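A rough leader-side sketch of this bookkeeping, with hypothetical placeholders (BatchState, start_aggregation_job, send_aggregate_share_req) standing in for whatever an implementation would actually use:

```rust
// Illustrative only: the names below are placeholders, not DAP messages.

struct BatchState {
    batch_id: [u8; 16],
    verified_reports: usize,    // reports that passed verification so far
    unaggregated: Vec<Vec<u8>>, // report shares not yet assigned to a job
}

fn send_aggregate_share_req(batch_id: &[u8; 16]) {
    println!("issuing AggregateShareReq for batch {:02x?}", batch_id);
}

fn start_aggregation_job(batch_id: &[u8; 16], report_shares: Vec<Vec<u8>>) {
    println!(
        "new AggregationJob for batch {:02x?} with {} report shares",
        batch_id,
        report_shares.len()
    );
}

/// Called when the leader wants to satisfy a collect request for this batch.
fn on_collect(batch: &mut BatchState, min_batch_size: usize) {
    if batch.verified_reports >= min_batch_size {
        // Enough verified reports: safe to ask the helper for its aggregate share.
        send_aggregate_share_req(&batch.batch_id);
    } else {
        // Top up the batch with (at least) the number of report shares still
        // missing from min_batch_size, via another {agg-init}.
        let missing = min_batch_size - batch.verified_reports;
        let take = missing.min(batch.unaggregated.len());
        if take > 0 {
            let shares: Vec<Vec<u8>> = batch.unaggregated.drain(..take).collect();
            start_aggregation_job(&batch.batch_id, shares);
        }
    }
}

fn main() {
    let mut batch = BatchState {
        batch_id: [0; 16],
        verified_reports: 90,
        unaggregated: vec![vec![0u8; 32]; 20],
    };
    on_collect(&mut batch, 100); // starts a top-up job with 10 report shares
}
```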
On a more general note, I'd prefer we support different collect requirements explicitly and separately at this stage; once we know more about their behaviour and the challenges in implementation, we can consider unifying if necessary. Otherwise we may find ourselves unable to support the use case of one type without breaking the other type.
@cjpatton we should define what exactly "(3.) a query determines a unique batch" means.
From the PR:
These difficulties beg the following essential question: How important is property (3.)? Without it, the Leader is capable of sorting reports into batches as it pleases. For example: The Leader can batch reports by User-Agent, location, and so on. This is still possible with (3.), but it would at least force the Leader to "throw away" reports from clients that are not in the target group but pertain to the query. The Leader can do a kind of weak stuffing attack. Stuffing attacks are still possible with (3.), but again, it would at least force the Leader to throw away any reports that pertain to the query.
This is a valid concern, but since with pure interval-based collection the leader can still do this, I don't see what addressing it improves. It is also worth pointing out that if the task is protected by differential privacy, and the client is sending information like User-Agent in the clear, then the task should accept that grouping batches by the same User-Agent is still privacy-preserving as long as min_batch_size is met.
Thanks for your feedback; I have sent #300 which hopefully makes the approach I am suggesting clear. To respond to a few of the concerns raised in this issue:
1). As @junyechen1996 and others have mentioned, the coarse-grained timestamp would prevent aggregators from effectively defining intervals in the fixed-size query. For example, a task with a timestamp precision of 1 hour cannot have a dynamic interval shorter than 1 hour. If min_batch_size is reached in 10 minutes, then the use case cannot be supported, unless we throw away all reports after min_batch_size, which can be very wasteful.
The randomness is used as a timestamp tie-breaker, so a fixed-size task can create many batches for the same timestamp value if necessary due to rounding. I think that, if the Leader waits to allow collection of a batch to account for clock skew + timestamp truncation, no reports will ever need to be dropped.
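A rough sketch of how a leader might use that total order (rounded timestamp plus per-report randomness as a tie-breaker) to cut exact-size batches; this is purely illustrative, not an algorithm from the draft:

```rust
// Illustrative: sorting on (rounded timestamp, report randomness) gives a
// total order, so the leader can cut batches of exactly min_batch_size even
// when many reports share one rounded timestamp.

#[derive(Clone)]
struct ReportMetadata {
    timestamp: u64, // already rounded to the task's time precision
    rand: [u8; 16], // per-report randomness, used only to break ties
}

fn cut_batches(
    mut reports: Vec<ReportMetadata>,
    min_batch_size: usize,
) -> Vec<Vec<ReportMetadata>> {
    reports.sort_by(|a, b| (a.timestamp, a.rand).cmp(&(b.timestamp, b.rand)));
    reports
        .chunks_exact(min_batch_size) // leftovers wait for the next batch
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let reports: Vec<ReportMetadata> = (0..5)
        .map(|i| ReportMetadata { timestamp: 3600, rand: [i as u8; 16] })
        .collect();
    // Five reports share one rounded timestamp; we still get two full batches of two.
    assert_eq!(cut_batches(reports, 2).len(), 2);
}
```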
It is the leader's responsibility to create an AggregateShareReq only when min_batch_size reports have been verified successfully (I believe in the interval query the leader is responsible for creating the AggregateShareReq only after the current time has passed the end of the current interval). If not, then the leader should initialise another {agg-init} with a new AggregationJob containing the number of ReportShares missing from min_batch_size.
This only works if the batch intervals are allowed to overlap, as you note later in your comment. Other than that, this is somewhat inefficient: a few failed report aggregations can delay a batch from being collectable until another aggregation job is run, and the Leader will only realize it needs to run another aggregation job once the prior aggregation jobs have completed. This "extra" aggregation job is also fairly likely to be quite small compared to the normal batch sizes in use.
On a more general note, I'd prefer we support different collect requirements explicitly and separately at this stage; once we know more about their behaviour and the challenges in implementation, we can consider unifying if necessary.
The current approach in #297 (IMO) effectively creates two similar-but-separate protocols embedded inside the top-level DAP protocol. The approach I suggest in #300 achieves the same overall goal (i.e. generate batches as soon as min_batch_size is hit) by adding a few small, mostly-orthogonal features to the existing protocol. In my experience, the latter approach leads to a more maintainable & extensible design.
This is a valid concern, but since with pure interval-based collection the leader can still do this, I don't see what addressing it improves.
The Leader can only do this if they throw away any reports that don't meet their criteria. Open issue #130 may end up with the client sending report shares directly to the leader & helper; if so, at that point the Helper will be able to detect if the Leader is dropping reports.
How large is the timestamp rounding (i.e. truncation) duration? If it has to be too large the delay would be problematic; I'm not sure what is generally considered large enough to protect privacy.
This is a valid concern, but since with pure interval-based collection the leader can still do this, I don't see what addressing it improves.
The difference is that with interval-based collection the leader is restricted to splitting into intervals running from batch_interval.start for batch_interval.duration (which must be divisible by min_batch_duration). With fixed-size batches the leader can just insert min_batch_size - 1 reports with the same timestamp as the victim and it works.
How large is the timestamp rounding (i.e. truncation) duration? If it has to be too large the delay would be problematic; I'm not sure what is generally considered large enough to protect privacy.
For our use case it'll have to be hours.
@simon-friedberger what is the attack in this case? If we are talking about a Sybil attack, then I don't see the difference between the two types: in the interval query, the leader can generate fake reports with a timestamp in (batch_interval.start, batch_interval.duration); in the fixed-size query, the leader can generate fake reports with any timestamp. How the collection happens doesn't affect the leader's ability to be malicious; am I missing something?
@branlwyd
The randomness is used as a timestamp tie-breaker, so a fixed-size task can create many batches for the same timestamp value if necessary due to rounding. I think that, if the Leader waits to allow collection of a batch to account for clock skew + timestamp truncation, no reports will ever need to be dropped.
One reason to support fixed-size collection is to deliver an aggregation result as soon as the batch size condition is met. If the leader has to wait then it might defeat this purpose. In my example above, if I know the task will usually aggregate min_batch_size reports in 10 minutes, but due to privacy constraints the most precise timestamp I can have is 1 hour, then I'll have to wait for 1 hour to get a batch that would have been available after 10 minutes. Whether to drop the extra reports is kind of a secondary concern.
This only works if the batch intervals are allowed to overlap, as you note later in your comment. Other than that, this is somewhat inefficient: a few failed report aggregations can delay a batch from being collectable until another aggregation job is run, and the Leader will only realize it needs to run another aggregation job once the prior aggregation jobs have completed. This "extra" aggregation job is also fairly likely to be quite small compared to the normal batch sizes in use.
True, but my point was that the interval doesn't matter, so there's no real "overlap". Besides, I think it's better for privacy if the chronological order of reports is not preserved inter- or intra-batch, when slicing on a time window is not required. We can work around the inefficiency above by allowing the leader to collect slightly more than min_batch_size but still less than max_batch_size, to minimise the need for "extra" aggregation jobs. After all, verification failures should be rare unless the system is under attack.
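A tiny sketch of the relaxed rule being suggested here (illustrative only; verified_count stands for whatever per-batch counter a leader implementation keeps):

```rust
// Sketch of the relaxed collectability rule: a batch is collectable once it
// holds at least min_batch_size verified reports, and the leader may let it
// grow somewhat past min_batch_size (but never past max_batch_size) to absorb
// occasional verification failures without an extra aggregation job.
fn batch_is_collectable(verified_count: usize, min_batch_size: usize, max_batch_size: usize) -> bool {
    verified_count >= min_batch_size && verified_count <= max_batch_size
}

fn main() {
    assert!(!batch_is_collectable(99, 100, 120));  // still short of the minimum
    assert!(batch_is_collectable(105, 100, 120));  // slightly over min, still fine
    assert!(!batch_is_collectable(121, 100, 120)); // overflows max_batch_size
}
```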
The current approach in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/297 (IMO) effectively creates two similar-but-separate protocols embedded inside the top-level DAP protocol.
That's indeed my preference; the advantage is that adopters who are only interested in one query type don't need to worry about the other one so much. I'm not against unifying implementation details, but I think it's better to have two distinct sub-protocols that support the two use cases well than one sub-protocol that has to compromise. At this point, I think I should go read your PR :)
I think that the need to round timestamps on the order of hours likely disallows the approach in #300, since that approach may delay batches for a period of time up to the timestamp rounding factor. I'm going to retract that PR & add a few comments on #297.
in the interval query, the leader can generate fake reports with a timestamp in (batch_interval.start, batch_interval.duration); in the fixed-size query, the leader can generate fake reports with any timestamp. How the collection happens doesn't affect the leader's ability to be malicious; am I missing something?
@wangshan I don't think you are. It's a minor difference in the power of the attacker. In the first case, other reports in the interval either have to be dropped or will add noise. In the second case, the attack can be executed more precisely. As stated on the mailing list, I don't think this has a big impact either.
One reason to support fixed-size collection is to deliver an aggregation result as soon as the batch size condition is met. If the leader has to wait then it might defeat this purpose. In my example above, if I know the task will usually aggregate min_batch_size reports in 10 minutes, but due to privacy constraints the most precise timestamp I can have is 1 hour, then I'll have to wait for 1 hour to get a batch that would have been available after 10 minutes. Whether to drop the extra reports is kind of a secondary concern.
Is this coherent? Even if the timestamp precision is 1 hour for privacy, the correct timestamp is probably within the last 10 minutes...
Is this coherent? Even if the timestamp precision is 1 hour for privacy, the correct timestamp is probably within the last 10 minutes...
At least one privacy violation I can think of requires knowing the clock skew between client & aggregator. Specifically, a malicious leader could bucket reports by apparent clock skew between client timestamp & server timestamp. Then if they learn a client device's clock skew, they can correlate the reports in the relevant buckets as likely having been generated by that device.
Rounding timestamps to a large enough degree protects against this attack, since all reports will fall into the same bucket. (Though I think if the malicious actor could control when a client submitted reports, they might be able to figure out bounds on the client's clock skew by triggering a report upload near a rounding boundary and seeing how the client rounds the timestamp.) And even if we know the report arrived at the server in the last 10 minutes, that doesn't tell us what the clock skew between client & aggregator is.
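For concreteness, the rounding being discussed is just truncation of the timestamp to a multiple of the task's precision; a minimal sketch, assuming a time_precision expressed in seconds:

```rust
// Truncate a report timestamp (seconds since epoch) down to the nearest
// multiple of the task's time precision, so all reports within one window
// carry the same timestamp and cannot be bucketed by apparent clock skew.
fn round_timestamp(timestamp: u64, time_precision: u64) -> u64 {
    timestamp - (timestamp % time_precision)
}

fn main() {
    // With a 1-hour precision, anything inside the hour collapses to its start.
    assert_eq!(round_timestamp(10_000, 3600), 7_200);
}
```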
That said, I'm curious if there are other privacy violations to consider here -- has the threat model around the client timestamp been discussed somewhere already?
The main concern about timestamps -- or any other report metadata -- is that it creates a fingerprinting surface that a Helper can use to control the reports that end up in the batch. Of course it should be noted that the Leader also has this capability, and more.
My primary concerns about #300 (as it's currently spelled out) are more operational:
The delay before a batch becomes collectable can be significant if time_precision is large. The amount of storage Aggregators need depends on this value; the larger the better.
@branlwyd I originally brought this up in this issue: https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/274
The threat I was referring to is the following: suppose there is a proxy between the client and the aggregator (like the one described in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/294), and the proxy is owned by the leader (e.g. the leader's edge server). Assuming we encrypt the report from the client to the proxy and on to the aggregator (so no one other than the leader sees the timestamp), if the timestamp is precise, then an attacker at the leader with access to the input of the proxy can look up the time packets arrived and figure out which client a report came from (maybe with the help of other info, like the size of the packet). Rounding the timestamp makes this a lot harder.
In the fixed-size query case, the main purpose of the timestamp is to allow aggregators to filter out very old (or far-future) reports.
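As a small illustration of that check (the tolerances max_age and max_future_skew are hypothetical knobs, not parameters defined by the protocol):

```rust
// Illustrative check: in a fixed-size task the timestamp is only used to
// reject reports that are implausibly old or claim to come from the future.
fn report_timestamp_acceptable(report_time: u64, now: u64, max_age: u64, max_future_skew: u64) -> bool {
    report_time + max_age >= now                // not too old
        && report_time <= now + max_future_skew // not too far in the future
}

fn main() {
    let now = 1_000_000;
    assert!(report_timestamp_acceptable(999_000, now, 86_400, 300));
    assert!(!report_timestamp_acceptable(now + 3_600, now, 86_400, 300)); // too far in the future
    assert!(!report_timestamp_acceptable(100, now, 86_400, 300));         // far too old
}
```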
If the collector is not interested in time and intervals, but simply wants to collect an aggregate over a batch that meets min_batch_size, can the protocol support a 'batch-based collection' instead of the current interval-based one?
Consider a Prio use case: when max_batch_lifetime == 1, the collector only needs to collect the aggregation so far, with a batch size B >= min_batch_size. This can be orchestrated by the leader, which can implement a counting service to track the batch size for a task; once it reaches min_batch_size, the leader sends an AggregateShareReq to collect the helper's aggregate_share and returns it to the collector.
This requires a new id to tie the agg-flow to the agg-share-flow. For example, in addition to agg_job_id, the leader can send a unique batch_id in every AggregateReq. At collect time, the leader uses the same batch_id to collect the output_share from the helper (the helper can still proactively aggregate output_shares into an aggregate_share; since there are no more batch windows, the helper can store the aggregate_share by agg_job_id, or accumulate all aggregation jobs' output_shares into one aggregate_share and store it with the batch_id). Illustrated as follows:
Here [T0, Tm] is the time it takes to aggregate min_batch_size reports; it has no relationship with min_batch_duration or the collector's interval.
As this issue pointed out: https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195, to avoid a privacy leak each batch window must contain at least min_batch_size reports, otherwise an attacker can find ways to slice intervals to break the privacy guarantee. But if the protocol does require that each batch window meets min_batch_size, then the collect interval itself is no longer useful, since the duration it takes to meet min_batch_size is the smallest duration that can be queried. Therefore, it seems to make sense to base collection entirely on batch size, not interval.
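Purely as an illustration of the counting-service idea above (BatchCounter, record_verified_report, and the 16-byte batch_id representation are all made-up names, not protocol elements), a leader-side sketch might look like this:

```rust
use std::collections::HashMap;

// Illustrative sketch of the leader-side "counting service": it tracks how
// many verified reports have accumulated under each batch_id and signals when
// a batch has reached min_batch_size and can be collected via AggregateShareReq.
struct BatchCounter {
    min_batch_size: usize,
    counts: HashMap<[u8; 16], usize>, // batch_id -> verified report count
}

impl BatchCounter {
    fn new(min_batch_size: usize) -> Self {
        Self { min_batch_size, counts: HashMap::new() }
    }

    /// Record one successfully verified report for `batch_id`; returns true
    /// once the batch has reached min_batch_size and is ready to collect.
    fn record_verified_report(&mut self, batch_id: [u8; 16]) -> bool {
        let count = self.counts.entry(batch_id).or_insert(0);
        *count += 1;
        *count >= self.min_batch_size
    }
}

fn main() {
    let mut counter = BatchCounter::new(3);
    let batch_id = [0u8; 16];
    assert!(!counter.record_verified_report(batch_id));
    assert!(!counter.record_verified_report(batch_id));
    assert!(counter.record_verified_report(batch_id)); // time to send AggregateShareReq
}
```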