ietf-wg-ppm / draft-ietf-ppm-dap

This document describes the Distributed Aggregation Protocol (DAP) being developed by the PPM working group at IETF.

Consider removing `max_batch_size`. #555

Open branlwyd opened 1 month ago

branlwyd commented 1 month ago

Once multi-collection of batches has been removed, we should consider whether the max_batch_size parameter of the fixed-size query type can be removed as well.

See https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/554/files#r1577003240 for context; in short, max_batch_size's strongest use-case is in the case of a multiply-collected VDAF. It is unclear if there is a strong justification for this parameter in a DAP which supports only singly-collected VDAFs.
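For reference, the two parameters in question might be modeled roughly like this (a Rust sketch of the task-level knobs under discussion, not the draft's actual wire encoding):

```rust
/// Rough sketch of the fixed-size query type's batch-size parameters.
/// Field names follow the draft; the types and layout are illustrative.
struct FixedSizeConfig {
    /// Privacy floor: a batch may not be collected until it contains
    /// at least this many reports.
    min_batch_size: u64,
    /// The parameter this issue proposes to remove. `None` is read as
    /// "no upper bound" (the interpretation taskprov takes; see below).
    max_batch_size: Option<u64>,
}
```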

cjpatton commented 1 month ago

My thoughts here:

  1. A maximum batch size is useful for capacity planning. In particular, the manner in which an aggregator stores report IDs may depend on how large of a batch it needs to plan for. In fact, I think we can make the case to add this parameter to all tasks, not just tasks with this query type.
  2. If we take this change, then the name "fixed-size" no longer really applies, so we should consider renaming it as well.
cjpatton commented 1 month ago

cc/ @wangshan who may have thoughts here.

branlwyd commented 1 month ago

Hmm -- I was thinking of the removal of max_batch_size as indicating that batches must be exactly min_batch_size in size (i.e. if we took this change, we could also rename min_batch_size to just batch_size), as sketched below. If we want to keep the existing semantics, in which a missing max_batch_size leads to unbounded batch sizes, IMO we should keep max_batch_size.
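In other words, under the removal I had in mind, the config sketched earlier would collapse to something like this (again, purely illustrative):

```rust
/// Hypothetical config if removal means "batches are exactly this
/// size": the (min, max) pair collapses to a single parameter.
struct ExactSizeConfig {
    /// Every batch must contain exactly this many reports.
    batch_size: u64,
}
```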

cjpatton commented 1 month ago

Ah, thanks for clarifying! FWIW, taskprov interprets unspecified max_batch_size as unlimited: https://datatracker.ietf.org/doc/html/draft-wang-ppm-dap-taskprov-06#section-3-9

cjpatton commented 1 week ago

One more thought here, after working through some bottlenecks in our DAP service: the maximum batch size is in fact important for capacity planning. In order to implement replay protection, existing DAP implementations generally do some form of sharding of stored report IDs in order to mitigate the cost of checking whether an ID already exists. Your sharding scheme, and the number of shards, depend on the maximum batch size you expect.

Based on this experience, I would actually advocate for the opposite outcome: make the maximum batch size a parameter of every task. I would probably also rename fixed-size to something like "leader-specified", which is closer to the semantics of the query type.

tgeoghegan commented 3 days ago

I think there's a problem with requiring a maximum batch size for time interval. Suppose the collector requests a time interval in which max_batch_size + min_batch_size reports are available for aggregation. Two problems present themselves:

1. Only max_batch_size of the reports may be aggregated. How do the leader and helper choose which reports get included in the batch and which don't? What if they disagree? (Recall that, unlike in the fixed-size case, the leader and helper independently decide batch membership in time interval.)
2. Subsequently, what happens to the remaining min_batch_size reports? The time interval is now collected, so there's no way for the collector to request an aggregation over them.

Solving either problem would require more DAP changes than just making max_batch_size unconditionally required.
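To make the arithmetic concrete (the parameter values here are made up for illustration):

```rust
fn main() {
    // Hypothetical parameter values for the scenario above.
    let min_batch_size = 100u64;
    let max_batch_size = 150u64;

    // Reports available in the requested time interval.
    let available = max_batch_size + min_batch_size; // 250

    // Problem 1: only max_batch_size of them may be aggregated, and
    // the leader and helper must somehow agree on which ones.
    let aggregated = available.min(max_batch_size); // 150

    // Problem 2: the remainder meets the privacy threshold, yet is
    // stranded, because the interval may only be collected once.
    let stranded = available - aggregated; // 100 == min_batch_size
    assert!(stranded >= min_batch_size);
}
```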

Can you provide more detail on why batch size matters to your sharding strategy? Specifically, would a bound on aggregation job size rather than batch size solve your problem?

tgeoghegan commented 3 days ago

A solution to (1) would be to make time interval work more like fixed-size in one specific respect: making the leader responsible for assigning reports to batches. This would have the additional benefit of letting the helper treat every task as fixed-size.

Solving (2) seems tricky. I think you'd need to introduce something like the current_batch/batch ID semantics of fixed size, but compartmentalized by time interval; i.e., the collector would have to make queries not just for some interval of time, but for a tuple of an interval and a batch ID (which could be current_batch). So then the collector could query on ([t1, t2), current_batch) until it gets told there are fewer than min_batch_size reports left in [t1, t2), and it could refer to each collection of that time interval by its batch ID.
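To sketch what that might look like (purely hypothetical; nothing like this exists in the draft today):

```rust
/// Hypothetical batch selector for the scheme sketched above,
/// analogous to the fixed-size query type's current_batch semantics.
enum BatchSelector {
    /// "Whatever batch the leader is currently filling."
    CurrentBatch,
    /// A specific, previously assigned batch.
    BatchId([u8; 32]),
}

/// The collector would query on an (interval, batch) pair rather than
/// an interval alone. Timestamps here are illustrative u64 seconds.
struct TimeIntervalQuery {
    /// Half-open interval [start, end) of report timestamps.
    interval: (u64, u64),
    batch: BatchSelector,
}
```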

cjpatton commented 2 days ago

> Can you provide more detail on why batch size matters to your sharding strategy? Specifically, would a bound on aggregation job size rather than batch size solve your problem?

I don't see how this would help, but it's possible we're doing something weird.

Suppose the task is time interval: we would quantize time into windows, and for each window we have an instance of the underlying storage object. In that object, we store the IDs of all reports with timestamps in that window.

For us, the rate of growth is the most important metric: the faster the aggregation rate in a given time window, the more pressure we put on the underlying storage object. To relieve the pressure, we can further shard it into more objects by a deterministic map from report ID to shard number.

But how many shards should we choose? More shards allow you to cope with a higher aggregation rate, but they also consume more resources, since there are more storage objects in existence simultaneously. Knowing how large a batch might be helps me figure out how many shards I need to cope with a given aggregation rate.
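A minimal sketch of the scheme I'm describing (the hash and the sizing rule are illustrative, not what our service actually does):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically map a report ID to one of `num_shards` storage
/// objects within a time window.
fn shard_for(report_id: &[u8; 16], num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    report_id.hash(&mut hasher);
    hasher.finish() % num_shards
}

/// Illustrative sizing rule: enough shards that no single shard sees
/// more than `per_shard_capacity` report IDs from the largest batch.
fn shard_count(max_batch_size: u64, per_shard_capacity: u64) -> u64 {
    (max_batch_size + per_shard_capacity - 1) / per_shard_capacity
}
```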

> Solving (2) seems tricky. I think you'd need to introduce something like the current_batch/batch ID semantics of fixed size, but compartmentalized by time interval. i.e., the collector would have to make queries not just for some interval of time, but a tuple of an interval and a batch ID (which could be current_batch). So then the collector could query on ([t1, t2), current_batch) until it gets told there are fewer than min_batch_size reports left in [t1, t2), and it could refer to each collection of that time interval by its batch ID.

My first thought here was: "you collected as much data as you need, so throw the rest away". My guess is that you're not happy with that answer :)

Turning this around: are there situations where you don't know a priori how much data to expect? If I have 1,000 users, each reporting about every 10 minutes, then I should expect about 6,000 reports in an hour.

tgeoghegan commented 1 day ago

> Turning this around: are there situations where you don't know a priori how much data to expect? If I have 1,000 users, each reporting about every 10 minutes, then I should expect about 6,000 reports in an hour.

If you're that confident in your estimation of batch sizes, then why can't you use that for your sharding heuristic and leave max_batch_size out of the spec?

Noah-Kennedy commented 1 day ago

> My thoughts here:
>
> 1. A maximum batch size is useful for capacity planning. In particular, the manner in which an aggregator stores report IDs may depend on how large of a batch it needs to plan for. In fact, I think we can make the case to add this parameter to all tasks, not just tasks with this query type.
>
> 2. If we take this change, then the name "fixed-size" no longer really applies, so we should consider renaming it as well.

With async jobs as suggested in #557, IMO the usefulness of this for capacity planning largely goes away.

cjpatton commented 1 day ago

Per the 2024/6/12 DAP sync: @tgeoghegan is correct; I missed something important. The semantics of time interval mean that a maximum batch size cannot be specified for it. And in light of @Noah-Kennedy's point, it's not going to be important for capacity planning once we've added async aggregation jobs.

branlwyd commented 1 day ago

One more argument against specifying max_batch_size:

min_batch_size is a privacy parameter, and must be evaluated by both Leader & Helper. If they do not agree on the minimum batch size, confusion ensues; therefore, it makes sense to specify min_batch_size.

There is no corresponding argument for max_batch_size; more specifically, there is no requirement that the Leader & Helper both evaluate a maximum batch size. We could remove max_batch_size from the specification and leave any maximum size as an implementation detail for the Leader to enforce.
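For example, such a leader-side cap might look like the following (hypothetical code; the point is that nothing here needs to appear in the spec or be known to the Helper):

```rust
/// Hypothetical leader-local policy: with max_batch_size removed from
/// the spec, any cap is purely a Leader implementation detail.
struct LeaderBatchPolicy {
    /// Locally configured cap; never shared with the Helper.
    local_max_batch_size: Option<u64>,
}

impl LeaderBatchPolicy {
    /// Decide whether the Leader should add another report to a batch.
    fn may_extend(&self, current_batch_len: u64) -> bool {
        match self.local_max_batch_size {
            Some(max) => current_batch_len < max,
            None => true,
        }
    }
}
```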

cjpatton commented 1 day ago

This is ready for text @branlwyd. Would you mind putting up a PR?

While we're at it, we should try to think of a replacement for the "fixed-size" name.

tgeoghegan commented 19 hours ago

I think the salient feature is that the leader constructs batches, so the name should reflect that. leader-chosen? I dunno.

cjpatton commented 19 hours ago

That works for me.