Open cjpatton opened 1 month ago
One thing we touched on today: what is the "unit" of backoff, i.e., over what domain does the retry-after header apply?
Plausible ideas include:
For Daphne, it's for jobs sharing a batch "bucket":
Just wanted to quickly document here an alternative mentioned by @divergentdave on a call: allow aggregation jobs to be asynchronous. In response to an aggregation job initialization request, the helper is allowed to respond with 201 Created without producing a response. The leader would then poll the helper to see if the response is ready, similar to how the collector polls the leader on collection jobs.
This gets a little complicated for multi-round VDAFs, but I think we can figure this out.
cc/ @Noah-Kennedy
I quite like this approach, as it gives a lot more flexibility to implementations than the current status quo.
This makes scaling up the protocol a lot easier.
This shouldn't take too much protocol text to achieve: we'd need to clarify that it's OK for the helper to respond to PUT /tasks/{task_id}/aggregation_jobs/{aggregation_job_id} with 201 Created, and then spell out that the leader should poll GET /tasks/{task_id}/aggregation_jobs/{aggregation_job_id} until it gets 200 OK and an AggregationJobResp (basically the same semantics as for collection jobs).
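For illustration, here is a minimal sketch of the leader-side flow described above: PUT the aggregation job, expect 201 Created, then poll GET until 200 OK. The `Response` type, the `put`/`get` callables, and `FakeHelper` are hypothetical stand-ins for real HTTP calls, and treating any non-200 poll response (202 here) as "not ready yet" is an assumption borrowed from how collection jobs are polled, not something settled in this thread.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response."""
    status: int
    body: Optional[bytes] = None


def run_aggregation_job(
    put: Callable[[], Response],
    get: Callable[[], Response],
    max_polls: int = 10,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Leader side: PUT the job, then poll GET until the helper returns 200 OK."""
    resp = put()
    if resp.status != 201:
        raise RuntimeError(f"unexpected status on PUT: {resp.status}")
    for _ in range(max_polls):
        resp = get()
        if resp.status == 200:
            return resp.body  # the encoded AggregationJobResp
        # Assumption: any other status means "not ready yet".
        sleep(poll_interval)
    raise TimeoutError("aggregation job response not ready")


class FakeHelper:
    """Hypothetical in-memory helper that is ready after N polls."""

    def __init__(self, polls_until_ready: int) -> None:
        self.remaining = polls_until_ready

    def put(self) -> Response:
        return Response(201)

    def get(self) -> Response:
        if self.remaining > 0:
            self.remaining -= 1
            return Response(202)
        return Response(200, b"AggregationJobResp")


helper = FakeHelper(polls_until_ready=2)
result = run_aggregation_job(helper.put, helper.get, sleep=lambda _s: None)
```

Whether the leader should also accept an immediate 200 OK on the PUT (the synchronous case) is exactly the open question in this thread, so the sketch deliberately leaves it out.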
Tim's comment raises a good question: if we take this, should we allow both asynchronous & synchronous aggregation job behavior, or only asynchronous?
IMO, it would be nice if we could specify only one of asynchronous/synchronous behavior: requiring implementations to support both modes would be added complexity; specifying both modes but allowing implementations to only implement one mode could lead to interoperability problems between aggregator implementations. But maybe that is too hopeful?
(In general, I think asynchronous aggregation jobs are a good idea since they decouple expensive computation from synchronous HTTP requests. But they do increase the communication cost of each aggregation job -- every aggregation job will now require at least two network round trips, or more precisely two round trips per aggregation step required by the VDAF. I think this means that implementations would want to tune for fewer, larger aggregation jobs to amortize this cost per-report.)
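To put rough numbers on the amortization point, here is a back-of-the-envelope sketch. The function name and the example job sizes are made up for illustration, and the asynchronous figure is a lower bound, since the leader may have to poll more than once per step.

```python
def round_trips_per_report(vdaf_steps: int, reports_per_job: int, asynchronous: bool) -> float:
    """Minimum network round trips per report for a single aggregation job."""
    # Synchronous: one request per aggregation step.
    # Asynchronous: at least one request plus one poll per aggregation step.
    per_step = 2 if asynchronous else 1
    return per_step * vdaf_steps / reports_per_job


sync_small = round_trips_per_report(1, 100, asynchronous=False)
async_small = round_trips_per_report(1, 100, asynchronous=True)
async_large = round_trips_per_report(1, 1000, asynchronous=True)
```

With these hypothetical sizes, growing the job from 100 to 1000 reports more than offsets the doubled round trips per step.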
I do think asynchronous aggregation jobs lead to increased complexity in the "aggregation job state machine".
With synchronous aggregation jobs, the state machine goes from step 1 -> step 2 ... until the aggregation job is complete.
With asynchronous aggregation jobs, an additional "computing" step will be added to each step above, so that the state machine goes computing step 1 -> step 1 -> computing step 2 -> step 2 -> ...
This is not a fatal flaw, just something to consider when weighing synchronous vs asynchronous behaviors.
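As a rough sketch of the difference between the two state machines, the function below lists the states a job passes through in each mode. The state names and the final "complete" state are illustrative labels, not protocol terms.

```python
from typing import List


def job_states(num_steps: int, asynchronous: bool) -> List[str]:
    """List the states an aggregation job passes through, in order."""
    states = []
    for step in range(1, num_steps + 1):
        if asynchronous:
            # Extra state: the helper has accepted the request but is still computing.
            states.append(f"computing step {step}")
        states.append(f"step {step}")
    states.append("complete")
    return states


sync_states = job_states(2, asynchronous=False)
async_states = job_states(2, asynchronous=True)
```

For a two-step VDAF the synchronous job has three states and the asynchronous one has five, so the number of per-step states doubles.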
We can make both asynchronous & synchronous aggregations co-exist by doing something like this:
That sounds to me like it would work just fine. It's up to the Helper if they want to implement both or one or the other.
As far as I am concerned, the additions to compute are pretty negligible compared with the additional scalability we would be getting.
Per 2024/6/12 DAP sync: @erks, would you be alright with making each step asynchronous and not allowing the synchronous option? We just had a call on this, and we expect that the Helper will make this asynchronous most of the time.
I think, from the spec perspective, it's probably okay to move towards the asynchronous option exclusively. I'm just worried about the migration of the existing implementations to the new async model, as there could be periods where Leader and Helper have to support both at the same time.
As mentioned in #556, replay protection implies a relatively stringent operational requirement for the Helper: many aggregation jobs might make concurrent transactions on the same database, which can easily overwhelm the database if not sufficiently provisioned.
To mitigate this problem, the Helper can cancel an aggregation job and ask the Leader to try again later. One option is to respond with HTTP status 429 and a "retry-after" header indicating when it should be safe to retry.
I'd like to suggest that we spell this out explicitly in the draft.
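For concreteness, a minimal sketch of how a Leader might interpret such a response. The function name, the 30-second fallback, and the assumption that header names arrive lower-cased in a plain dict are all illustrative. Retry-After may carry either a number of seconds or an HTTP date, so the sketch handles both.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(
    status: int,
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
    default: float = 30.0,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the helper did not ask for backoff."""
    if status != 429:
        return None
    value = headers.get("retry-after")  # assumption: header names are lower-cased
    if value is None:
        return default
    value = value.strip()
    if value.isdecimal():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


delay = retry_delay_seconds(429, {"retry-after": "120"})
```

This only covers reading the header. Which jobs the delay applies to (the "unit" of backoff) is a separate question that the draft would need to answer.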