ietf-wg-ppm / draft-ietf-ppm-dap

This document describes the Distributed Aggregation Protocol (DAP) being developed by the PPM working group at IETF.

Potential simplification of current-batch collection request semantics #526

Closed branlwyd closed 5 months ago

branlwyd commented 6 months ago

In the fixed-size query type, current-batch collection requests allow the Leader to choose an outstanding batch to associate with the request.

Under the current semantics, the Leader can associate the same batch with multiple current-batch collection requests (DAP-07, Section 4.1.2):

The Collector may not know which batch ID it is interested in; in this case, it can also issue a query of type current_batch, which allows the Leader to select a recent batch to aggregate. The Leader SHOULD select a batch which has not yet began collection.

These semantics were chosen to avoid data loss in the case where a Collector issues a collection request and then crashes before recording the collection job ID. Janus, for example, can associate the same batch with an arbitrary number of current-batch collection requests, until at least one of those collection requests is polled for the first time.

===

However, since the above semantics were chosen, DAP changed such that the Collector now determines the collection job ID itself (as part of the changes for the resource-oriented API). This means that we could similarly avoid data loss if we expect the Collector to durably store the collection job ID before making the collection request for that ID, and lean on either idempotency or appropriate error codes to allow recovery in the face of process failure. This would allow aggregator implementations to associate each batch with exactly one current-batch request.
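To make the proposed Leader-side semantics concrete, here is a minimal in-memory sketch. It is not how Janus or any other implementation does this; the `Leader` class, its method name, and the string batch and job IDs are all hypothetical. It only illustrates the two properties discussed above: each batch is handed to exactly one current-batch collection job, and a retried request for the same Collector-chosen job ID gets the same batch back.

```python
class Leader:
    """Toy Leader model; every name here is hypothetical, not from the spec."""

    def __init__(self, outstanding_batches):
        # Batches that have not yet begun collection, oldest first.
        self._unassigned = list(outstanding_batches)
        # Collection job ID -> batch ID, one batch per job.
        self._job_to_batch = {}

    def handle_current_batch_request(self, job_id):
        # Idempotent retry: a known job ID always maps to the same batch.
        if job_id in self._job_to_batch:
            return self._job_to_batch[job_id]
        if not self._unassigned:
            # Stand-in for whatever error the protocol would return.
            raise LookupError("no batch is ready for collection")
        # Remove the batch from the pool so no other job can be given it.
        batch_id = self._unassigned.pop(0)
        self._job_to_batch[job_id] = batch_id
        return batch_id
```

With these semantics, two distinct job IDs can never be mapped to the same batch, so the Collector has nothing to deduplicate.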

The upside of making this change is that Collectors would no longer need to deduplicate current-batch requests that happen to be mapped to the same batch. Along with the simplified aggregator semantics, I think this is pretty likely to be an overall complexity win.

branlwyd commented 6 months ago

Thinking about what this would look like in terms of textual change to the spec: I think we would just change "The Leader SHOULD select a batch which has not yet began collection." from a SHOULD to a MUST.

We could optionally add implementation advice that the Collector write the collection job ID to durable storage before sending a collection request, to avoid data loss.
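A rough sketch of what that Collector-side advice could look like, under some assumptions: the job ID is generated as 16 random bytes and base64url-encoded for illustration, the "durable storage" is a local file, and `send_request` is a hypothetical callable standing in for the actual collection request. None of these names come from the spec.

```python
import base64
import os
import secrets


def load_or_create_job_id(path):
    """Return the job ID stored at `path`, creating and persisting one if absent."""
    try:
        with open(path, "r", encoding="ascii") as f:
            return f.read().strip()
    except FileNotFoundError:
        pass
    # Assumption: a 16-byte random ID, base64url-encoded without padding.
    raw = secrets.token_bytes(16)
    job_id = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    # Write to a temporary file, flush to disk, then rename into place,
    # so a crash never leaves a partially written ID behind.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        f.write(job_id)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
    return job_id


def start_collection(path, send_request):
    """Persist the job ID first, and only then send the collection request."""
    job_id = load_or_create_job_id(path)
    return send_request(job_id)
```

If the process crashes after the ID is stored but before (or during) the request, calling `start_collection` again with the same path reuses the stored ID, so the retry targets the same collection job rather than claiming a second batch.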

cjpatton commented 6 months ago

SGTM, please send a PR :)