open-telemetry / opentelemetry-specification

Adopting DDSketch as the default ValueRecorder aggregation #919

Closed jmacd closed 4 years ago

jmacd commented 4 years ago

Issue #636 describes a number of considerations surrounding the choice of a default aggregation for ValueRecorder instruments, where "Last Value", "Histogram", "MinMaxSumCount", "Sketch", and "Exact" are the choices. Prior draft specifications stated MinMaxSumCount as the default, which corresponds to a Prometheus Summary (Sum, Count) containing the p0 and p100 percentiles (a.k.a. Min and Max). This is a trivially mergeable summary, but not always the most desired format.
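
To illustrate why that summary is trivially mergeable, here is a minimal sketch (the type and function names are hypothetical, not the SDK's): two MinMaxSumCount summaries combine with no loss of information.

package main

import (
	"fmt"
	"math"
)

// MinMaxSumCount is a hypothetical rendering of the aggregation state.
type MinMaxSumCount struct {
	Min, Max, Sum float64
	Count         int64
}

// merge combines two summaries exactly; no approximation is involved.
func merge(a, b MinMaxSumCount) MinMaxSumCount {
	return MinMaxSumCount{
		Min:   math.Min(a.Min, b.Min),
		Max:   math.Max(a.Max, b.Max),
		Sum:   a.Sum + b.Sum,
		Count: a.Count + b.Count,
	}
}

func main() {
	fmt.Println(merge(
		MinMaxSumCount{Min: 1, Max: 5, Sum: 9, Count: 3},
		MinMaxSumCount{Min: 0.5, Max: 2, Sum: 2.5, Count: 2},
	))
}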

A leading option that has wide support is the use of DDSketch, which has a number of DataDog open-source implementations (e.g., Golang, Python, Java).

See the protocol definition used in the DataDog agent. This is (unfortunately) not commented. This issue proposes that we settle the open question in #636 by specifying DDSketch as the default aggregation. To make this happen, ideally a contributor from DataDog would submit a pull request to https://github.com/open-telemetry/opentelemetry-proto with a documented protocol for DDSketch aggregations (AI: @mikezvi find us contributor(s)?).

Reviewing the current agent payload protocol used by DataDog suggests that the protocol makes assumptions about the configuration of the algorithm. Compare and contrast the implementation used by the DD agent:

https://github.com/DataDog/datadog-agent/blob/master/pkg/quantile/config.go

and the "clean" implementation here:

https://github.com/DataDog/sketches-go/tree/master/ddsketch

We see different default configurations. In the agent:

const (
    defaultBinLimit = 4096
    defaultEps      = 1.0 / 128.0
    defaultMin      = 1e-9
)

and in the separate implementation:

const (
    defaultMaxNumBins = 2048
    defaultAlpha      = 0.01
    defaultMinValue   = 1.0e-9
)

The paper states, "As an example, for α = 0.01, a sketch of size 2048 can handle values from 80 microseconds to 1 year, and cover all quantiles." I studied this for a while and had trouble understanding how to get from the settings above to either the 80-microsecond or the 1-year figure. Note in particular that the paper does not discuss the defaultMinValue setting, and that the code contains a concept of "key offset" that is also not described in the paper. DataDog (@mikezvi), I would like your help clarifying some of these aspects and, ideally, publishing more exact pseudocode descriptions of the actual algorithm used. Simply stated: the pseudocode in the paper does not closely match your real code. For example, Algorithm 3 in the paper is under 10 lines of pseudocode and uses a linear scan, whereas it is actually implemented by helpers named growLeft and growRight (~80 lines of code). These helpers are not symmetric, not documented, and not described in the paper. I'd like to see more.
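
For reference, here is the arithmetic I can reconstruct from the paper's basic mapping alone. This is a hedged sketch: it ignores the agent's "key offset" and the defaultMinValue handling, which is exactly the part that is unclear to me.

package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		alpha   = 0.01 // target relative accuracy
		maxBins = 2048
	)
	// The paper's logarithmic mapping: gamma = (1+alpha)/(1-alpha), and a
	// value x > 0 falls into bucket ceil(log_gamma(x)).
	gamma := (1 + alpha) / (1 - alpha)
	logGamma := math.Log(gamma)

	// m contiguous buckets cover a max/min value ratio of gamma^m.
	fmt.Printf("gamma = %.6f, gamma^%d = %.3g\n",
		gamma, maxBins, math.Exp(maxBins*logGamma))

	// Buckets needed to span 80 microseconds to 1 year (in seconds):
	ratio := 365.25 * 24 * 3600 / 80e-6
	fmt.Printf("1yr/80us = %.3g -> %.0f buckets\n",
		ratio, math.Ceil(math.Log(ratio)/logGamma))
}

By this naive reading, 2048 buckets at α = 0.01 cover a max/min ratio of about 6e17, while 1 year over 80 microseconds is a ratio of only about 4e11 (roughly 1335 buckets), so I cannot see how those two endpoints fall out of the settings above.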

Assuming we can resolve these issues (which I'm confident we can), the DDSketch algorithm looks like a winner. I believe the DD agent-payload protocol linked above assumes that the default values are fixed, not variable, since they are not included in the protocol. Based on this, a rough draft for the documented protocol might look like:

message DDSketchDataPoint {
    // Standard OTel Metrics data point fields:
    //   Labels = 1
    //   StartTimeUnixNanos = 2
    //   TimeUnixNanos = 3

    // These four fields carry the MinMaxSumCount aggregation.

    // cnt is the number of points described in this sketch.
    int64 cnt = 4;
    // min is the smallest value among points described in this sketch.
    double min = 5;
    // max is the largest value among points described in this sketch.
    double max = 6;
    // sum is the sum of values among points described in this sketch.
    double sum = 7;

    // Sketch configuration: AI(DataDog) write lengthy comments here including
    // the equations to map from `k` values below into bin boundaries.
    double minValue = 8;   // default is 1e-9
    int64 maxBins = 9;     // default is 2048
    double alpha = 10;     // default is 0.01

    // k are the key values used in the equations above to calculate boundaries
    // of the bins representing this sketch.  Note these are signed and use
    // zigzag encoding.
    repeated sint32 k = 11;

    // n are the counts associated with the bins whose boundaries are set by k.
    repeated int64 n = 12;

    // TODO(bogdandrutu) maybe argue for fixed64 instead of int64 above?
}

As you can see, I added three fields to express the configuration parameters used in the sketch. I have concerns about representing these values directly, though, since they do not clearly map onto the minimum and maximum values expressible by the sketch (i.e., the connection between these parameters and the 80-microsecond and 1-year figures is not clear, and it's those figures that users actually care about).

Resolves #636. Closes https://github.com/open-telemetry/oteps/pull/117. Closes https://github.com/open-telemetry/oteps/pull/118.

tsloughter commented 4 years ago

Is there a C or Rust implementation?

I still wouldn't want it to be the default in Erlang/Elixir, since it would require additional dependencies to build, and a user may just want the SDK for tracing. I'm not even sure there is an implementation available to use from Erlang/Elixir with native bindings.

jmacd commented 4 years ago

There is a Rust implementation derived from the DD-Go implementation: https://github.com/mheffner/rust-sketches-ddsketch

The OTel-C++ repository has an implementation by @ankit-bhargava that (I believe) is written from the paper, not derived from one of the DD implementations. This may highlight the point I made above: the pseudocode in the paper does not match the DD implementations, and it's not clear what differences that implies.

In the call today I mentioned that not all languages will have a DDSketch implementation. We could:

  1. settle on recommending either DDSketch (DataDog defaults: 2048, 0.01, 1e-9) or Histogram with fixed boundaries (the Prometheus default buckets, ranging from 0.005 to 10)
  2. settle on recommending RAW values when DDSketch is not available. Presume that an OTel collector will be able to receive raw values and compute a DDSketch on behalf of the application.

CharlesMasson commented 4 years ago

One of the paper's authors here. I'll add some comments to the discussion to hopefully help the decision.

There are a few divergences between our implementations. The sketches-java implementation is the latest, most robust and well-tested one, and it is currently being used in production at Datadog. The sketches-py and sketches-go implementations, along with the agent implementation, came before the java one and we will be updating them in the near future.

Most notably there are divergences in the mappings between values and bucket indexes. However, we can reconcile all of them by using the following mapping from the Java implementation. Only two parameters are necessary: logGamma (or possibly gamma=exp(logGamma) itself) and indexOffset (not necessarily an integer). Note that we can get the sketch relative accuracy alpha from gamma: alpha = (gamma-1)/(gamma+1).

double gamma = 8;
double indexOffset = 9;
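
For concreteness, here is roughly how a consumer could regenerate bucket boundaries and representative values from gamma and indexOffset alone. This is a sketch assuming the ceil-based logarithmic mapping, index(x) = ceil(log(x)/logGamma + indexOffset); see sketches-java for the exact formulas:

package main

import (
	"fmt"
	"math"
)

// bucketBounds returns the value range (lo, hi] covered by bucket index i.
func bucketBounds(i int, gamma, indexOffset float64) (lo, hi float64) {
	hi = math.Pow(gamma, float64(i)-indexOffset)
	return hi / gamma, hi
}

// bucketValue returns a representative value whose relative error against
// any value in the bucket is at most alpha = (gamma-1)/(gamma+1).
func bucketValue(i int, gamma, indexOffset float64) float64 {
	return 2 * math.Pow(gamma, float64(i)-indexOffset) / (gamma + 1)
}

func main() {
	gamma := 1.01 / 0.99 // alpha = 0.01
	lo, hi := bucketBounds(100, gamma, 0)
	fmt.Printf("bucket 100: (%g, %g], value %g\n", lo, hi, bucketValue(100, gamma, 0))
}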

The other parameters mentioned above (minValue and maxBins) are not necessary to regenerate quantiles, and could be seen as parameters of the implementations of DDSketch, for instance to bound the memory size of the sketch. The only reason I see for including such parameters is to know whether the sketch has collapsed buckets, and therefore whether some returned quantiles do not meet the relative accuracy guarantee. We may want to consider more universal parameters, though (e.g., the range of valid/non-collapsed values).

As for the bucket counts (which are basically an associative array that maps bucket indexes to the number of values that belong to the buckets), the encoding suggested above works (I believe a map would work as well).

If data size matters, I'd suggest adding to the protocol an additional way of encoding the counts with an offset and an "array", given that in practice most non-empty buckets are contiguous. That would allow storing the counts of contiguous non-empty buckets more efficiently, while also using the bucket-indexed encoding for the sparse non-empty buckets.

repeated sint32 bucketIndexes = 7;
repeated int64 bucketCounts = 8;
sint32 contiguousBucketIndexOffset = 9;
repeated int64 contiguousBucketCounts = 10;

As an example, 12 -> 3, 13 -> 4, 14 -> 2, 18 -> 1 would be optimally encoded as:

bucketIndexes = [18]
bucketCounts = [1]
contiguousBucketIndexOffset = 12
contiguousBucketCounts = [3, 4, 2]

Instead of the more costly alternative:

bucketIndexes = [12, 13, 14, 18]
bucketCounts = [3, 4, 2, 1]

For cases where a bucket would happen to be encoded both in the contiguous way (contiguousBucketCounts) and in the sparse way (bucketCounts), we can unambiguously consider that the final bucket count is the sum of both.
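
As a sketch of the decoding side (the helper below is hypothetical, not part of any existing implementation), merging the two encodings under that summing rule could look like:

package main

import "fmt"

// decodeCounts rebuilds the bucket-index-to-count mapping from both the
// sparse and the contiguous encodings; indexes present in both are summed.
func decodeCounts(bucketIndexes []int32, bucketCounts []int64,
	contiguousBucketIndexOffset int32, contiguousBucketCounts []int64) map[int32]int64 {
	counts := make(map[int32]int64)
	for i, idx := range bucketIndexes {
		counts[idx] += bucketCounts[i]
	}
	for i, c := range contiguousBucketCounts {
		counts[contiguousBucketIndexOffset+int32(i)] += c
	}
	return counts
}

func main() {
	// The example above: 12 -> 3, 13 -> 4, 14 -> 2, 18 -> 1.
	fmt.Println(decodeCounts([]int32{18}, []int64{1}, 12, []int64{3, 4, 2}))
}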

The above version allows keeping track of positive values. We also need a specific counter for 0, as well as the same set of four fields for negative values, if we also want to track them (see the signed version of DDSketch in Java):

DDSketchStore positiveValues = 7;
DDSketchStore negativeValues = 8;
int64 zeroCount = 9;

with:

message DDSketchStore {
    repeated sint32 bucketIndexes = 1;
    repeated int64 bucketCounts = 2;
    sint32 contiguousBucketIndexOffset = 3;
    repeated int64 contiguousBucketCounts = 4;
}

As for the count type, if all we need is to compute quantiles, note that we don't actually need integer precision, nor do we need 64-bit float precision, so I believe we could in theory go for float, which is smaller. That may cause some challenges, though (e.g., all implementations use integer counts, so what happens if the float value is too large to be encoded as an integer?).
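
As a hedged illustration of that tradeoff: a 32-bit float represents integers exactly only up to 2^24, so sufficiently large counts would silently start being approximated.

package main

import "fmt"

func main() {
	// 2^24 is the limit of exact integer representation in float32;
	// one past it is lost to rounding.
	var count float32 = 1 << 24
	fmt.Println(count+1 == count) // prints true
}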

arminru commented 4 years ago

@jmacd Would it really make sense to have the spec specify an aggregator as the default for GA that has only been implemented in one of our SIGs so far (if I understand the discussion correctly)? It looks like most SIGs would end up using the fallbacks you mentioned above rather than being aligned, right?

jmacd commented 4 years ago

@CharlesMasson thank you for the detailed explanation. To @arminru's point, in today's SIG I agreed to write up what it would take to make DDSketch the default before GA. We have to admit that this protocol is relatively complicated and that some SDKs will not be able to adopt it quickly.

There is a lot of enthusiasm for DDSketch in general, which has us leaning toward adopting it if the necessary code can be produced quickly; @mikezvi, will DD commit these resources?

Using the protocol you last described for reference,

message DDSketchPoint {
    // 3 usual OTLP fields

    double gamma = 4;
    DDSketchBuckets positiveValues = 5;
    DDSketchBuckets negativeValues = 6;
    ????? zeroCount = 7;
}

message DDSketchBuckets {
    repeated sint32 bucketIndexes = 1;
    repeated ????? bucketCounts = 2;
    sint32 contiguousBucketIndexOffset = 3;
    repeated ????? contiguousBucketCounts = 4;
}

I have a question or two about the non-contiguous buckets and what we may or may not know about the relative accuracy guarantee, but I think I'll reread the paper before asking them 😀. I also wonder about the zero count, which is a zero-width bucket, and whether it will create trouble for some implementations.

For your question about count types (the ????? above), you're right that floating-point values can create trouble, because a lot of histogram-receiving code expects integers. There have been discussions elsewhere about representing counts as floating-point types, since we have metrics formats with sample-rate parameters, so I'd say it's normal to have non-integer counts. Presently we have fixed64 counts, and I'm not sure there is an appetite for less precision.

OTLP in general does not (yet) have a way to represent sampled data counts, though. Issue https://github.com/open-telemetry/opentelemetry-proto/issues/73 discusses this topic (also https://github.com/open-telemetry/opentelemetry-proto/pull/159).

As for the list of steps required, here's what I came up with:

  1. The DDSketch protocol, loosely described above, would need a very clear specification, including all the math formulas relating gamma to bucket boundaries, and all of the details you discussed above
  2. A README.md document giving pseudo-code for the actual algorithm currently in the DDSketch-java implementation
  3. DDSketch libraries for the five GA languages (@mtwo please list these again, notes from today's meeting aren't complete) that can output these protos
  4. Sketch Aggregators for the OTel metrics SDKs in the five GA languages
  5. OTLP Exporter code support in five GA languages
  6. OTel Collector shims for access to sketch data points
  7. OTel Collector exporter(s): Prometheus remote write, maybe others.

Note: these are loosely ordered. Steps 1 and 2 are prerequisites for 6 and 7, and a lot of other SDK work has to happen before steps 4 and 5.

jmacd commented 4 years ago

Note that there are independent questions about this data point:

arminru commented 4 years ago

@jmacd Note that there are also well-established histograms like HdrHistogram, which comes with a lot of implementations in different languages, some of which are already used in production (for example in Elastic), and which might be a better option in terms of GA-readiness. I understand, however, that there are certain weaknesses with HdrHistogram that DDSketch addressed. Circllhist (C impl, JS impl) and UDDSketch (apparently an advancement of DDSketch, with a C impl) are some other alternatives that I came across.

We are using a histogram implementation called Dynamic Histogram that also solves certain shortcomings of HdrHistogram (discussed here) but, just as DDSketch, only has a Java implementation used in production. You might want to take a look at how negative values are handled there and how efficient storage and serialization are solved, as it differs from other implementations from what I can tell.

I'm not familiar with either of the options and I'm not a metrics or stats expert at all, but I could ask around internally if further information on DynaHist would be helpful. For the scope of GA, however, HdrHistogram could be a viable compromise given the variety of existing implementations.

jmacd commented 4 years ago

@arminru Yes, it's starting to look like there are many variations on histograms that use variable boundaries and a compressible representation. This suggests we focus on more flexible ways to encode bucket boundaries, but that otherwise the proposed DDSketch data point is no different from a Histogram data point.

jmacd commented 4 years ago

I will open a new proposal along the lines noted above.