Closed: jmacd closed this issue 4 years ago
Is there a C or Rust implementation?
I still wouldn't want it to be the default in Erlang/Elixir, since it would require additional dependencies to build, and a user may want only the SDK for tracing. I'm also not sure there is an implementation with native bindings available to use from Erlang/Elixir.
There is a Rust implementation derived from the DD-Go implementation: https://github.com/mheffner/rust-sketches-ddsketch
The OTel-C++ repository has an implementation by @ankit-bhargava that (I believe) is written from the paper, not derived from one of the DD implementations. This may highlight the point I made above: the pseudocode in the paper does not match the DD implementations, and it's not clear what differences that implies.
In the call today I mentioned that not all languages will have a DDSketch implementation. We could:
One of the paper's authors here. I'll add some comments to the discussion to hopefully help the decision.
There are a few divergences between our implementations. The `sketches-java` implementation is the latest, most robust, and best-tested one, and it is currently being used in production at Datadog. The `sketches-py` and `sketches-go` implementations, along with the agent implementation, came before the Java one, and we will be updating them in the near future.
Most notably, there are divergences in the mappings between values and bucket indexes. However, we can reconcile all of them by using the following mapping from the Java implementation. Only two parameters are necessary: `logGamma` (or possibly `gamma = exp(logGamma)` itself) and `indexOffset` (not necessarily an integer). Note that we can derive the sketch relative accuracy `alpha` from `gamma`: `alpha = (gamma - 1) / (gamma + 1)`.
```proto
double gamma = 8;
double indexOffset = 9;
```
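To make the mapping concrete, here is a minimal Python sketch of the logarithmic value-to-index mapping described above. The function names are mine, and `indexOffset` is assumed to shift the index inside the `ceil` (one of several ways the log-like mappings apply it); treat this as an illustration, not the canonical implementation:

```python
import math

def value_to_index(value, gamma, index_offset=0.0):
    # Bucket i covers values in (gamma^(i-1), gamma^i]; the offset
    # shifts the index and need not be an integer.
    return math.ceil(math.log(value, gamma) + index_offset)

def index_to_value(index, gamma, index_offset=0.0):
    # The estimate 2 * gamma^i / (gamma + 1) keeps the relative error
    # within alpha for any value mapped to bucket i.
    return 2.0 * gamma ** (index - index_offset) / (gamma + 1.0)

def relative_accuracy(gamma):
    # alpha = (gamma - 1) / (gamma + 1), as noted above
    return (gamma - 1.0) / (gamma + 1.0)
```

For example, with `alpha = 0.01` we have `gamma = (1 + alpha) / (1 - alpha) ≈ 1.0202`, and any positive value is reconstructed from its bucket index with at most 1% relative error.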
The other parameters mentioned above (`minValue` and `maxBins`) are not necessary to regenerate quantiles, and could be seen as parameters of the implementations of DDSketch, for instance to bound the memory size of the sketch. The only reason I see for which we may want to include such parameters is to know if the sketch has collapsed buckets, and therefore if some returned quantiles do not meet the relative accuracy guarantee. We may want to consider more universal parameters though (e.g., the range of valid/non-collapsed values).
As for the bucket counts (which are basically an associative array that maps bucket indexes to the number of values that belong to the buckets), the encoding suggested above works (I believe a `map` would work as well).
If data size matters, I'd suggest adding to the protocol an additional way of encoding the counts with an offset and an "array", given that in practice most non-empty buckets are contiguous. That would allow storing the counts of contiguous non-empty buckets more efficiently, while also using the bucket-indexed encoding for the sparse non-empty buckets.
```proto
repeated sint32 bucketIndexes = 7;
repeated int64 bucketCounts = 8;
sint32 contiguousBucketIndexOffset = 9;
repeated int64 contiguousBucketCounts = 10;
```
As an example, the counts `12 -> 3, 13 -> 4, 14 -> 2, 18 -> 1` would be optimally encoded as:

```
bucketIndexes = [18]
bucketCounts = [1]
contiguousBucketIndexOffset = 12
contiguousBucketCounts = [3, 4, 2]
```
instead of the more costly alternative:

```
bucketIndexes = [12, 13, 14, 18]
bucketCounts = [3, 4, 2, 1]
```
For cases where a bucket happens to be encoded both in the contiguous way (`contiguousBucketCounts`) and in the sparse way (`bucketCounts`), we can unambiguously consider that the final bucket count is the sum of both.
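As a sanity check on that rule, here is a small Python sketch (the function name is mine) that reconstructs the bucket-index-to-count map from the four fields, summing counts when a bucket appears in both encodings:

```python
def decode_bucket_counts(bucket_indexes, bucket_counts,
                         contiguous_offset, contiguous_counts):
    # Rebuild {bucket index: count} from the sparse and contiguous
    # encodings; a bucket present in both contributes the sum of both.
    counts = {}
    for index, count in zip(bucket_indexes, bucket_counts):
        counts[index] = counts.get(index, 0) + count
    for k, count in enumerate(contiguous_counts):
        index = contiguous_offset + k
        counts[index] = counts.get(index, 0) + count
    return counts
```

With the example above, `decode_bucket_counts([18], [1], 12, [3, 4, 2])` yields `{18: 1, 12: 3, 13: 4, 14: 2}`.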
The above version allows keeping track of positive values. We also need a specific counter for 0, as well as the same set of four fields for negative values, if we also want to track them (see the signed version of DDSketch in Java):
```proto
DDSketchStore positiveValues = 7;
DDSketchStore negativeValues = 8;
int64 zeroCount = 9;
```
with:
```proto
message DDSketchStore {
  repeated sint32 bucketIndexes = 1;
  repeated int64 bucketCounts = 2;
  sint32 contiguousBucketIndexOffset = 3;
  repeated int64 contiguousBucketCounts = 4;
}
```
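To illustrate how the three parts combine at query time, here is a hedged Python sketch (names and bucket convention are mine, with `indexOffset` assumed to be 0) that walks the negative buckets, the zero counter, and the positive buckets to answer a quantile query:

```python
def ddsketch_quantile(q, negative, zero_count, positive, gamma):
    # negative / positive: {bucket index: count}, where bucket i holds
    # values v with gamma^(i-1) < |v| <= gamma^i (indexOffset taken as 0).
    total = sum(negative.values()) + zero_count + sum(positive.values())
    rank = q * (total - 1)
    seen = 0
    # Negative values come first, from most negative (largest |v|,
    # i.e. largest bucket index) to least negative.
    for i in sorted(negative, reverse=True):
        seen += negative[i]
        if seen > rank:
            return -2.0 * gamma ** i / (gamma + 1.0)
    seen += zero_count
    if seen > rank:
        return 0.0
    for i in sorted(positive):
        seen += positive[i]
        if seen > rank:
            return 2.0 * gamma ** i / (gamma + 1.0)
```

The returned estimate `2 * gamma^i / (gamma + 1)` is within relative accuracy `alpha = (gamma - 1) / (gamma + 1)` of the true value in that bucket.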
As for the count type, if all we need is to compute quantiles, note that we don't actually need integer precision, nor do we need 64-bit float precision, so I believe we could in theory go for `float`, which is smaller. That may cause some challenges, though (e.g., all implementations use integer counts, so what if the `float` value is too large to be encoded as an integer?).
@jmacd Would it really make sense to have the spec specify as the GA default an aggregator that has only been implemented in one of our SIGs so far (if I understand the discussion correctly)? It looks like most SIGs would end up using the fallbacks you mentioned above rather than being aligned, right?
@CharlesMasson thank you for the detailed explanation. To @arminru's point, in today's SIG I agreed to write up what it would take to make DDSketch the default before GA. We have to admit that this protocol is relatively complicated and that some SDKs will not be able to adopt it quickly.
There is a lot of enthusiasm for DDSketch in general, which has us leaning toward adopting it if the necessary code can be produced quickly, a question for @mikezvi if DD will commit these resources.
Using the protocol you described last for reference:

```proto
message DDSketchPoint {
  // 3 usual OTLP fields
  double gamma = 4;
  DDSketchBuckets positiveValues = 5;
  DDSketchBuckets negativeValues = 6;
  ????? zeroCount = 7;
}
```
```proto
message DDSketchBuckets {
  repeated sint32 bucketIndexes = 1;
  repeated ????? bucketCounts = 2;
  sint32 contiguousBucketIndexOffset = 3;
  repeated ?????? contiguousBucketCounts = 4;
}
```
I have a question or two about the non-contiguous buckets and what we may or may not know about the relative accuracy guarantee, but I think I'll reread the paper before asking them 😀. I also wonder about the zero count, which is a zero-width bucket, and whether it will create trouble for some implementations.
For your question about count types (the `?????` above), you're right that floating-point values can create trouble, because a lot of histogram-receiving code expects integers. There have been discussions elsewhere about representing counts as floating-point types, since we have metrics formats with sample-rate parameters; I'd say it's normal to have non-integer counts. Presently we have `fixed64` counts--I'm not sure there is an appetite for less precision.
OTLP in general does not (yet) have a way to represent sampled data counts, though. Issue https://github.com/open-telemetry/opentelemetry-proto/issues/73 discusses this topic (also https://github.com/open-telemetry/opentelemetry-proto/pull/159).
As for the list of steps required, here's what I came up with:
Note: these are loosely ordered. 1 and 2 are prerequisites for 6 and 7, and lot of other SDK work has to happen before steps 4 and 5.
Note that there are independent questions about this data point:
@jmacd Note that there are also well-established histograms like HdrHistogram, which comes with implementations in many languages, some of which are already used in production (for example in Elastic) and might be a better option in terms of GA-readiness. I understand, however, that there are certain weaknesses of HdrHistogram that DDSketch addressed. Circllhist (C impl, JS impl) and UDDSketch (apparently an advancement of DDSketch, with a C impl) are some other alternatives that I came across.

We are using a histogram implementation called Dynamic Histogram that also solves certain shortcomings of HdrHistogram (discussed here) but, just as DDSketch, only has a Java implementation used in production. You might want to take a look at how negative values are handled there and how efficient storage and serialization are solved, as it differs from other implementations from what I can tell. I'm not familiar with either of the options and I'm not a metrics or stats expert at all, but I could ask around internally if further information on DynaHist is desired.

For the scope of GA, however, HdrHistogram could be a viable compromise given the variety of existing implementations.
@arminru Yes, it's starting to look like there are many variations on histograms that use variable boundaries and a compressible representation. This suggests we focus on more flexible ways to encode bucket boundaries, but that otherwise the proposed DDSketch data point is no different from a Histogram data point.
I will open a new proposal along the lines noted above.
Issue #636 describes a number of considerations surrounding the choice of a default aggregation for `ValueRecorder` instruments, where "Last Value", "Histogram", "MinMaxSumCount", "Sketch", and "Exact" are the choices. Prior draft specifications have stated MinMaxSumCount as the default, which corresponds to a Prometheus Summary (Sum, Count) containing the p0 and p100 percentiles (a.k.a. Min and Max)--this is a trivially mergeable summary, but not always the most desired format.

A leading option that has wide support is the use of DDSketch, which has a number of DataDog open-source implementations (e.g., Golang, Python, Java). See the protocol definition used in the DataDog agent; it is (unfortunately) not commented.

This issue proposes that we settle the open question in #636 as proposed, by specifying DDSketch as the default aggregation. To make this happen, ideally a contributor from DataDog would make a pull request to https://github.com/open-telemetry/opentelemetry-proto with a documented protocol (AI: @mikezvi find us contributor(s)?) for DDSketch aggregations.
Reviewing the current agent payload protocol used by DataDog suggests that the protocol makes assumptions about the configuration of the algorithm. Compare and contrast the implementation used by the DD agent:
https://github.com/DataDog/datadog-agent/blob/master/pkg/quantile/config.go
and the "clean" implementation here:
https://github.com/DataDog/sketches-go/tree/master/ddsketch
We see different default configurations. In the agent:
and in the separate implementation:
The paper states: "As an example, for α = 0.01, a sketch of size 2048 can handle values from 80 microseconds to 1 year, and cover all quantiles." I studied this for a while and had trouble understanding how to get from the settings above to either the 80-microseconds or the 1-year figure--note in particular that the paper does not discuss the `defaultMinValue` setting, and that the code contains a concept of "key offset" that is also not included in the paper.

DataDog (@mikezvi), I would like your help clarifying some of these aspects and, ideally, publishing more exact pseudocode descriptions of the actual algorithm used. Simply stated: your pseudocode in the paper does not closely match your real code. For example, Algorithm 3 in the paper is under 10 lines of pseudocode and uses a linear scan, whereas it is actually implemented by helpers named `growLeft` and `growRight` (~80 lines of code). These helpers are not symmetric, not documented, and not described in the paper. I'd like to see more.

Assuming we can resolve these issues (which I'm confident we can), the DDSketch algorithm looks like a winner. I believe that the DD agent-payload protocol linked above assumes the default values are fixed, not variable, since they are not included in the protocol. Based on this, a rough draft for the documented protocol might look like:
As you can see, I added three fields to express the configuration parameters used in the sketch. I have concerns about representing these values directly, though, since they do not clearly map onto the minimum and maximum values expressible by the sketch (i.e., the connection between these parameters and the 80-microseconds and 1-year figures is not clear, and it's those figures that users actually care about).
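One way to sanity-check the 80-microseconds-to-1-year claim, under the assumption that a sketch of m logarithmic buckets covers a dynamic range of gamma^m (ignoring any key offset or minimum-value clamping the agent may apply), is the following Python snippet (the function name is mine):

```python
def coverable_ratio(alpha, max_bins):
    # A sketch with max_bins logarithmic buckets can cover values whose
    # max/min ratio is at most gamma^max_bins, where
    # gamma = (1 + alpha) / (1 - alpha).
    gamma = (1.0 + alpha) / (1.0 - alpha)
    return gamma ** max_bins

# Dynamic range needed to span 80 microseconds .. 1 year:
needed = (365.0 * 24 * 3600) / 80e-6   # roughly 4e11

# With alpha = 0.01 and 2048 buckets, gamma^2048 is on the order of
# 1e17, so the claimed range fits with room to spare.
```

If `defaultMinValue` and the "key offset" change this accounting, that is exactly the kind of detail a documented protocol should spell out.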
Resolves #636. Closes https://github.com/open-telemetry/oteps/pull/117. Closes https://github.com/open-telemetry/oteps/pull/118.