tdunning / t-digest

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means
Apache License 2.0

Definition of rank (CDF) #103

Open AlexanderSaydakov opened 6 years ago

AlexanderSaydakov commented 6 years ago

In the TDigest class I see that cdf(x) is described as follows: returns the fraction of all points added which are <= x.

So if one value is added, its rank must be 1. I am using TestNG:

  import org.testng.Assert;
  import org.testng.annotations.Test;
  import com.tdunning.math.stats.TDigest;

  @Test
  public void oneValue() {
    TDigest t = TDigest.createDigest(100);
    t.add(0);
    Assert.assertEquals(t.cdf(0), 1.0); // java.lang.AssertionError: expected [1.0] but found [0.5]
  }

  @Test
  public void twoValues() {
    TDigest t = TDigest.createDigest(100);
    t.add(0);
    t.add(1);
    Assert.assertEquals(t.cdf(0), 0.5); // java.lang.AssertionError: expected [0.5] but found [0.0]
    Assert.assertEquals(t.cdf(1), 1.0); // this is correct
  }
tdunning commented 3 years ago

Thank you for spotting this.

I believe that this was largely bad documentation in the javadoc. This also came up in a recent (not friendly) evaluation published in the Apache DataSketches documentation.

The intention was to use a mid-point rule definition. Under this rule, cdf(x) is the number of data points < x plus half the number equal to x, divided by the total count. In the latest code, there is a version of Dist.cdf that lets you set that weighting of ties anywhere from 0 to 1.
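
For concreteness, here is a minimal sketch of that rule computed directly from a raw sample. The method name and the brute-force loop are illustrative only, not part of the t-digest API (the Dist.cdf helper in the t-digest test code plays this reference role):

    // Mid-point rule: count of points strictly below x, plus half the count
    // equal to x, divided by the total number of points.
    static double midPointCdf(double x, double[] data) {
        int below = 0, equal = 0;
        for (double v : data) {
            if (v < x) below++;
            else if (v == x) equal++;
        }
        return (below + 0.5 * equal) / data.length;
    }

Under this rule a single added value has cdf 0.5 at that value, which matches the 0.5 reported in the first test above.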

I am closing this now in anticipation of the 3.3 release, but will re-open if you think there is still a problem.

leerho commented 3 years ago

@tdunning We would like to see this (not friendly) evaluation of the Apache DataSketches documentation. Criticisms are welcome; we always want to improve our documentation.

Thanks, Lee.

tdunning commented 3 years ago

Lee,

Glad to hear from you.

See https://datasketches.apache.org/docs/QuantilesStudies/KllSketchVsTDigest.html

On that page, there are a number of inaccuracies that could have been repaired with a bit of dialog or even just a bit of googling. For instance, at the beginning there is this:

t-digest code as of Sep 14, 2017 (there is no published jar)

In fact, as a quick search shows, the first published jar was from 2014 and there were a number of releases between that and 2017.

Next, there is the comment:

It is not clear what rule t-digest uses.

In fact, there is extensive documentation of the interpolation scheme used. See, for instance, this diagram, which illustrates how non-singleton centroids are handled, or this figure, which illustrates how singletons and end-points are handled (and which clearly implies a mid-point rule). In addition, the test cases use Dist.quantile() and Dist.cdf() as a reference, and both of these make clear that a mid-point rule is used.

Next, when accuracy is examined, it appears that only the median is considered. This makes sense from the point of view of the KLL sketch, since its performance is so abysmal near the tails (by design, KLL guarantees absolute, not relative, error). But even so, the comparison appears to be subject to errors that can be mitigated by dithering the number of samples. Even if the comparison were updated to use the REQ sketch, the relative performance would still be poor on average and the size would be unbounded.

The accuracy comparison also exhibits clear problems. For a value of δ = 200, the t-digest will always retain at least 100 samples and thus will show zero error for cdf() and quantile(). The test case on the comparison page is clearly buggy. Here is a corrected version of those accuracy slides (just t-digest) that shows the error in estimating the quantile of a uniform distribution as measured over 50 runs. As expected, the error is zero until over a hundred samples. The error is large (but much smaller than the comparison page shows) at small counts because of the transition from keeping all of the samples to keeping centroids of two samples. That transition makes it very hard to estimate quantiles accurately. Once the centroids in question have roughly 100 samples each, the interpolation becomes very accurate and so does the median estimate.

As I mentioned, the comparison only looks at the median. If you shift to a more extreme quantile, however, the t-digest performance improves even more. For the 0.001 quantile (or the equivalent 0.999 quantile) there is zero error all the way up to 10,000 samples and really tiny errors beyond that.

So, yeah, that comparison page was pretty poorly done and I would have been happy to help make it better.

AlexanderSaydakov commented 3 years ago

a number of inaccuracies that could have been repaired with a bit of dialog

This issue was opened (more than three years ago!) precisely to have that bit of dialog. I am glad that it is finally happening. We are happy to correct our approach. Based on our testing, assuming the mid-point rule did not seem to yield the expected results, as described and shown in the first graph.

tdunning commented 3 years ago

Assuming that somebody in the world would see a random JIRA on a project that they don't know about isn't much of a way to encourage dialog. My email is well known and available everywhere, and back in that time frame I was very well known in Apache. You could have reached out very easily.

As is very clear, your testing was buggy, as shown by the results I posted earlier today.

See com.tdunning.tdigest.quality.CompareKllTest#testMedianErrorVersusScale in the latest head version of t-digest for the code that produced these graphs.

AlexanderSaydakov commented 3 years ago

Assuming that somebody in the world would see a random JIRA on a project that they don't know about isn't much of a way to encourage dialog.

I am not sure I understand. What random JIRA? What way could be more direct than opening an issue in your personal repo with the code in question?

We will take a look at the details you posted. Thanks.

tdunning commented 3 years ago

OK. Sorry about the confusion there. I didn't understand that you were referring to this issue (even though that is exactly what you said ... my bad). I had thought that you were referring to some issue raised in the DataSketches project.

But I have to say that the following questions were not raised here:

tdunning commented 3 years ago

I would add that, while I didn't respond well to this issue, the test com.tdunning.math.stats.TDigestTest#singleSingleRange addresses this same point.

    @Test
    public void singleSingleRange() {
        TDigest digest = factory(100).create();
        digest.add(1);
        digest.add(2);
        digest.add(3);

        // verify the cdf is a step between singletons; under the mid-point rule
        // each singleton contributes half of its weight at its own value
        assertEquals(0.5 / 3.0, digest.cdf(1), 0);
        assertEquals(1 / 3.0, digest.cdf(1 + 1e-10), 0);
        assertEquals(1 / 3.0, digest.cdf(2 - 1e-10), 0);
        assertEquals(1.5 / 3.0, digest.cdf(2), 0);
        assertEquals(2 / 3.0, digest.cdf(2 + 1e-10), 0);
        assertEquals(2 / 3.0, digest.cdf(3 - 1e-10), 0);
        assertEquals(2.5 / 3.0, digest.cdf(3), 0);
        assertEquals(1.0, digest.cdf(3 + 1e-10), 0);
    }

Again, to give this trouble report the credit it deserves: I only added this test substantially after this issue was first raised, and t-digest would have been better had I (or anybody) dug into the issue sooner. I (we) could have done significantly better by responding in a more timely and attentive fashion.

AlexanderSaydakov commented 3 years ago

Next, when accuracy is examined, it appears that only the median is considered

This is not so. We consider all ranks. The testing method is as follows:

  1. for each trial generate a set of input values
  2. feed the input values to the sketch
  3. sort the input values to compute true ranks
  4. query the sketch with each value to get estimated rank
  5. keep max rank error per trial
  6. at the end of all trials get 99th percentile of max error
  7. plot that 99pct rank error (y axis) vs input size (x axis)

The code for these measurements is published. For t-digest we measured using all 3 rules (min, mid and max), so just ignore min and max if the mid-point rule is the right one. We are open to criticism of this method. We measured the latest code available at that time.
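
To make steps 1-5 concrete, a single trial might look like the sketch below. This is a simplified illustration with uniform random input, not the published characterization code; it uses the mid-point rule for the true rank and ignores ties, which are vanishingly unlikely with random doubles:

    import java.util.Arrays;
    import java.util.Random;
    import com.tdunning.math.stats.TDigest;

    public class RankErrorTrial {
        // One trial: feed n values into the sketch, compute true ranks from the
        // sorted input, and keep the largest rank error observed (steps 1-5).
        static double maxRankError(int n, double compression, Random rng) {
            double[] values = new double[n];
            TDigest digest = TDigest.createDigest(compression);
            for (int i = 0; i < n; i++) {
                values[i] = rng.nextDouble();
                digest.add(values[i]);
            }
            Arrays.sort(values);
            double maxErr = 0;
            for (int i = 0; i < n; i++) {
                double trueRank = (i + 0.5) / n;   // mid-point rule; ties ignored
                double estRank = digest.cdf(values[i]);
                maxErr = Math.max(maxErr, Math.abs(estRank - trueRank));
            }
            return maxErr;
        }
    }

Steps 6-7 would then take the 99th percentile of this per-trial maximum over all trials at each input size and plot it against the input size.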

tdunning commented 3 years ago

This is not so. We consider all ranks.

Since the error is largest at the median for the t-digest, this is equivalent to considering only the median.

Where an algorithm specifically targets lower error at certain quantiles, stratifying by quantile is a more direct comparison. For instance, in my previous response, I gave separate graphs for q=0.5 and for q=0.001.

This is very important in practice because it is very common for a distribution sketch to be used to estimate tails. For the t-digest, this prioritization of tail accuracy makes a difference of roughly 100x in the number of samples for which perfect accuracy is retained, and even where perfect accuracy is lost, the errors are reduced by many orders of magnitude.

AlexanderSaydakov commented 3 years ago

benchmarks that show about 50 ns amortized insertion times

At what input size? I am not sure if I understand your point here. Do you mean that you disagree with our speed measurements?

We measure like this:

  1. step through input sizes AKA stream length (say, from 1 to 10M)
  2. run many trials at each input size (usually decreasing - more trials at small input sizes)
  3. update the sketch with all input values, measuring the whole time
  4. keep the total update time across all trials
  5. divide by input size and number of trials to get the average time of one update at this input size
  6. plot this average time of one update (y axis) vs input size (x axis)

The plot was published on that page you linked for t-digest 100 and 200. The characterization code was published as well so that anyone can reproduce the measurements, review the method, and criticize it.
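
As a rough sketch of that loop (illustrative sizes and trial counts only, not the published characterization code):

    import java.util.Random;
    import com.tdunning.math.stats.TDigest;

    public class UpdateTimeSweep {
        public static void main(String[] args) {
            Random rng = new Random(42);
            for (int n = 1; n <= (1 << 23); n *= 2) {      // stream lengths from 1 to ~8M
                int trials = Math.max(1, (1 << 23) / n);   // more trials at small sizes
                long totalNanos = 0;
                for (int t = 0; t < trials; t++) {
                    TDigest digest = TDigest.createDigest(200);
                    long start = System.nanoTime();
                    for (int i = 0; i < n; i++) {
                        digest.add(rng.nextFloat());       // uniform random input
                    }
                    totalNanos += System.nanoTime() - start;
                }
                double nsPerUpdate = totalNanos / ((double) n * trials);
                System.out.printf("%d\t%.1f%n", n, nsPerUpdate);
            }
        }
    }
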
AlexanderSaydakov commented 3 years ago

Since the error is largest at the median for the t-digest, this is equivalent to considering only the median.

I think I see your objection. Perhaps we ought to reconsider our method with respect to accuracy measurements.

tdunning commented 3 years ago

Perhaps we ought to reconsider our method with respect to accuracy measurements.

I think that your method (max absolute error) is absolutely fine for evaluating the KLL by itself. But if you compare against a different system (such as REQ or t-digest), then that different system probably has different design goals and thus a different evaluation is probably necessary to highlight what the differences actually are and to determine whether they are important in any given application.

It is also probably reasonable to commit to updating comparisons at least as often as your own performance changes.

Specifically, KLL optimizes for worst-case absolute error. That's fine. The t-digest optimizes for a bounded and relatively small size with no dynamic allocation, while prioritizing tail accuracy. REQ gives up bounded size and allocation-free operation but focuses on accuracy guarantees at the tails.

It is fine to talk about absolute error. But if you are comparing these three systems, it is much better to also examine how errors look near the tails. It would be good to look at memory consumption and allocation. It would also be good to examine progressively more perverse data sources to highlight how the heuristic interpolation of t-digest begins to fail for data with bizarre characteristics while REQ is able to maintain its guaranteed accuracy.

The user can then decide whether memory size or tail accuracy or crazy data distributions are important to them.

The recent paper by Cormode and others, for instance, used a data distribution which might come up if you had a system with latencies that range from several times shorter than the time light takes to transit a proton all the way out to many orders of magnitude longer than the life of the universe so far. If your data involves that level of skew, then the t-digest would be a bad choice. On the other hand, if you want small memory size, only care about 10 orders of magnitude of skew, and value a long track record of use, you might make the opposite choice.

tdunning commented 3 years ago

benchmarks that show about 50 ns amortized insertion times

At what input size? I am not sure if I understand your point here. Do you mean that you disagree with our speed measurements?

We measure like this:

...

  1. run many trials at each input size (usually decreasing - more trials at small input size) ...
  2. keep total update time across all trials
  3. divide by input size and number of trials to get an average time of one update at this input size

So this is a classic sort of benchmarking approach which is unable to disentangle the benchmark from the thing being benchmarked.

See here for more thoughts on this:

https://medium.com/97-things/benchmarking-is-hard-jmh-helps-476a68a51167
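
For comparison, a minimal JMH harness for the single-update path might look like the sketch below. The class and parameter names here are mine, not from either project, and note that the digest keeps growing within an iteration, so even this score averages over many internal states:

    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;
    import com.tdunning.math.stats.TDigest;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Thread)
    public class TDigestAddBench {
        @Param({"100", "200"})
        double compression;

        TDigest digest;
        Random rng;

        @Setup(Level.Iteration)
        public void setup() {
            digest = TDigest.createDigest(compression);
            rng = new Random(42);
        }

        @Benchmark
        public void add() {
            // the digest fills up as the iteration proceeds, so the reported
            // average mixes cheap buffered adds with occasional merge passes
            digest.add(rng.nextDouble());
        }
    }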

AlexanderSaydakov commented 3 years ago

I don't see any basis for the statement that our benchmarking is flawed. We carefully construct our measurements, and we can see fine details of the behavior of particular algorithms. For instance, we see spikes at some expected transition points, and we see the effects of some improvements. Our measurements in Java are quite consistent with similar measurements in C++.

leerho commented 3 years ago

JMH is useful for performing benchmarks on relatively small chunks of code, but it is slow and cumbersome on larger, more complex code. The scores JMH reports are not useful for estimating actual speed performance at a system level; they are really only useful for comparison.

Also, many of our sketches are stochastic in behavior, and to reduce the noise produced by the inherent randomness, we often have to run many millions of trials, which can take a long time. Some of our studies can run for hours (and sometimes days!). Because JMH is so slow, doing this kind of analysis with it is totally impractical.

Most of our tests are what we call characterization tests, which are much more than just benchmarking. We are looking to understand many different aspects of behavior, not just a single speed score. From our data and graphs, we can identify important state transitions of the sketch that JMH would never reveal, and we can examine speed, accuracy and space consumed as a function of stream size and other dimensions.

In other words, we are not doing "classic benchmarking". Because we are running millions of trials over many different stream lengths, warm-up and HotSpot compilation are not an issue; we are doing measurements long after those processes have completed. We only perform measurements on a single thread, so multithreading is not an issue. When necessary, we will perform our measurements on a quiet machine (no internet interrupts, etc.) and also do profiling to make sure that GCs are not distorting the measurements. As a result, many of the characterization tests we have performed over the past few years have been very consistent and repeatable, with very small variance.

I agree with the article that benchmarking is hard -- and JMH helps in certain types of comparisons. But JMH is by no means the only way to do speed analysis. And JMH is not at all useful for accuracy, error distributions, space, state transitions and other types of characterizations.

Cheers, Lee.

AlexanderSaydakov commented 3 years ago
[plot: t-digest 200 update time]

Here is a fresh measurement of the update time of t-digest with compression=200, using uniform random input (Random.nextFloat).

tdunning commented 3 years ago

That is much closer to what I saw in my measurements.

AlexanderSaydakov commented 3 years ago

Much closer than what? This is quite consistent with the report on our web site from 3 years ago. It was a bit higher then - around 120 ns - but my laptop is now faster, Java has made some progress, and there are perhaps some other factors.

AlexanderSaydakov commented 3 years ago

And your code must have changed too. I don't see the problem with rank that we saw 3 years ago. Also, if you look closer at these update time plots, the transition to a more expensive regime used to happen at around 1000 updates, but the latest code makes this transition at around 2000 updates.

AlexanderSaydakov commented 3 years ago

@tdunning is this change of the transition point from 1000 to 2000 expected?

tdunning commented 3 years ago

It isn't surprising that this changed.

I did a fair bit of work adjusting the buffering strategy over the last few years. In particular, the strategy was changed to use a single unified buffer in the merge step, which roughly doubled the scale of that transition; that seems to correlate with what you saw.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/103#issuecomment-822696223, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6SPRZIH7GO4V5NY7DLTJR3LLANCNFSM4ESXF2WQ .