phadej / tdigest

On-line accumulation of rank-based statistics such as quantiles and trimmed means

Is the memory supposed to increase linearly with the inserts? #21

Closed · lorenzo closed this issue 7 years ago

lorenzo commented 7 years ago

I have an application that uses this library, and I have observed that memory usage increases linearly (though slowly) as I insert more data into the t-digest.

Is this expected? I thought that, being an on-line stats calculator, it would keep memory usage constant.
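
To illustrate what I mean by on-line use: I feed samples in one at a time as they arrive, roughly like the simplified sketch below (this assumes the Data.TDigest API with a type-level compression parameter; the names and numbers are illustrative, not my real code):

    {-# LANGUAGE DataKinds #-}
    import Data.List (foldl')
    import Data.TDigest (TDigest, tdigest, insert, quantile)

    -- Feed a stream of samples into a digest with compression 100;
    -- I expected the structure (and hence memory) to stay roughly constant.
    digest :: TDigest 100
    digest = foldl' (flip insert) (tdigest []) [1 .. 1000000 :: Double]

    main :: IO ()
    main = print (quantile 0.99 digest)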

phadej commented 7 years ago

I added the size of the TDigest structure to the tdigest-simple tester output. Running tdigest-simple -m tdigest -d normal -s <input size> +RTS -s gives:

        input  TDigest size  runtime & maximum residency (bytes)
      2000000           373
      4000000           388
      6000000           401
      8000000           413
     10000000           422
     20000000           439
     50000000           463
    100000000           478
    200000000           497    7 min, 496,016 max res
    400000000           523   15 min, 497,696 max res
    800000000           536   31 min, 498,536 max res (~400k samples per second)

This trend is not good (it keeps growing!). The limit is 1000, after which everything still works but becomes very slow. IIRC, if one changes q * (1 - q) to sqrt (q * (1 - q)) in the centroid size bound, we trade CPU (and some precision?) for stability. But I remember seeing that in some presentation, not in the paper.
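
A minimal sketch of the two bounds in question, under my reading of the paper; the names (weightLimit, delta for the compression parameter, n for the total sample count) are illustrative, not the library's actual internals:

    -- Illustrative sketch, not the library's code: the paper bounds a
    -- centroid's weight roughly in proportion to n * q * (1 - q).
    weightLimit :: Double -> Double -> Double -> Double
    weightLimit delta n q = 4 * n * delta * q * (1 - q)

    -- The sqrt variant: since sqrt x >= x for x in [0, 1], centroids
    -- (especially near the tails) may absorb more samples, which should
    -- stabilise the digest size at some cost in precision and CPU.
    weightLimitSqrt :: Double -> Double -> Double -> Double
    weightLimitSqrt delta n q = 4 * n * delta * sqrt (q * (1 - q))

    main :: IO ()
    main = mapM_ print
      [ (q, weightLimit 0.01 1e6 q, weightLimitSqrt 0.01 1e6 q)
      | q <- [0.01, 0.1, 0.5] ]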

FWIW, the paper has changed from https://github.com/tdunning/t-digest/blob/41d3eb3c03947f0d1a04c66ecfc943d04ae239fb/docs/t-digest-paper/histo.pdf to https://github.com/tdunning/t-digest/blob/07b8f2ca2be8d0a9f04df2feadad5ddc1bb73c88/docs/t-digest-paper/histo.pdf, and the algorithm has been tweaked considerably. I have to try the new variants.

In the meantime: if your app will only ever see on the order of 10^8 samples during its lifetime, you should be OK. If more, let's talk.

lorenzo commented 7 years ago

In the meantime: if your app will only ever see on the order of 10^8 samples

Not at all; usually around 1-10 million entries, yet I can see the memory growing significantly. It hasn't been a problem at all so far; I was mostly wondering why it increases linearly.

I'm using it for this project, BTW: https://github.com/seatgeek/wrecker

If you want to play with it, do something like:

wreck https://localhost:8080 --concurrency 100

and keep an eye on the memory consumption. I noticed that t-digest was the main driver of memory growth after I replaced the statistics recorder with a no-op.

In any case, it's a non-issue for me; I was really just curious about the reason.

phadej commented 7 years ago

That link is broken; is it a private repository?

Anyhow, it's nice to see tdigest being used in the wild.

lorenzo commented 7 years ago

Ah crap, the public one is https://github.com/lorenzo/wrecker

phadej commented 7 years ago

@lorenzo thanks, I'll try it out! (seems like a nice tool too)

lorenzo commented 7 years ago

Great, thanks! I'll close this ticket, as it was more a question than an actual problem.