rcrowley / go-metrics

Go port of Coda Hale's Metrics library

Fix reservoir sampling #229

Closed markwongsk closed 2 years ago

markwongsk commented 6 years ago

Once the sample has reached the reservoir size, the original implementation popped the minimum element off the priority queue before inserting the incoming element. Priority Sampling should retain the "k items with highest priority". Because each u_i is drawn uniformly from [0..1], the incoming element may actually have a lower priority than the k values already retained, as demonstrated by the seeded test. A minimal sketch of the idea follows.
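This sketch uses a hypothetical reservoir type rather than the library's actual internals: once the reservoir is full, the incoming item displaces the minimum-priority item only if its own priority is higher.

```go
package main

import "fmt"

type item struct {
	value    int64
	priority float64 // w_i / u_i in the paper's notation
}

// reservoir is a hypothetical stand-in for the library's internal sample heap.
type reservoir struct {
	size  int
	items []item
}

// minIndex returns the index of the lowest-priority item (a real
// implementation would keep a min-heap instead of scanning).
func (r *reservoir) minIndex() int {
	min := 0
	for i, it := range r.items {
		if it.priority < r.items[min].priority {
			min = i
		}
	}
	return min
}

// insert retains the k items with the highest priorities.
func (r *reservoir) insert(it item) {
	if len(r.items) < r.size {
		r.items = append(r.items, it)
		return
	}
	// The original code evicted the minimum unconditionally, so the newest
	// item always entered the sample. The fix evicts it only when the new
	// item's priority is actually higher.
	if m := r.minIndex(); it.priority > r.items[m].priority {
		r.items[m] = it
	}
}

func main() {
	r := &reservoir{size: 2}
	r.insert(item{1, 10})
	r.insert(item{2, 20})
	r.insert(item{3, 5}) // lower priority than everything retained: dropped
	fmt.Println(r.items) // [{1 10} {2 20}]
}
```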

markwongsk commented 6 years ago

Fixes #230

mihasya commented 6 years ago

I'm a little unclear on the motivation for these changes. I see the statement that "Priority Sampling should retain the 'k items with highest priority'" - is this coming from an academic definition of Priority Sampling?

The objective of the implementation is to weigh recent data significantly more heavily than older data. Barring a very compelling correctness argument ("these numbers are completely wrong"), I'd be reluctant to change the formula. Anyone upgrading would likely see at least a subtle change in their reported results, which would make comparisons across time periods more difficult.

markwongsk commented 6 years ago

@mihasya Hey, thanks for the comments and I apologize for missing this notification.

For the definition of Priority Sampling, I'm using the definition from the paper that's linked in the comments in the code. For clarity, this is the link: http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf

On page 7 bottom left, "... and the algorithm retains the k items with highest priorities".

Note in particular that the current implementation does not adhere to this definition, since it always includes the latest element of the stream in the sample. The latest element is not guaranteed to belong in the sample, because it may not have a high enough priority: the u_i are drawn uniformly from [0..1], so there is a nonzero probability that w_i/u_i >= w_n/u_n for all of the retained items i < n, where n is the newest element in the stream. Intuitively, a weighted sample should be biased towards recent values, but it should not always include the latest element.
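To illustrate the point numerically, here is a toy calculation (not library code) using an exponential decay weight w(t) = exp(alpha * t); alpha, the timestamps, and the u draws are made-up numbers. An older item with a lucky small u can end up with a higher priority w/u than a newer item with an unlucky large u.

```go
package main

import (
	"fmt"
	"math"
)

// priority computes w_i/u_i for exponential forward decay.
func priority(t, alpha, u float64) float64 {
	return math.Exp(alpha*t) / u
}

func main() {
	alpha := 0.015
	older := priority(10, alpha, 0.05) // older item, small (lucky) u_i
	newer := priority(20, alpha, 0.95) // newer item, large (unlucky) u_n
	fmt.Printf("older=%.2f newer=%.2f newer wins=%v\n", older, newer, newer > older)
	// older=23.24 newer=1.42 newer wins=false
}
```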

Practically, the test I wrote, TestExpDecayDoesNotAlwaysIncludeLastItem, demonstrates that the new implementation does not always include the latest element of the stream in the sample. Note that without my changes this test fails. At the end of the day I'm not introducing anything new, just fixing an implementation flaw in the algorithm. Happy to go into more math if needed.
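For reference, a hedged sketch of what such a test could look like; the actual test in the PR may differ, and the seed, reservoir parameters, and number of updates here are assumptions that may need adjusting for the last item to actually be excluded in a given run.

```go
package metrics_test

import (
	"math/rand"
	"testing"

	metrics "github.com/rcrowley/go-metrics"
)

// Sketch only: with a fixed seed, feed a long stream of values and check
// whether the most recent value made it into the sample. Under the old
// implementation it always did; under the fix it may not.
func TestExpDecayDoesNotAlwaysIncludeLastItemSketch(t *testing.T) {
	rand.Seed(1) // the sample's u_i draws come from math/rand
	s := metrics.NewExpDecaySample(100, 0.01)
	for i := 0; i < 1000; i++ {
		s.Update(int64(i))
	}
	last := int64(999)
	for _, v := range s.Values() {
		if v == last {
			t.Fatalf("expected this seeded run to exclude the last item %d", last)
		}
	}
}
```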

seanjoe commented 6 years ago

Does using this library to monitor a service affect the server's memory? There is a memory leak in my service; the go tool pprof output is as follows:

```
(pprof) top
Showing nodes accounting for 292.13MB, 91.35% of 319.79MB total
Dropped 112 nodes (cum <= 1.60MB)
Showing top 10 nodes out of 97
      flat  flat%   sum%        cum   cum%
  238.64MB 74.62% 74.62%   238.64MB 74.62%  go-metrics.newExpDecaySampleHeap
```

The memory it occupies grows little by little, up to 1 GB and more.