Mov quantile - Githubissues

novertia commented 4 months ago

Fix #751

This PR adds the moving quantile algorithm based on the articles mentioned in the issue. It follows the quantile approximation strategy when the rolling window do not have enough data. This PR adds both the code logic under mov_quantile.h as well as the required unit test under test_mov_quantile.cpp

novertia commented 4 months ago

@gavv for calculating the index of heap to return (root index) of partition heap I am currently taking a percentile of size_t checking if its 0 -100 and then calculating the index (percentile * window_length)/100. I am not sure if this approach is okay or should I instead take percentile between 0-1 and calculate and type cast accordingly.

gavv commented 4 months ago

Thanks for PR! Me or @adrianopol will do review in the upcoming days.

Regarding 0..100 vs 0..1: I guess we should either name it MovPercentile and use 0..100 or name it MovQuantile and use 0..1. In both cases, we need to use float or double for this number.

novertia commented 4 months ago

If we use float or double for taking the percentile or quantile for index calculation at one point we will have to type cast window length from size_t to double or percentile to size_t since the index and window_length is in size_t. Should this approach be okay considering precision loss cases for the aforementioned type casting.

Can we take k instead of a percentile or quantile which can be the kth smallest element user want from the window length like in the document. Although not sure if this approach is in sync with the requirement.

gavv commented 4 months ago

Just to be sure I understand the question correctly. You're asking what pair of parameters to use, among these three alternatives, right?

window_size (L) + K in range [0; L)
window_size (L) + percentile (P) in range [0; 100) (then K = L * P / 100)
window_size (L) + quantile (Q) in range [0; 1) (then K = L * Q)

I think in code that will use MovPercentile/MovQuantile, the original parameters that we have would be window_size and percentile/quantile. E.g. we may need to compute 95th percentile of inter-arrival jitter for last 300 packets. We'll likely have numbers 300 and 95 (or 0.95) in config. We could compute K from those numbers, though it would be convenient if MovPercentile/MovQuantile would compute it for us.

As for rounding errors, I believe that if Q is 32-bit float in range [0; 1), then we can compute K precisely for any L in range [0; 2^23), because floats in range [0; 1) can be mapped one-to-one to integers in range [0; 2^23). If that's true, it looks more than enough and we don't have to bother about rounding errors.

I suggest to stick with MovQuantile approach with parameters L + Q, just to be sure we use floats or doubles in range [0; 1), since that range has highest precision (24 bits for 32-bit float, 48 bits for 64-bit float). If you think floats are not enough, let's use doubles.

novertia commented 4 months ago

Sure I understood will make the changes

gavv commented 3 months ago

Awesome, thanks a lot for PR and sorry for delay!

All suggestions I have are cosmetic, so I just committed them by myself: 4e6b41353abf0a43453d303032157746532253be (copyright year, make methods private, use const, and a few renames).

I also added a simple stress test with random window size and values: 0bc19da495a6217abeebaef27a597b6e1c2ebf9b, seems to work fine.

One small question about this snippet:

        if (win_filled_) {
            ...
            min_heap_index_ = win_len_ - 1;
            max_heap_index_ = 0;

Are there situations when min_heap_index_ and max_heap_index_ are not already set to these values in this branch?

novertia commented 3 months ago

At the very start when there are not enough values to fill the window at that point the values will be different. Both of them will start from heap root and max_heapindex will decrement and min_heapindex will increment. Once the window is filled then they will take on the mentioned values. This method will still give us quantile even if we don't have all the values in the window.

Thanks for merging the PR

gavv commented 3 months ago

Yeah, I mean, by the time when we enter if (win_filled_) { branch, min_heap_index_ and max_heap_index_ always already have values win_len_ - 1 and 0, right? And these assignments are redundant.

Or am I missing something, and and it's possible that win_filled_ is true, but those two variables have different values?

novertia commented 3 months ago

That is correct and the assignments are redundant. Sorry i missed that.

novertia commented 3 months ago

when winfilled that values max_heapindex will be 0 and min_heapindex is win_len - 1 there can't be another case.

gavv commented 3 months ago

Got it, thanks. No worries, I just wanted to ensure I'm not missing something important.

roc-streaming / roc-toolkit

Mov quantile #761