Reservoir sampler - Githubissues

thorfour commented 2 months ago

Adds a new operator to the query engine to perform random sampling of the rows.

Initially this was implemented in the physical conversion layer but it required us to push down the filter into the physical layer which created a significant amount of overhead as filtering on parquet values is slow. That approach has been abandoned for now but can be revisited in the future.

asubiotto commented 2 months ago

I think we had agreed to not enforce determinism across query runs since it seems difficult to do and we're unsure it's a hard requirement currently: https://docs.google.com/document/d/1t26jWIT7kvmgyzfQld_KtXzrSw6BJ76cxrU2qIPryD8/edit#heading=h.lb631ysiorah

brancz commented 2 months ago

So while there are other ways to achieve this, the requirements in Parca/Polar Siganls Cloud that must continue to hold true are:

1) sharing a link must result in users seeing exactly the same result 2) data across visualizations must be consistent (eg. viewing data in an icicle graph must show the same values as tables) 3) the binary names returned from the new metadata call must be consistent with what's shown in the icicle graph

The only way I can see this continuing to be true would be if we cache results from a query indefinitely, and only render reports based on the cached result.

thorfour commented 2 months ago

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     390µs ± 0%  -0.40%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     403kB ± 0%  +0.27%  (p=0.000 n=8+9)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%       707 ± 0%  +2.72%  (p=0.000 n=10+10)

Aggregation benchmarks when sampling with a reservoir that is > total rows. So there should be little impact to queries that don't need to be sampled.

thorfour commented 2 months ago

Latest benchmarks with the optimization commits; the aggregate sum test where a sampler is added that exceeds the number of samples (i.e logical no-op)

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     395µs ± 1%  +0.91%  (p=0.000 n=9+8)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     403kB ± 0%  +0.27%  (p=0.000 n=8+9)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%       704 ± 0%  +2.40%  (p=0.000 n=10+10)

Same benchmark but now sampling half of the samples. Obviously there's overhead but it's a lot less than it was before the optimizations

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%     898µs ± 0%   +129.34%  (p=0.000 n=9+7)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%     736kB ± 0%    +83.08%  (p=0.000 n=8+10)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%     13731 ± 0%  +1895.84%  (p=0.000 n=10+10)

Before optimizations is below:

thor@thors-MacBook-Pro ~/.../github.com/polarsignals/frostdb % benchstat before.txt after.txt
name                old time/op    new time/op    delta
Aggregation/sum-10     392µs ± 0%    3839µs ± 1%   +880.17%  (p=0.000 n=9+10)

name                old alloc/op   new alloc/op   delta
Aggregation/sum-10     402kB ± 0%    2875kB ± 0%   +615.38%  (p=0.000 n=8+10)

name                old allocs/op  new allocs/op  delta
Aggregation/sum-10       688 ± 0%     35264 ± 0%  +5025.61%  (p=0.000 n=10+10)

polarsignals / frostdb

Reservoir sampler #884