Closed JLockerman closed 2 years ago
Which order do we want arguments to be in? Right now we have
tdigest(buckets, value)
uddsketch(buckets, error, value)
another option is
tdigest(value, buckets)
uddsketch(value, buckets, error)
this would allow us to default some of the trailing arguments if we want
Right now we have a function-per-algorithm, ie.
tdigest(buckets, value)
uddsketch(buckets, error, value)
another option is to make the algorithm an argument
approximate_quantile(value, 'tdigest', buckets)
approximate_quantile(value, 'uddsketch', buckets, error)
(note that for this version all approximate quantiles have to return the same type, currently they don't because the different sketches offer different capabilites)
or
approximate_quantile(value, 'tdigest', '{"buckets": 100}')
approximate_quantile(value, 'uddsketch', '{"buckets": 100, "error":0.10}')
(note that for this version all approximate quantiles have to return the same type)
or
approximate_quantile(value, NULL::tdigest, buckets)
approximate_quantile(value, NULL::uddsketch, buckets, error)
approximate_quantile(value, NULL::tdigest, buckets)
is ugly enough that we don't think it's tenable.
It's not obvious if the unified API makes up for the type confusion in the other alternatives.
For functions with a sensible default, it may pay to alias the the algorithm under a more general name. eg. have
approximate_quantile(...)
as an alias for
tdigest(...)
It may be useful to have names describe what the function does in addition to what the algorithm is. There are no very good examples in this release, the only one to consider is tdigest_sketch(...)
for tdigest(...)
, a better example is lttb_downsample(...)
instead of lttb(...)
.
For the approximate quantiles, they could be called udd_approximate_quantile(...)
and tdigest_approximate_quantile(...)
We could possibly have both long and short names, eg. tdigest_approximate_quantile(...)
and tdigest(...)
We could also name based on how it's supposed to work, eg. approximate_quantile(...)
for tdigest and approximate_quantile_with_error(...)
for uddsketch
cc @avthars, @ryanbooz, @mfreed, @cevian. You might be interested in this.
can we change the uddsketch disk layout to be
UddSketch {
alpha: f64,
max_buckets: u32,
num_buckets: u32,
compactions: u64,
count: u64,
sum: f64,
first_non_nengative_key: u32,
key_is_zero: bool,
padding: [u8; 3],
keys: [i64; self.num_buckets],
counts: [u64; self.num_buckets],
}
it should save about 8 bytes per bucket
Downsampling is much easier to unify as downsample('lttb', value)
since all downsampling returns the same type.
the get_foo
(get_count
, get_min
etc) functions for each of these:
a) the name isn't the best
b) in the current documentation, we refer to them in headers/links as tdigest_foo
or uddsketch_foo
, which is confusing.
We're thinking we want to add a generic percentile_approx
or similar function before stabilization, at the very least a docs page that gets people pointed in the right direction and gives people where to start if they want to just use an approximate percentile algo and want something that will "do the right thing"
API discussion is now moved to Discussions #113
@WireBaron is currently fuzzing t-digest (PR #118), and uddsketch.
Analysis of the t-digest on-disk layout found nothing of note.
Analysis of the uddsketch on-disk layout found a potential 10x space improvement (from 2KB to around 300B) by delta-encoding the buckets and storing them in varints. This is a large enough improvement that we will likely change the layout to this before stabilization.
Discussion is happening in Discussions https://github.com/timescale/timescale-analytics/discussions/113
Can we stabilize new features for the next release (planned for May 2021):
Potential features:
Before Stabilization
During Stabilization
Notable Feedback
Forthcoming
Open Questions