timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com
Other
377 stars 47 forks source link

First Stabilization Checklist #107

Closed JLockerman closed 2 years ago

JLockerman commented 3 years ago

Can we stabilize new features for the next release (planned for May 2021):

Potential features:

Before Stabilization

During Stabilization

Notable Feedback

Forthcoming

Open Questions

JLockerman commented 3 years ago

Argument Layout

Which order do we want arguments to be in? Right now we have

tdigest(buckets, value)
uddsketch(buckets, error, value)

another option is

tdigest(value, buckets)
uddsketch(value, buckets, error)

this would allow us to default some of the trailing arguments if we want

JLockerman commented 3 years ago

Function Unification

Right now we have a function-per-algorithm, ie.

tdigest(buckets, value)
uddsketch(buckets, error, value)

another option is to make the algorithm an argument

approximate_quantile(value, 'tdigest', buckets)
approximate_quantile(value, 'uddsketch', buckets, error)

(note that for this version all approximate quantiles have to return the same type, currently they don't because the different sketches offer different capabilites)

or

approximate_quantile(value, 'tdigest', '{"buckets": 100}')
approximate_quantile(value, 'uddsketch', '{"buckets": 100, "error":0.10}')

(note that for this version all approximate quantiles have to return the same type)

or

approximate_quantile(value, NULL::tdigest, buckets)
approximate_quantile(value, NULL::uddsketch, buckets, error)

Discussion

approximate_quantile(value, NULL::tdigest, buckets) is ugly enough that we don't think it's tenable. It's not obvious if the unified API makes up for the type confusion in the other alternatives.

JLockerman commented 3 years ago

Naming Principles

Aliases

For functions with a sensible default, it may pay to alias the the algorithm under a more general name. eg. have

approximate_quantile(...)

as an alias for

tdigest(...)

Descriptive Name

It may be useful to have names describe what the function does in addition to what the algorithm is. There are no very good examples in this release, the only one to consider is tdigest_sketch(...) for tdigest(...), a better example is lttb_downsample(...) instead of lttb(...).

For the approximate quantiles, they could be called udd_approximate_quantile(...) and tdigest_approximate_quantile(...)

We could possibly have both long and short names, eg. tdigest_approximate_quantile(...) and tdigest(...)

We could also name based on how it's supposed to work, eg. approximate_quantile(...) for tdigest and approximate_quantile_with_error(...) for uddsketch

JLockerman commented 3 years ago

cc @avthars, @ryanbooz, @mfreed, @cevian. You might be interested in this.

JLockerman commented 3 years ago

can we change the uddsketch disk layout to be

UddSketch {
    alpha: f64,
    max_buckets: u32,
    num_buckets: u32,
    compactions: u64,
    count: u64,
    sum: f64,
    first_non_nengative_key: u32,
    key_is_zero: bool,
    padding: [u8; 3],
    keys: [i64; self.num_buckets],
    counts: [u64; self.num_buckets],
}

it should save about 8 bytes per bucket

JLockerman commented 3 years ago

Downsampling is much easier to unify as downsample('lttb', value) since all downsampling returns the same type.

davidkohn88 commented 3 years ago

the get_foo (get_count, get_min etc) functions for each of these: a) the name isn't the best b) in the current documentation, we refer to them in headers/links as tdigest_foo or uddsketch_foo, which is confusing.

davidkohn88 commented 3 years ago

We're thinking we want to add a generic percentile_approx or similar function before stabilization, at the very least a docs page that gets people pointed in the right direction and gives people where to start if they want to just use an approximate percentile algo and want something that will "do the right thing"

JLockerman commented 3 years ago

API discussion is now moved to Discussions #113

JLockerman commented 3 years ago

Update 2020/04/20

Edge-Case Tests

@WireBaron is currently fuzzing t-digest (PR #118), and uddsketch.

On-Disk Layouts

Analysis of the t-digest on-disk layout found nothing of note.
Analysis of the uddsketch on-disk layout found a potential 10x space improvement (from 2KB to around 300B) by delta-encoding the buckets and storing them in varints. This is a large enough improvement that we will likely change the layout to this before stabilization.

API Finalization

Discussion is happening in Discussions https://github.com/timescale/timescale-analytics/discussions/113