JLockerman commented 3 years ago

Can we stabilize new features for the next release (planned for May 2021):

Potential features:

Before Stabilization

[x] Check on-disk layout
[x] Continuous aggregate tests
[x] Edge case tests
[x] Benchmark for pathological performance
[x] Edit Uddsketch on-disk buckets to be delta-encoded varints
[x] Gather more user feedback
[x] Audit public APIs for consistency (https://github.com/timescale/timescale-analytics/discussions/113)
[x] Audit documentation for each type
[x] Remove experimental schema and markers for each type
[ ] ~Add stability marker for each type~ Not doing stability-based updates right now
[ ] ~Add update tests~ rename means this is the first version, add them for the next
[x] ping docs in https://github.com/orgs/timescale/projects/10

During Stabilization

[x] move features out of experimental schema
[x] add features to the stable list
[x] open PR on new docs repo for features

Notable Feedback

Forthcoming

Open Questions

[ ] disk-layout tests?

JLockerman commented 3 years ago

Argument Layout

Which order do we want arguments to be in? Right now we have

tdigest(buckets, value)
uddsketch(buckets, error, value)

another option is

tdigest(value, buckets)
uddsketch(value, buckets, error)

this would allow us to default some of the trailing arguments if we want

JLockerman commented 3 years ago

Function Unification

Right now we have a function-per-algorithm, ie.

tdigest(buckets, value)
uddsketch(buckets, error, value)

another option is to make the algorithm an argument

approximate_quantile(value, 'tdigest', buckets)
approximate_quantile(value, 'uddsketch', buckets, error)

(note that for this version all approximate quantiles have to return the same type, currently they don't because the different sketches offer different capabilites)

or

approximate_quantile(value, 'tdigest', '{"buckets": 100}')
approximate_quantile(value, 'uddsketch', '{"buckets": 100, "error":0.10}')

(note that for this version all approximate quantiles have to return the same type)

or

approximate_quantile(value, NULL::tdigest, buckets)
approximate_quantile(value, NULL::uddsketch, buckets, error)

Discussion

approximate_quantile(value, NULL::tdigest, buckets) is ugly enough that we don't think it's tenable. It's not obvious if the unified API makes up for the type confusion in the other alternatives.

JLockerman commented 3 years ago

Naming Principles

Aliases

For functions with a sensible default, it may pay to alias the the algorithm under a more general name. eg. have

approximate_quantile(...)

as an alias for

tdigest(...)

Descriptive Name

It may be useful to have names describe what the function does in addition to what the algorithm is. There are no very good examples in this release, the only one to consider is tdigest_sketch(...) for tdigest(...), a better example is lttb_downsample(...) instead of lttb(...).

For the approximate quantiles, they could be called udd_approximate_quantile(...) and tdigest_approximate_quantile(...)

We could possibly have both long and short names, eg. tdigest_approximate_quantile(...) and tdigest(...)

We could also name based on how it's supposed to work, eg. approximate_quantile(...) for tdigest and approximate_quantile_with_error(...) for uddsketch

JLockerman commented 3 years ago

cc @avthars, @ryanbooz, @mfreed, @cevian. You might be interested in this.

JLockerman commented 3 years ago

can we change the uddsketch disk layout to be

UddSketch {
    alpha: f64,
    max_buckets: u32,
    num_buckets: u32,
    compactions: u64,
    count: u64,
    sum: f64,
    first_non_nengative_key: u32,
    key_is_zero: bool,
    padding: [u8; 3],
    keys: [i64; self.num_buckets],
    counts: [u64; self.num_buckets],
}

it should save about 8 bytes per bucket

JLockerman commented 3 years ago

Downsampling is much easier to unify as downsample('lttb', value) since all downsampling returns the same type.

davidkohn88 commented 3 years ago

the get_foo (get_count, get_min etc) functions for each of these: a) the name isn't the best b) in the current documentation, we refer to them in headers/links as tdigest_foo or uddsketch_foo, which is confusing.

davidkohn88 commented 3 years ago

We're thinking we want to add a generic percentile_approx or similar function before stabilization, at the very least a docs page that gets people pointed in the right direction and gives people where to start if they want to just use an approximate percentile algo and want something that will "do the right thing"

JLockerman commented 3 years ago

API discussion is now moved to Discussions #113

JLockerman commented 3 years ago

Update 2020/04/20

Edge-Case Tests

@WireBaron is currently fuzzing t-digest (PR #118), and uddsketch.

On-Disk Layouts

Analysis of the t-digest on-disk layout found nothing of note.
Analysis of the uddsketch on-disk layout found a potential 10x space improvement (from 2KB to around 300B) by delta-encoding the buckets and storing them in varints. This is a large enough improvement that we will likely change the layout to this before stabilization.

API Finalization

Discussion is happening in Discussions https://github.com/timescale/timescale-analytics/discussions/113

timescale / timescaledb-toolkit

First Stabilization Checklist #107

Potential features:

Before Stabilization

During Stabilization

Notable Feedback

Open Questions

Argument Layout

Function Unification

Discussion

Naming Principles

Aliases

Descriptive Name

Update 2020/04/20

Edge-Case Tests

On-Disk Layouts

API Finalization