Open bjhartin opened 2 years ago
Hi @bjhartin, by specification you mean just a description / documentation of the storage format? That shouldn't be hard, as it essentially requires describing tdigest_t
and centroid_t
structures, and tdigest_serial
/ tdigest_deserial
functions. The format is pretty simple, really.
The only issue I can think of is endinanness, which is not fixed and depends on the architecture. So far the stored data is only ever exchanged through in/out or send/receive functions, which do the conversions, not directly. But allowing exchanging in the storage format would require that.
@tvondra Could you please describe what would the tdigest table look like? I don't understand how I can store a tdigest in the database if I compute it in my application.
This is the dictionary I get from the t-digest Python library:
{"n": 100000, "delta": 0.01, "K": 25, "centroids": [{"m": 1.0, "c": 17.0}, {"m": 2.0, "c": 8.0}, {"m": 3.0, "c": 4.0}, ...]}
Would it be enough if I create a table called "centroids" with fields "m" and "c"? How can I instantiate a tdigest in one of my SQL queries?
Thank you
Use Case
I'd like to aggregate using a tdigest library within the app, publish once per minute to a tdigest table, then aggregate across instances/intervals via SQL. This works very well with postgres-hll+java-hll, because there is a shared storage specification, i.e. I can serialize the java sketches to the DB.
Request
Publish the details of how you store the TDigest sketch so that language-specific implementations can serialize to it and thus have aggregation 'all the way down', eliminating the need for a table that holds large numbers of individual values.