influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Last N Value Cache #25091

Open hiltontj opened 5 days ago

hiltontj commented 5 days ago

A Last N Value Cache will let users retrieve the most recent values of many series, either by identifier or by group, very quickly (<10 ms).

For a given table and set of columns, users should be able to specify the last N values they want kept cached in RAM. This feature will be available in both open source and Pro, with limitations in the former.

For a given table, the user would specify the lookup key (i.e. the columns to look up by), the number of values to cache, and the columns (either by name or *) to include in the cache. The time of each value will always be included.

Cache Creation

To create a cache, users specify:

  1. Name of the table
  2. Key columns (default to series key, or tag set sort order)
  3. Number of values to store (default to 1, limit to 10)
  4. Columns to cache (default to all non-tag/key fields)
  5. (Pro only) How far back to query to load cache on boot-up

We would like the front-end for this to be available via a REST API.

The configuration of each cache will be stored in the catalog.
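To make the creation parameters concrete, here is a minimal sketch of what a cache configuration record might look like, written in Python for brevity. The class name `LastCacheConfig`, its field names, and the validation rules beyond the stated default of 1 and limit of 10 are assumptions for illustration only; the Pro-only boot-up lookback setting is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_COUNT = 10  # upper limit on cached values, as stated in the proposal


@dataclass(frozen=True)
class LastCacheConfig:
    """Hypothetical configuration record for one last-N-value cache."""

    table: str
    # None means "use the default": the series key / tag set sort order.
    key_columns: Optional[Tuple[str, ...]] = None
    # Number of values to keep per key (default 1, limit 10).
    count: int = 1
    # None means "use the default": all non-tag/key fields.
    value_columns: Optional[Tuple[str, ...]] = None

    def __post_init__(self) -> None:
        if not self.table:
            raise ValueError("table name is required")
        if not 1 <= self.count <= MAX_COUNT:
            raise ValueError(f"count must be between 1 and {MAX_COUNT}")
        if self.key_columns is not None and self.value_columns is not None:
            # Assumption: a column is either a key or a cached value, not both.
            overlap = set(self.key_columns) & set(self.value_columns)
            if overlap:
                raise ValueError(f"columns used as both key and value: {sorted(overlap)}")
```

A record like this could be what the REST API accepts and what the catalog persists, but the actual shape is not decided here.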

Populating the Cache

In open source, the cache should be populated write-through while the server is running. Pro will do the same, and will additionally be able to fill the cache from historical data on boot-up.
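As a rough illustration of the write-through behaviour, the sketch below keeps a bounded buffer per key and updates it on every write. It is in Python for brevity and is not the proposed implementation: the class `LastNCache`, its methods, and the row-as-dict representation are all hypothetical, and it assumes rows arrive in time order (out-of-order writes are not handled).

```python
from collections import defaultdict, deque


class LastNCache:
    """Hypothetical in-memory cache of the last N rows per key."""

    def __init__(self, key_columns, value_columns, count=1):
        self.key_columns = tuple(key_columns)
        self.value_columns = tuple(value_columns)
        self.count = count
        # One bounded buffer per distinct key; newest entry sits at index 0.
        self._entries = defaultdict(lambda: deque(maxlen=count))

    def write(self, row):
        """Called on the write path for every incoming row (write-through)."""
        key = tuple(row[c] for c in self.key_columns)
        cached = {c: row.get(c) for c in self.value_columns}
        cached["time"] = row["time"]  # time is always included
        # When the buffer is full, appendleft drops the oldest entry.
        self._entries[key].appendleft(cached)

    def last(self, key):
        """Return cached rows for a key, newest first (empty if unseen)."""
        return list(self._entries.get(tuple(key), ()))
```

Because the buffers are filled only by `write`, a freshly started server has an empty cache until new writes arrive, which is the gap the Pro boot-up backfill is meant to close.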

Cache Queries

Querying the cache will require a specialized query. The query syntax could look like so:

SELECT foo, time FROM last_cache('some_table');
SELECT foo, time FROM last_cache('some_table') WHERE cola IN ('pepsi', 'coke');
SELECT foo, time FROM last_cache('some_table') WHERE key_col = 'someval';

This is a use-case for DataFusion's User-Defined Table Functions (UDTF).

In some cases, query predicates may be handled directly by the cache's TableProvider/TableFunctionImpl, while more complicated predicates could be passed back up to the query engine. Where we draw that line remains TBD.
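One possible place to draw that line, sketched in Python purely for illustration: let the cache handle equality and IN predicates on key columns, and hand everything else back to the query engine. The function `split_predicates` and the `(column, op, value)` tuple representation are hypothetical and do not correspond to any DataFusion API.

```python
def split_predicates(predicates, key_columns):
    """Split predicates into (handled by cache, left for the query engine).

    Each predicate is a hypothetical (column, op, value) tuple.
    Assumption: the cache can only evaluate "=" and "in" on key columns.
    """
    keys = set(key_columns)
    pushed, residual = [], []
    for pred in predicates:
        column, op, _value = pred
        if column in keys and op in ("=", "in"):
            pushed.append(pred)
        else:
            residual.append(pred)
    return pushed, residual


# Example: two key lookups go to the cache, the range filter does not.
preds = [
    ("cola", "in", ("pepsi", "coke")),
    ("foo", ">", 10),
    ("key_col", "=", "someval"),
]
pushed, residual = split_predicates(preds, ["cola", "key_col"])
```

This is similar in spirit to the filter pushdown decision a DataFusion TableProvider reports to the planner, where filters the provider does not support are evaluated by the engine instead.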

Other Requirements

Tasks

- [ ] https://github.com/influxdata/influxdb/issues/25092
- [ ] https://github.com/influxdata/influxdb/issues/25093
- [ ] https://github.com/influxdata/influxdb/issues/25095
- [ ] https://github.com/influxdata/influxdb/issues/25096
- [ ] https://github.com/influxdata/influxdb/issues/25097
- [ ] https://github.com/influxdata/influxdb/issues/25098
- [ ] https://github.com/influxdata/influxdb/issues/25099
- [ ] https://github.com/influxdata/influxdb/issues/25100

alamb commented 5 days ago

BTW another potential implementation would be to use an OptimizerRule to rewrite plans with relevant references to use a new table provider. Here is an example of how to do that: https://github.com/apache/datafusion/pull/11087

pauldix commented 5 days ago

I'd like it to be explicit to the user that they're requesting values from the cache. That way they know the semantics behind it (i.e. the cache only has data from when the server was running and accepting writes).

We could do the optimizer in addition to that, but ensuring the actual result is the same as a non-optimized result will be tricky as it's just a cache and not the raw underlying data.