influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Investigate Parquet index #24819

Open pauldix opened 5 months ago

pauldix commented 5 months ago

After a brief discussion with @alamb this morning at breakfast, there's another idea we'd like to investigate for improving performance. An index of Parquet files where the API is something like

The idea here is that if a predicate filters out most of a table's data and that predicate is part of a series (i.e. the data is sorted by series), then only a small number of offsets will match, and the query engine should be able to read just those parts of the Parquet file to pull out that data.
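To make the idea concrete, here is a minimal in-memory sketch in Python. It is not the API from this issue and not anything that exists in InfluxDB or DataFusion: the names `RowRange`, `ParquetSeriesIndex`, `record`, and `lookup` are hypothetical, and it assumes that an "offset" means a half-open row range within a file and that a series is identified by a string key.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class RowRange:
    """Hypothetical location of one series' rows inside one Parquet file."""
    file_path: str
    first_row: int  # inclusive row offset within the file (assumption)
    last_row: int   # exclusive row offset within the file (assumption)


class ParquetSeriesIndex:
    """Hypothetical index: series key -> row ranges across Parquet files."""

    def __init__(self):
        self._ranges = defaultdict(list)

    def record(self, series_key, file_path, first_row, last_row):
        """Note, at ingest/persist time, where a series landed in a file."""
        if first_row >= last_row:
            raise ValueError("row range must be non-empty")
        self._ranges[series_key].append(RowRange(file_path, first_row, last_row))

    def lookup(self, series_key):
        """Return the ranges to read for one series, ordered by file and offset."""
        return sorted(self._ranges.get(series_key, []))


# Because each file is sorted by series, a series occupies one contiguous
# row range per file, so a single-series query only touches a few ranges.
index = ParquetSeriesIndex()
index.record("cpu,host=a", "file1.parquet", 0, 1000)
index.record("cpu,host=b", "file1.parquet", 1000, 2500)
index.record("cpu,host=a", "file2.parquet", 0, 800)

for row_range in index.lookup("cpu,host=a"):
    print(row_range)
```

A query for `cpu,host=a` would then hand only those two ranges to the query engine instead of scanning both files in full. How such ranges would actually be passed to DataFusion is the open question in the first task below.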

This is part of a set of investigations of which #24815 is a member.

Some tasks related to this:

### Tasks
- [ ] Investigate how we could feed the specific Parquet files and offsets to the query engine (start with DataFusion)
- [ ] Build an in-memory prototype that builds up this index as data is ingested
- [ ] Benchmark against various numbers of Parquet files and different levels of cardinality

alamb commented 5 months ago

As I understand it, the goal of this exercise is to figure out the best possible query performance we can squeeze out of "single series" queries reading Parquet data.