Problem
The TableProvider implementation for the MetaCacheFunctionProvider does not currently handle projection pushdown: https://github.com/influxdata/influxdb/blob/20d09a8dda5ac42a6cb388b15edc169f9fbbd709/influxdb3_cache/src/meta_cache/table_function.rs#L46
This means that the cache gets a full scan (within the bounds of the provided predicates) regardless of the provided projection. For a cache with multiple levels, if the user is only interested in the top level, this wastes cycles scanning the lower levels. If the user is interested in the lower levels, we still need to scan through the higher levels, but we could at least avoid building the arrow buffers for those columns.
In addition, the output when projecting to lower levels of the cache is not ordered; that may warrant a separate issue.
Proposed solution
The projection provided to TableProvider::scan could be passed down to MetaCache::to_record_batch to scan the cache more efficiently:
- do not build arrow buffers for unneeded columns
- only scan down to the lowest needed level in the cache
- update the MetaCacheExec to include details about projected columns
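The first two points above can be sketched roughly as follows. This is a minimal, std-only illustration and not the actual MetaCache code: `Node`, `scan`, and `walk` are hypothetical names, a nested `BTreeMap` stands in for the cache hierarchy, and plain `Vec<String>` columns stand in for arrow buffers. It also assumes every path in the cache has the same depth, and that stopping at the deepest projected level (one row per distinct path down to that level) is the desired behaviour, which differs from projecting after a full scan.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one cache level: each distinct value at this
/// level maps to the node that holds the next level's values.
#[derive(Default)]
struct Node(BTreeMap<String, Node>);

impl Node {
    /// Inserts one path of values, creating intermediate levels as needed.
    fn insert(&mut self, path: &[&str]) {
        if let Some((first, rest)) = path.split_first() {
            self.0.entry(first.to_string()).or_default().insert(rest);
        }
    }
}

/// Scans the cache and returns one column per projected level, in projection
/// order. `None` means "all levels", mirroring an absent projection.
fn scan(root: &Node, num_levels: usize, projection: Option<&[usize]>) -> Vec<Vec<String>> {
    let levels: Vec<usize> = match projection {
        Some(p) => p.to_vec(),
        None => (0..num_levels).collect(),
    };
    // Only allocate output columns for the projected levels.
    let mut columns: Vec<Vec<String>> = vec![Vec::new(); levels.len()];
    // Deepest level we have to visit; an empty projection needs no scan.
    let max_depth = match levels.iter().max() {
        Some(&deepest) => deepest + 1,
        None => return columns,
    };
    let mut path = Vec::with_capacity(max_depth);
    walk(root, max_depth, &levels, &mut path, &mut columns);
    columns
}

/// Depth-first walk that stops descending once `max_depth` is reached.
fn walk<'a>(
    node: &'a Node,
    max_depth: usize,
    levels: &[usize],
    path: &mut Vec<&'a str>,
    columns: &mut [Vec<String>],
) {
    for (value, child) in &node.0 {
        path.push(value.as_str());
        if path.len() == max_depth {
            // Emit a row, copying only the projected levels.
            for (column, &level) in columns.iter_mut().zip(levels) {
                column.push(path[level].to_string());
            }
        } else {
            walk(child, max_depth, levels, path, columns);
        }
        path.pop();
    }
}
```

With a three-level cache, calling `scan(&root, 3, Some(&[0usize][..]))` in this sketch never visits the second or third level, while `scan(&root, 3, Some(&[2usize][..]))` still walks the upper levels but builds only one column. Values in that lower-level column come out in traversal order of the upper levels, so they are not globally sorted, which is the ordering behaviour noted in the problem description.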
Alternatives
N/A
Additional context
Currently, DataFusion handles projection at a higher level, so this isn't a show-stopper: the cache will still work as intended when projections are provided in the query.
The method that walks the cache hierarchy to do predicate evaluation and build the arrow buffers is here.
An example showing that the output when projecting a lower column is not ordered is here.