We've tried using Apache DataFusion for the new query engine. On average it is slower than DuckDB, but it's much more hackable and extensible, which may pay off for our use case. At the very least, it lets us build expression trees directly instead of assembling an SQL string and parsing it back.
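A minimal sketch of what that looks like with the DataFrame API (the `events` table, file path, and column names are illustrative; exact signatures vary a bit across DataFusion versions):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register a parquet file as a table (path is illustrative).
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    // The query is built as an expression tree directly: no SQL string,
    // no parse/plan round-trip.
    let df = ctx
        .table("events")
        .await?
        .filter(col("amount").gt(lit(100)))?
        .select(vec![col("user_id"), col("amount")])?;

    df.show().await?;
    Ok(())
}
```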
Looking at how query performance can be optimized, there are several approaches to try:
[x] Define strict schema for the parquet files
[x] Stream query results instead of full-batch processing
[x] Try different approaches for cross-table lookups
[ ] Use bloom filters to speed up queries over sparse data (see the sketch after this list)
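Not tried yet, but enabling bloom filters at write time via the parquet crate's `WriterProperties` should be enough to experiment with; newer DataFusion releases appear to consult them on the read side (there is a `datafusion.execution.parquet.bloom_filter_on_read` config option, assuming our version has it). A sketch, with illustrative schema and path:

```rust
use std::{fs::File, sync::Arc};

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("user_id", DataType::Int64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Bloom filters let point lookups skip row groups that cannot contain
    // the probed value -- exactly the sparse-data case we care about.
    let props = WriterProperties::builder()
        .set_bloom_filter_enabled(true)
        .build();

    let file = File::create("data/events.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```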
DataFusion uses the Tokio scheduler to parallelize CPU-intensive tasks, which is a questionable design decision: it doesn't fully utilize the CPU, probably idling on some IO operations.
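The one knob worth trying here is the partition count, which controls how many Tokio tasks a plan fans out into. A sketch, assuming `SessionContext::new_with_config` (older releases call it `with_config`) and 16 as a stand-in for the physical core count:

```rust
use datafusion::prelude::*;

fn main() {
    // DataFusion splits work across `target_partitions` Tokio tasks;
    // raising it toward the physical core count is the main lever we
    // have over CPU utilization.
    let config = SessionConfig::new().with_target_partitions(16);
    let ctx = SessionContext::new_with_config(config);
    let _ = ctx; // ... register tables and run queries as usual.
}
```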
I couldn't find a way to provide a sort order; however, their presentation suggests that it should help optimize queries.
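For the record, the SQL DDL path does seem to accept a `WITH ORDER` clause when registering external tables, though that brings back the SQL strings we wanted to avoid. A hedged sketch (untested whether the optimizer actually exploits it; names and path are illustrative):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Declare the files as sorted by `ts` when registering the table.
    ctx.sql(
        "CREATE EXTERNAL TABLE events (ts BIGINT, user_id BIGINT) \
         STORED AS PARQUET \
         WITH ORDER (ts ASC) \
         LOCATION 'data/events.parquet'",
    )
    .await?;
    Ok(())
}
```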
The logical plan optimizer is far from perfect. For example, it doesn't eliminate unions of sets with their own subsets, so such a query ends up scanning the table twice. As a result, we have to build queries carefully on the client side (see the sketch below).
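Concretely, the pattern to avoid looks like this (table and column names are illustrative): one branch of the union is strictly contained in the other, yet the plan still scans the table once per branch, so we collapse it before handing the query to DataFusion.

```rust
use datafusion::prelude::*;

async fn recent_events(ctx: &SessionContext) -> datafusion::error::Result<DataFrame> {
    let all_recent = ctx.table("events").await?.filter(col("ts").gt(lit(0)))?;
    let subset = ctx
        .table("events")
        .await?
        .filter(col("ts").gt(lit(0)).and(col("amount").gt(lit(100))))?;

    // Naive plan: `subset` is contained in `all_recent`, but the logical
    // optimizer does not eliminate it, so `events` is scanned twice.
    let twice_scanned = all_recent.clone().union_distinct(subset)?;
    let _ = twice_scanned;

    // Client-side rewrite: the union of a set with its own subset is the
    // set itself, so keep only the superset branch -- one scan.
    all_recent.distinct()
}
```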
Providing a schema doesn't speed things up (for some reason it even makes things slower, by ~8% according to micro-benchmarks).
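For reference, this is how we pass it, via the `ParquetReadOptions` builder (the `schema` method may be version-dependent; field names are illustrative):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // The strict schema we expect the files to match.
    let schema = Schema::new(vec![
        Field::new("user_id", DataType::Int64, false),
        Field::new("amount", DataType::Int64, true),
    ]);

    // Supplying the schema up front skips inference over the file footers.
    ctx.register_parquet(
        "events",
        "data/events.parquet",
        ParquetReadOptions::default().schema(&schema),
    )
    .await?;
    Ok(())
}
```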