voltrondata-labs / arrowbench

R package for benchmarking

TPC-H benchmarks #33

Closed jonkeane closed 3 years ago

jonkeane commented 3 years ago

Possible scope creep (probably not worth doing right now if we're going to split this out into a separate library for managing sources)

jonkeane commented 3 years ago

Well, it turns out that querying open_dataset("single_file.parquet") is significantly faster than reading the file in (even as an Arrow table) and then operating on it. I'm a little bit surprised (especially for the feather files!), and I wonder whether dataset scans of files are more optimized than a query against a table that is backed by a file.

I'm also pretty surprised that open_dataset(feather_file) is faster than the query engine against the table already resident in memory (the "native" format).
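
For anyone who wants to poke at this outside of the benchmark harness, here is a minimal sketch of the two paths being compared. It is not the arrowbench code: the generated data, the `run_query()` helper, and the aggregation are all hypothetical stand-ins for the TPC-H tables and queries, and it assumes an arrow version whose dplyr backend supports `group_by()`/`summarise()` on both Datasets and Tables.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data; the benchmarks in this PR use TPC-H tables.
set.seed(1)
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(key = rep(1:10, 1e4), value = runif(1e5)), path)

# Hypothetical stand-in query (not one of the TPC-H queries).
run_query <- function(src) {
  src %>%
    group_by(key) %>%
    summarise(total = sum(value)) %>%
    collect()
}

# Path 1: scan the single file as a dataset and query it.
time_dataset <- system.time(res_dataset <- run_query(open_dataset(path)))

# Path 2: read the file into an Arrow Table first, then query the Table.
# The read and the query are timed separately so both variants
# (file-backed read + query, and query on an already-loaded Table) are visible.
time_read <- system.time(tab <- read_parquet(path, as_data_frame = FALSE))
time_table <- system.time(res_table <- run_query(tab))

print(rbind(dataset = time_dataset, read = time_read, table = time_table))
```

Timings from a toy file like this will not necessarily reproduce the gap seen in the TPC-H runs; the point is only to make the two code paths explicit.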

Any thoughts about an explanation for these oddities @bkietz?

With datasets, our performance against parquet files is in line with DuckDB's, though our native query processing takes considerably longer.

Turning off memory mapping has the impact I expected:

TPC-H.html.zip

We have a few questions that we need to answer:

jonkeane commented 3 years ago

@nealrichardson I'm planning to merge this today, unless there is anything else you would like to change or you need more time to look at it.