Well, it turns out `open_dataset("single_file.parquet")` is significantly faster than reading the file in (even as an Arrow table) and operating on it. I'm a little bit surprised (especially for the feather files!), but I wonder if the dataset scans of the files are more optimized than a query against a table that is backed by a file?
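For concreteness, here's a minimal sketch of the two code paths being compared (the file path and grouping column are hypothetical; both use arrow's dplyr bindings):

```r
library(arrow)
library(dplyr)

# Path 1: dataset scan over a single file; record batches are streamed
# through the query rather than materialized up front.
via_dataset <- open_dataset("single_file.parquet") %>%
  group_by(some_key) %>%  # hypothetical column
  summarise(n = n()) %>%
  collect()

# Path 2: read the file into an Arrow Table first, then run the same
# query against the in-memory table.
tab <- read_parquet("single_file.parquet", as_data_frame = FALSE)
via_table <- tab %>%
  group_by(some_key) %>%
  summarise(n = n()) %>%
  collect()
```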
I'm also pretty surprised that `open_dataset(feather_file)` is faster than the query engine against the table already resident in memory (the "native" format).
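The feather comparison looks the same, just with the feather reader and `format = "feather"` on the dataset side (again, the path and grouping column are hypothetical):

```r
# Dataset scan over a single feather file.
ds <- open_dataset("single_file.feather", format = "feather")

# The "native" case: the same data already resident in memory as an Arrow Table.
native <- read_feather("single_file.feather", as_data_frame = FALSE)

# Identical hypothetical query against each.
ds %>% group_by(some_key) %>% summarise(n = n()) %>% collect()
native %>% group_by(some_key) %>% summarise(n = n()) %>% collect()
```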
Any thoughts about an explanation for these oddities, @bkietz?
With datasets, our performance against parquet files is in line with DuckDB's, though our native query processing is considerably slower.
Turning off memory mapping has the impacts I expected:
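For reference, this is roughly what the toggle looks like on the read side; if I remember right, `read_feather()` takes an `mmap` argument (treat that argument as an assumption if you're on a different arrow version):

```r
# Memory-mapped read (the default, assuming `mmap` works as described).
mapped <- read_feather("single_file.feather", as_data_frame = FALSE, mmap = TRUE)

# Buffered read: turn memory mapping off.
buffered <- read_feather("single_file.feather", as_data_frame = FALSE, mmap = FALSE)
```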
We have a few questions that we need to answer:
- should we keep `read_parquet()`/`read_feather()` alongside `open_dataset()`, or just use the dataset processing? I can leave both in and only use one as a default (though if we were to do that, I would change the way they are specified; see the sketch after this list).
- what our defaults should be (and what should be run on every commit in conbench, which should be the same thing, though they could technically be different). I propose:
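On the first question, a hypothetical sketch of what "leave both in, default to one" could look like (the function and argument names here are illustrative, not this package's actual interface):

```r
# Dispatch to either the dataset path or the table readers, defaulting
# to the dataset path. Purely illustrative naming.
read_source <- function(path, reader = c("dataset", "table")) {
  reader <- match.arg(reader)
  fmt <- tools::file_ext(path)  # e.g. "parquet" or "feather"
  if (reader == "dataset") {
    arrow::open_dataset(path, format = fmt)
  } else if (fmt == "parquet") {
    arrow::read_parquet(path, as_data_frame = FALSE)
  } else {
    arrow::read_feather(path, as_data_frame = FALSE)
  }
}
```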
@nealrichardson I'm planning to merge this today, unless there's anything else you would like to change or you need more time to look at it.
Possible scope creep (probably not worth doing right now, if we're going to split source management out into a separate library)