Well, it turns out `open_dataset("single_file.parquet")` is significantly faster than reading the file in (even as an Arrow table) and operating on it. I'm a little bit surprised (especially for the feather files!), but I wonder if the dataset scans of the files are more optimized than a query against a table that is backed by a file?
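For concreteness, here's a minimal sketch of the two code paths being compared (the file path and grouping column are hypothetical; both use arrow's dplyr bindings):

```r
library(arrow)
library(dplyr)

# Path 1: dataset scan over a single file; record batches are streamed
# through the query rather than materialized up front.
via_dataset <- open_dataset("single_file.parquet") %>%
  group_by(some_key) %>%  # hypothetical column
  summarise(n = n()) %>%
  collect()

# Path 2: read the file into an Arrow Table first, then run the same
# query against the in-memory table.
tab <- read_parquet("single_file.parquet", as_data_frame = FALSE)
via_table <- tab %>%
  group_by(some_key) %>%
  summarise(n = n()) %>%
  collect()
```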
I'm also pretty surprised that `open_dataset(feather_file)` is faster than the query engine against the table already resident in memory (the "native" format).
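The feather comparison looks the same, just with the feather reader and `format = "feather"` on the dataset side (again, the path and grouping column are hypothetical):

```r
# Dataset scan over a single feather file.
ds <- open_dataset("single_file.feather", format = "feather")

# The "native" case: the same data already resident in memory as an Arrow Table.
native <- read_feather("single_file.feather", as_data_frame = FALSE)

# Identical hypothetical query against each.
ds %>% group_by(some_key) %>% summarise(n = n()) %>% collect()
native %>% group_by(some_key) %>% summarise(n = n()) %>% collect()
```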
Any thoughts about an explanation for these oddities, @bkietz?
With datasets, our performance against parquet files is in line with DuckDB's, though our native query processing is considerably slower.
Turning off memory mapping has the impacts I expected:
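For reference, this is roughly what the toggle looks like on the read side; if I remember right, `read_feather()` takes an `mmap` argument (treat that argument as an assumption if you're on a different arrow version):

```r
# Memory-mapped read (the default, assuming `mmap` works as described).
mapped <- read_feather("single_file.feather", as_data_frame = FALSE, mmap = TRUE)

# Buffered read: turn memory mapping off.
buffered <- read_feather("single_file.feather", as_data_frame = FALSE, mmap = FALSE)
```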
We have a few questions that we need to answer:
- should we keep `read_parquet()`/`read_feather()` alongside `open_dataset()`, or just use the dataset processing? I can leave both in and only use one as a default (though if we were to do that, I would change the way they are specified; see the sketch after this list).
- what our defaults should be (and what should be run on every commit in conbench, which should be the same thing, though they could technically be different). I propose:
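On the first question, a hypothetical sketch of what "leave both in, default to one" could look like (the function and argument names here are illustrative, not this package's actual interface):

```r
# Dispatch to either the dataset path or the table readers, defaulting
# to the dataset path. Purely illustrative naming.
read_source <- function(path, reader = c("dataset", "table")) {
  reader <- match.arg(reader)
  fmt <- tools::file_ext(path)  # e.g. "parquet" or "feather"
  if (reader == "dataset") {
    arrow::open_dataset(path, format = fmt)
  } else if (fmt == "parquet") {
    arrow::read_parquet(path, as_data_frame = FALSE)
  } else {
    arrow::read_feather(path, as_data_frame = FALSE)
  }
}
```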
@nealrichardson I'm planning to merge this today, unless there's anything else you would like to change or you need more time to look at it.
Possible scope creep (probably not worth doing right now, if we're going to split source management out into a separate library)