Closed ivirshup closed 3 months ago
Here I was thinking that we should just test against duckdb, so that it's an option instead of the default. However, ibis
now does call duckdb
the "default backend: https://ibis-project.org/posts/why-duckdb/.
I'm still a little against defaulting to duckdb
based on some previous experience. It used to be that the duckdb sqlite extension couldn't be used on older versions of macOS because the distributed binaries relied on a feature only available in newer macOS releases. I don't know if this is still an issue, and am a little unsure how to test it or how old of an OS I was using at the time.
But I'm generally in favor of defaulting to compatibility, with opt-in for performance, especially if the results can differ.
WDYT @thomas-reimonn?
I'm also curious about what the performance of this looks like if the data is stored in parquet or duckdb native formats. Any interest in investigating?
I wrote a short script to convert all tables from .sqlite to .duckdb format. The file size shrinks from 423M (sqlite) to 124M (duckdb).
Running the tests using duckdb with the .duckdb data takes: pytest 31.40s user 3.36s system 180% cpu 19.214 total
Running the tests as usual (almost all sqlite, a couple with duckdb against the .sqlite file) takes: pytest 41.83s user 8.93s system 97% cpu 51.893 total
Running the tests with duckdb and the .sqlite data takes: pytest 49.60s user 8.61s system 192% cpu 30.310 total
Could you share that script? I want to try out a few queries. Also parquet.
Also did you use any indexing (e.g. ORDER BY
, UNIQUE
)?
Here's the sqlite to duckdb conversion script: https://github.com/thomas-reimonn/genomic-features/blob/feat/duckdb-backend/convert-sqlite-to-duckdb.py
As part of the duckdb PR, all queries are sorted by all cols
in the order they're provided. I didn't add any other indexing to the database.
I haven't tried parquet because of the possibility of requiring joins.
Interestingly, I don't see too significant of a performance boost from using the duckdb files
I think this issue is basically done, so will close. Thanks for the implementation!
This is now resolved, and we are discussing other performance topics below.
Description of feature
From past experience, it seems that operations in this library can be sped up quite a bit by using duckdb as the backend. Conveniently, we directly work with sqlite files while using the duckdb backend.
We should run tests with the duckdb backend and document how users can select it.