scverse / genomic-features

Genomic Features in Python from BioConductor's AnnotationHub
https://genomic-features.readthedocs.io
BSD 3-Clause "New" or "Revised" License
18 stars 5 forks source link

Test against duckdb #58

Closed ivirshup closed 3 months ago

ivirshup commented 3 months ago

This is now resolved, and we are discussing other performance topics below.

Description of feature

From past experience, it seems that operations in this library can be sped up quite a bit by using duckdb as the backend. Conveniently, we directly work with sqlite files while using the duckdb backend.

We should run tests with the duckdb backend and document how users can select it.

ivirshup commented 3 months ago

Here I was thinking that we should just test against duckdb, so that it's an option instead of the default. However, ibis now does call duckdb the "default backend: https://ibis-project.org/posts/why-duckdb/.

I'm still a little against defaulting to duckdb based on some previous experience. It used to be that the duckdb sqlite extension couldn't be used on older versions of macOS because the distributed binaries relied on a feature only available in newer macOS releases. I don't know if this is still an issue, and am a little unsure how to test it or how old of an OS I was using at the time.

But I'm generally in favor of defaulting to compatibility, with opt-in for performance, especially if the results can differ.

WDYT @thomas-reimonn?

ivirshup commented 3 months ago

I'm also curious about what the performance of this looks like if the data is stored in parquet or duckdb native formats. Any interest in investigating?

thomas-reimonn commented 3 months ago

I wrote a short script to convert all tables from .sqlite to .duckdb format. The file size shrinks from 423M (sqlite) to 124M (duckdb).

Running the tests using duckdb with the .duckdb data takes: pytest 31.40s user 3.36s system 180% cpu 19.214 total

Running the tests as usual (almost all sqlite, a couple with duckdb against the .sqlite file) takes: pytest 41.83s user 8.93s system 97% cpu 51.893 total

Running the tests with duckdb and the .sqlite data takes: pytest 49.60s user 8.61s system 192% cpu 30.310 total

ivirshup commented 3 months ago

Could you share that script? I want to try out a few queries. Also parquet.

Also did you use any indexing (e.g. ORDER BY, UNIQUE)?

thomas-reimonn commented 3 months ago

Here's the sqlite to duckdb conversion script: https://github.com/thomas-reimonn/genomic-features/blob/feat/duckdb-backend/convert-sqlite-to-duckdb.py

As part of the duckdb PR, all queries are sorted by all cols in the order they're provided. I didn't add any other indexing to the database.

thomas-reimonn commented 3 months ago

I haven't tried parquet because of the possibility of requiring joins.

ivirshup commented 3 months ago

Interestingly, I don't see too significant of a performance boost from using the duckdb files

ivirshup commented 3 months ago

I think this issue is basically done, so will close. Thanks for the implementation!