sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

sketching files containing many small sequences: `manysketch` is astonishingly fast #3252

Open ctb opened 1 month ago

ctb commented 1 month ago

I'm trying to sketch the RVDB, the Reference Viral Database. The clustered file is ~600 MB.

sourmash scripts manysketch C-RVDBvCurrent.manysketch.csv -o C-RVDBvCurrent.manysketch.zip -p dna,k=21,scaled=1000 --singleton

took about 5 minutes.

sourmash sketch dna -p k=21 C-RVDBvCurrent.fasta.gz -o C-RVDBvCurrent.sig.zip --singleton

didn't finish in 24 hours.
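
For anyone reproducing this: the CSV passed to `manysketch` above is a small list of input files. Below is a minimal Python sketch of writing one. It assumes the `name,genome_filename,protein_filename` header used by the branchwater plugin (check the docs for your installed version); the helper name `write_manysketch_csv` is hypothetical.

```python
import csv


def write_manysketch_csv(csv_path, rows):
    """Write a manysketch-style input CSV.

    rows: iterable of (name, genome_filename, protein_filename) tuples;
    pass "" for a file type you do not have.
    Assumption: this header matches what the installed plugin expects.
    """
    with open(csv_path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["name", "genome_filename", "protein_filename"])
        for row in rows:
            writer.writerow(row)


# Example call (the label "C-RVDBvCurrent" is illustrative only):
# write_manysketch_csv(
#     "C-RVDBvCurrent.manysketch.csv",
#     [("C-RVDBvCurrent", "C-RVDBvCurrent.fasta.gz", "")],
# )
```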

What's the reason!? As I understand it, manysketch isn't multithreaded when reading a single FASTA file, so the speedup can't come from multithreading. Presumably it's just the Python for-loop penalty and/or the use of screed!? Wow.
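
One way to test the for-loop hypothesis is to time a parse-only pass over the file: if reading every record in plain Python takes minutes rather than hours, the slowdown is more likely in the per-record sketching and saving than in the loop itself. The sketch below uses only the standard library, not screed, so it gives a rough lower bound and says nothing exact about what `sourmash sketch` does; `iter_fasta` and `time_parse_only` are hypothetical helper names.

```python
import gzip
import time


def iter_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    name, chunks = None, []
    with opener(path, "rt") as fp:
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def time_parse_only(path):
    """Return (n_records, n_bases, seconds) for reading without sketching."""
    start = time.perf_counter()
    n_records = 0
    n_bases = 0
    for _, seq in iter_fasta(path):
        n_records += 1
        n_bases += len(seq)
    return n_records, n_bases, time.perf_counter() - start


# Example call:
# print(time_parse_only("C-RVDBvCurrent.fasta.gz"))
```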

On a mostly unrelated note, the sig.zip file is larger than the FASTA file. So that sucks.
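
To see where the bytes in the `.sig.zip` go, one option is to total the archive entries by top-level directory. This is a minimal sketch using Python's standard `zipfile` module; it assumes nothing about sourmash's internal layout beyond the file being an ordinary zip archive, and `summarize_zip` is a hypothetical helper name. If the archive turns out to hold one small entry per sequence, fixed per-entry overhead would be a plausible contributor to the size, but that is a guess to check rather than a confirmed cause.

```python
import zipfile
from collections import defaultdict


def summarize_zip(path):
    """Return {prefix: (n_entries, compressed_bytes, uncompressed_bytes)}.

    Entries are grouped by their first path component; files at the
    archive root are grouped under "(top level)".
    """
    totals = defaultdict(lambda: [0, 0, 0])
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if "/" in info.filename:
                prefix = info.filename.split("/", 1)[0]
            else:
                prefix = "(top level)"
            entry = totals[prefix]
            entry[0] += 1
            entry[1] += info.compress_size
            entry[2] += info.file_size
    return {key: tuple(value) for key, value in totals.items()}


# Example call:
# for prefix, stats in summarize_zip("C-RVDBvCurrent.sig.zip").items():
#     print(prefix, stats)
```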

ctb commented 1 month ago

and on a further somewhat unrelated note, fastgather took even less time than sketching.

ctb commented 1 month ago

and even more so, to add a sketch it is faster to

than it is to run sig cat to combine the old database with a new sketch 😭