tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0

Big kmtricks index obtained #11

Closed by ColineGardou 2 years ago

ColineGardou commented 3 years ago

Dear authors, we have indexed 94 RNA-seq files (total fastq.gz: 201Gb) and obtained a 441Gb kmtricks index. This looks big compared to your supplementary tables. A merged Jellyfish index made with DEkupl's joincount for the same dataset was only 23Gb. We are wondering whether we are doing something wrong. I'm attaching my code below. Thanks!

Attachments: kmtricks.txt, whole_matrix.txt, fof_mondor.txt

tlemane commented 3 years ago

Hello,

Thanks for trying kmtricks. Just to clarify, kmtricks can be used for two things: 1) building a membership index made of Bloom filters (the supplementary tables relate to this feature), and 2) building a k-mer count matrix. Since you use DEkupl, I assume you need a count matrix, right?

I have not noticed any problems in your commands. The size difference could be explained by the k-mer filtering (--count-abundance-min in kmtricks and --lower-count in Jellyfish). DEkupl joincount also uses -r (--recurrence-min in kmtricks) and -a (no direct equivalent, but it can probably be simulated with --merge-abundance-min X --save-if 0; I will check).
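To illustrate why these thresholds can change the matrix size so much, here is a toy Python sketch. It is not the kmtricks or DEkupl implementation: the function `filter_matrix` and its parameters `abundance_min` and `recurrence_min` are hypothetical names that only loosely mirror the flags above, and the rule it applies (keep a k-mer if it reaches the abundance threshold in enough samples) is an assumption made for illustration.

```python
def filter_matrix(rows, abundance_min, recurrence_min):
    """Keep k-mers seen with count >= abundance_min in >= recurrence_min samples.

    rows maps a k-mer to its list of per-sample counts (toy data, hypothetical API).
    """
    kept = {}
    for kmer, counts in rows.items():
        # Number of samples in which this k-mer passes the abundance threshold.
        recurrence = sum(1 for c in counts if c >= abundance_min)
        if recurrence >= recurrence_min:
            kept[kmer] = counts
    return kept


# Toy matrix: 3 k-mers across 3 samples.
toy = {
    "AAC": [1, 0, 0],  # likely a sequencing error: seen once in one sample
    "ACG": [5, 7, 0],
    "CCG": [2, 2, 3],
}

loose = filter_matrix(toy, abundance_min=1, recurrence_min=1)
strict = filter_matrix(toy, abundance_min=2, recurrence_min=2)
print(len(loose), len(strict))
```

With the loose thresholds every row is kept, while the stricter ones drop the singleton k-mer. On real RNA-seq data, low-abundance k-mers usually make up a large share of the rows, so different thresholds in the two pipelines can plausibly account for a large size gap.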

Also please note that Jellyfish and kmtricks produce equivalent but not identical outputs because of canonical k-mers. For optimization reasons, kmtricks considers A < C < T < G instead of A < C < G < T.
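A small Python sketch of what this means in practice. It is not taken from either tool: `revcomp` and `canonical` are illustrative helpers, and the canonical form is assumed to be the smaller of a k-mer and its reverse complement under the given nucleotide order.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(kmer):
    """Reverse complement of a DNA k-mer."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def canonical(kmer, order):
    """Smaller of kmer and its reverse complement under the given base order."""
    rank = {base: i for i, base in enumerate(order)}
    return min(kmer, revcomp(kmer), key=lambda s: [rank[b] for b in s])


kmer = "TAC"  # reverse complement is "GTA"
print(canonical(kmer, "ACGT"))  # order A < C < G < T
print(canonical(kmer, "ACTG"))  # order A < C < T < G
```

The two orders pick different strands here ("GTA" versus "TAC"), but both strings denote the same canonical k-mer class, so the counts are comparable once one output is converted to the other's representation.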

A new version of kmtricks is coming soon, probably next week if I can finish the documentation. It is faster and more efficient, and includes new features, utilities and an API, especially for dealing with kmtricks files. If you want it before the release, just send me an email.

I hope this helps.

Téo