tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0

add samples to a previous run #24

Closed jjdevega closed 10 months ago

jjdevega commented 1 year ago

Thanks for kmtricks; we have incorporated it into one of our lab pipelines, with a significant improvement in computing time.

We use kmtricks to generate binary presence/absence matrices from x samples, each consisting of 2-4 fastq files (.fq.gz). These files are large, and one goal is to remove them from storage once computing is done.

Our usage is fairly simple:

kmtricks pipeline --mode kmer:pa:bin
kmtricks aggregate --pa-matrix kmer --format text

My question: I want to incorporate z additional samples at a later date and recalculate everything, but without bringing back the reads for the previous x samples, i.e. adding the new samples from their fastq files to the output of the previous run.

Is this possible? I have looked through the wiki for ideas, but I could not find anything suggesting that it is possible or where to start.

Thanks for your help.

tlemane commented 1 year ago

Hello,

Unfortunately, it is not possible yet. However, this is at the top of my to-do list and I have already started the implementation. I will let you know as soon as a testable version is available.

Note that not all matrices can be merged: only matrices built with the same minimizer distribution function can. In the recent release v1.3.0, you can use a new parameter, --repart-from, which reuses the distribution function of an existing kmtricks run. So while waiting for the merge feature, I suggest building the matrices with the same function to make them ready for merging.

Ex:

kmtricks pipeline --file matrix_1.txt --run-dir ./matrix_1
kmtricks pipeline --file matrix_2.txt --run-dir ./matrix_2 --repart-from ./matrix_1
kmtricks pipeline --file matrix_3.txt --run-dir ./matrix_3 --repart-from ./matrix_1
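
For batches that arrive later, the same pattern can be wrapped in a small helper. This is only a sketch: the function name run_batch and the KMTRICKS variable are hypothetical, and it assumes the input files are named matrix_N.txt as in the example above. It uses only the options shown in this thread.

```shell
# run_batch REF N
# Build batch N in ./matrix_N, reusing the partitioning of the existing run REF.
# KMTRICKS is a hypothetical override: set it to "echo kmtricks" for a dry run
# that prints the command instead of executing it.
run_batch() {
    ref=$1
    n=$2
    ${KMTRICKS:-kmtricks} pipeline --file "matrix_$n.txt" \
        --run-dir "./matrix_$n" --repart-from "$ref"
}
```

For instance, `run_batch ./matrix_1 4` would build a fourth matrix against the first run's partitioning, so it stays compatible with the earlier ones.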

I hope this helps.

Teo

tlemane commented 1 year ago

Hello,

I still have to make some changes, but you can already test the feature on the dev branch. A release and the Docker/conda packages should be available next week.

Installation

git clone --recursive https://github.com/tlemane/kmtricks.git
cd kmtricks
git checkout dev
./install.sh

Usage

kmtricks pipeline --run-dir ./matrices/mat1
kmtricks pipeline --run-dir ./matrices/mat2 --repart-from ./matrices/mat1
kmtricks pipeline --run-dir ./matrices/mat3 --repart-from ./matrices/mat1
kmtricks combine --fof fof.txt --output ./new_matrix

With fof.txt:

./matrices/mat1
./matrices/mat2
./matrices/mat3
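
If there are many runs, fof.txt can be generated instead of written by hand. A minimal sketch, assuming every subdirectory of ./matrices is a finished run built with the same partitioning (the function name make_fof is hypothetical):

```shell
# make_fof DIR OUT
# Write each subdirectory of DIR to OUT, one path per line, in glob order.
# Assumption: every subdirectory of DIR is a kmtricks run directory.
make_fof() {
    dir=$1
    out=$2
    : > "$out"                       # start from an empty file
    for run in "$dir"/*/; do
        [ -d "$run" ] || continue    # skip when the glob matches nothing
        printf '%s\n' "${run%/}" >> "$out"   # drop the trailing slash
    done
}
```

Calling `make_fof ./matrices fof.txt` in the layout above yields the three lines shown, and the file can then be passed to kmtricks combine --fof.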

Let me know if you encounter any issues.

Teo