Efficiently extracting counts from subset of kmers?

tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections

GNU Affero General Public License v3.0

I posted this question before I understood the merge and aggregate commands. In case someone else has the same issue, I solved it by doing the following:

kmtricks merge \
    --recurrence-min $N_CASES \
    --cpr \
    --run-dir kmdiff-count \
    --threads 16

kmtricks aggregate \
    --run-dir kmdiff-count \
    --matrix kmer \
    --format text \
    --cpr-in \
    --output count-matrix.out \
    --threads 16

The first command builds a matrix containing only the kmers that occur in at least as many samples as I have cases (N_CASES); the second command dumps that matrix to a text file. I then grepped count-matrix.out for the kmers I had identified previously.
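For anyone repeating the lookup step, here is a minimal sketch of one way to do it. It uses awk rather than plain grep so that only the first column is matched. It assumes each line of the text matrix starts with the kmer, followed by whitespace-separated counts, and that the list holds one kmer per line, written in the same form (for example, the same strand orientation) as in the matrix. The function name and the file names kmers.txt and subset-counts.out are hypothetical.

```shell
#!/usr/bin/env bash

# extract_kmer_counts KMER_LIST MATRIX
# Print the matrix rows whose first field is one of the k-mers in KMER_LIST.
# Assumes: one k-mer per line in KMER_LIST; k-mer in the first
# whitespace-separated column of MATRIX.
extract_kmer_counts() {
    awk '
        NR == FNR { want[$1]; next }   # first file: remember each k-mer
        $1 in want                     # second file: print matching rows
    ' "$1" "$2"
}
```

Usage would look like `extract_kmer_counts kmers.txt count-matrix.out > subset-counts.out`. Matching on the first field avoids substring hits and loads the list once, which tends to scale better than one grep per kmer.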

Note: using this count matrix it should be possible to find these kmers without creating the membership matrix.
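One way to read this note: a count greater than zero means the kmer is present in that sample, so presence/absence can be derived from the count columns on the fly. The criterion I originally used to pick the kmers is not spelled out here, so the "present in at least MIN samples" rule below is purely illustrative. The function name is hypothetical, and the same column layout (kmer first, then one count per sample) is assumed.

```shell
#!/usr/bin/env bash

# kmers_present_in MIN MATRIX
# Print every k-mer that has a non-zero count in at least MIN samples.
# Assumes: k-mer in column 1, one count column per sample after it.
kmers_present_in() {
    awk -v min="$1" '
        {
            n = 0
            for (i = 2; i <= NF; i++)   # walk the per-sample count columns
                if ($i > 0) n++         # non-zero count = present in sample
            if (n >= min) print $1
        }
    ' "$2"
}
```

For example, `kmers_present_in 10 count-matrix.out` would list the kmers seen in ten or more samples; a different selection rule would only need a different test inside the loop.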

Efficiently extracting counts from subset of kmers? #36