tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0

Efficiently extracting counts from a subset of kmers? #36

Closed by nikostr 3 months ago

nikostr commented 4 months ago

I have run kmdiff and identified kmers that are overrepresented between two groups. I then created a membership matrix to identify the kmers present in all my case samples, and intersected these with the overrepresented kmers from kmdiff. Now I would like to get the counts of the resulting kmers in each of my case samples. I already have the count matrices produced by kmdiff. Dumping them to text and grepping is one way to do it, but clearly not a very efficient one. What would you recommend? Unfortunately, my C++ is terrible.

nikostr commented 3 months ago

I posted this question before I understood the merge and aggregate commands. In case someone else runs into the same issue, this is how I solved it:

kmtricks merge \
    --recurrence-min $N_CASES \
    --cpr \
    --run-dir kmdiff-count \
    --threads 16

kmtricks aggregate \
    --run-dir kmdiff-count \
    --matrix kmer \
    --format text \
    --cpr-in \
    --output count-matrix.out \
    --threads 16

The first command creates a matrix of the kmers that occur in at least as many samples as I have cases (N_CASES), and the second command dumps it as a text file. I then grepped count-matrix.out for the kmers I had identified previously.
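For anyone who wants to avoid a slow grep over a large matrix, the lookup step can be sketched with awk, which loads the kmer list into a hash table and streams the matrix once. This is only a sketch under assumptions: the function name `filter_kmers` and the file names in the usage line are hypothetical, the list is assumed to hold one kmer per line, and each row of count-matrix.out is assumed to be whitespace-separated with the kmer in the first column.

```shell
# Hypothetical helper: print the matrix rows whose first column is a wanted kmer.
# $1: file with one kmer per line (assumed format)
# $2: text matrix, kmer first, then one count per sample (assumed format)
filter_kmers() {
    # While reading the first file, store each kmer as an array key.
    # While reading the second file, print rows whose first field is a stored key.
    awk 'FILENAME == ARGV[1] { wanted[$1]; next } $1 in wanted' "$1" "$2"
}
```

Usage would look like `filter_kmers kmers-of-interest.txt count-matrix.out > case-counts.txt`. Matching is exact on the first column, so the kmers in the list need to be written the same way as in the matrix.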

Note: with this count matrix, it should be possible to find these kmers without creating the membership matrix at all.
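A rough sketch of that idea: keep only the rows of the text matrix that have a non-zero count in every case sample. The function name `rows_present_in_all` is hypothetical, and the sketch assumes the count columns follow the kmer column and that you know which column positions correspond to your case samples (check this against your own input file order).

```shell
# Hypothetical helper: print rows with a non-zero count in every listed sample.
# $1: comma-separated, 1-based sample indices (sample 1 is the column right after the kmer)
# $2: text matrix, kmer first, then one count per sample (assumed format)
rows_present_in_all() {
    awk -v cols="$1" '
        BEGIN { n = split(cols, idx, ",") }
        {
            # Skip the row as soon as one of the selected samples has a zero count.
            for (i = 1; i <= n; i++)
                if ($(idx[i] + 1) + 0 == 0) next
            print
        }
    ' "$2"
}
```

For example, `rows_present_in_all 1,2,3 count-matrix.out` would keep the kmers seen in the first three samples, if those are the cases. The output rows keep their counts and can then be intersected with the kmdiff kmers.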