refresh-bio / KMC

Fast and frugal disk based k-mer counter

feature request: unbounded calculation of coverage histogram & reporting non-empty coverages #124

Open KamilSJaron opened 5 years ago

KamilSJaron commented 5 years ago

Hello,

Thank you for this wonderful tool. Would it be possible to ask for two small features?

We use coverage histograms to estimate genomic properties. In some cases we had to take kmers with extremely high coverage into account to get meaningful estimates (related to #115; something like -cx50000000). This is impractical: we either have to find out the coverage of the most covered kmer first, or overshoot the number and generate an enormous histogram.

I would suggest two subtle changes:

When super repetitive kmers are present, there are usually just a handful of them, so the histogram file is full of coverages with 0 kmers. I suggest reporting only coverages that carry at least one kmer. For instance, instead of

...
15651666    0
15651667    1
15651668    0
15651669    0
15651670    0
15651671    1
15651672    0
...

we would have

...
15651667    1
15651671    1
...

hannesbecher commented 3 years ago

I would like to second this suggestion! Yes, yes.

hannesbecher commented 3 years ago

As an alternative, it would be useful to have the option to write to stdout, so one could filter the output stream with awk etc.

marekkokot commented 3 years ago

Hello, thanks for using kmc and kmc_tools. I know there are a couple of open feature requests, and I hope I will find the time to implement at least some of them. For now, it is in fact possible to write to stdout and filter with awk. If the kmc output is named o, you may use:

bin/kmc_tools -hp transform o histogram /dev/stdout | awk '{if ($2 != 0) {print}}'
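
For anyone who wants to try the filtering step without a KMC database at hand, here is a minimal sketch that runs on the sample rows from the original request. The helper names sparse_hist and max_coverage are hypothetical (they are not part of kmc_tools), and the sketch assumes the histogram has two whitespace-separated columns: coverage, then number of kmers. The first helper is just a shorter spelling of the awk filter above; the second reports the highest coverage that carries at least one kmer.

```shell
# Hypothetical helpers, not part of kmc_tools.
# Assumed input format: "<coverage> <number_of_kmers>" per line.

sparse_hist() {
    # Keep only rows whose kmer count (column 2) is non-zero.
    awk '$2 != 0'
}

max_coverage() {
    # Print the largest coverage (column 1) with a non-zero kmer count;
    # prints 0 if no such row exists.
    awk '$2 != 0 && $1 + 0 > max { max = $1 + 0 } END { printf "%d\n", max }'
}

# Sample rows copied from the request above.
example_hist='15651666 0
15651667 1
15651668 0
15651669 0
15651670 0
15651671 1
15651672 0'

printf '%s\n' "$example_hist" | sparse_hist
printf '%s\n' "$example_hist" | max_coverage
```

With real data, either helper can take the place of the awk stage in the pipeline above, reading from /dev/stdout output of kmc_tools. Note that this only trims what gets written; it does not remove the need to pick a large enough upper bound when counting, which is the other half of the request.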