Fixing maximum kmer count kmc flag

oxygen311 commented 5 months ago

Dear CoLoRd developers team!

Hope you are doing well.

Issue:

While working with a CoLoRd in reference mode we have noticed strange behaviour in the case of length of reads << length of reference. Our investigations led to huge nodes in the similarity graph which are way more frequent than expected.

The only reason it usually works is this line: https://github.com/refresh-bio/colord/blob/25b28600d0716805beffab6941eb2c6b5f77014a/src/colord/reads_sim_graph.cpp#L390 But this condition is supposed to be true in the case of proper filtering.

We've noticed this in reference-based mode because that's no such condition to add a node to a graph from pseudo-reads.

Proposed Fix:

The fix is just switching to use the right flag of kmc tool, -cx instead of -cs. Here is a quotation from kmc help:

>  -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
>  -cs<value> - maximal value of a counter (default: 255)
>  -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)

It is also supposed to fix the logic of compression since kmers are supposed to be chosen based on count as well as hash.

Testing:

We have performed thorough testing, including specific scenarios with short reads and long references. Feel free to test it yourself.

Acknowledgments:

A special thanks to @iam28th for their assistance in tracing back to the kmc flag.

Hope this improvement gonna be helpful and improve results.

Best regards, Alexey

marekkokot commented 4 months ago

Hi @oxygen311

It's been a while since I last worked with CoLoRd code, so I must think carefully about this. Thank you for your deep analyses and digging into our code (also, thank you, @iam28th !). I recall that there was a reason to use -cs instead of -cx for KMC, but now I am unsure what that reason was :(.

Oh, I found in an internal conversation of our group from the end of 2020 that we initially had -cx, and we intentionally changed it to -cs and applied filtering when adding to the graph. I guess it was introduced before reference mode, so maybe this is indeed some detail we missed.

I guess you are referring to this part of a paper: "The initial algorithm step is filtering k-mers of the input reads. For this purpose KMC package [28] is executed with k automatically adjusted to the data set. K-mers with less than L = 4 (possibly, sequencing errors), or more than H = 80 occurrences (non-informative) in the entire read set are discarded"

I think this statement is oversimplification, and the H filtering is performed later.

The initial idea remains the same, i.e., we don't want to have non-informative (frequent) k-mers. We faced a problem in the case of high-quality reads (PacBio HiFi). In such a case, frequent k-mers are informative, and discarding them at KMC stage leads to worse reference-reads detection (as far as I am starting to remember now :))

In such a case, frequent K-mers are informative, and discarding them at the KMC stage leads to worse reference-reads finding (as far as I remember :)). On the other hand, keeping them all increases the size of a graph, so we still do some kind of filtering, i.e., adding them to the graph only up to the value of the -H parameter.

I wonder what your data is. Could you share it? I guess your change makes the CoLoRd require less RAM and maybe compute faster. What about the compression ratio? Maybe we should introduce a new special case for a reference genome instead of using cx, but I am not sure yet how it should look. The problem is that we also performed quite a broad testing to produce the final form of code that works well across different kinds of data (at least, we hoped it would work well), and the change you propose may be risky. Have you tried this change for PacBio HiFi?

oxygen311 commented 4 months ago

Hello @marekkokot,

Sorry for the late response!

I think this statement is oversimplification, and the H filtering is performed later. I think it's just a different approach and not a simplification. Implemented logic is different, kmers are not filtered by frequency but are limited. They are just hitting the maximum kmer frequency limit and then new reads are not accepted to the similarity graph with this kmer node. But those added remain.

The problem we've faced can be reproduced with the following data: long genome reference (eg human) and relatively short reads (eg illumina/ont adaptive sampling). I understand that's not the typical case for using but slow down is extreme in that case, up to 100x times. And it may be the case in some other applications.

Have you tried this change for PacBio HiFi? No, I didn't, nor for ONT simplex/duplex high-accuracy reads.

I understand this change may require additional testing, but I think it's crucial for the algorithm to work correctly. If you need help with testing, feel free to ask.

Best regards, Alexey

refresh-bio / colord