maickrau / GraphAligner

MIT License

How do I use the diploid heuristic? #94

Open JingaJenga opened 10 months ago

JingaJenga commented 10 months ago

I noticed the new parameters --diploid-heuristic and --diploid-heuristic-cache in the new version of GraphAligner. This looks like a super cool feature and I'm excited to try it out - but I have no idea how. Can I get a brief explanation of what the diploid heuristic is, and how it works? And can I get a recommendation for what K value(s) it might make sense to use with --diploid-heuristic?

I couldn't find any documentation about the diploid heuristic in the README, in the source code, in any issues or pull requests on this repo, or in the publication.

maickrau commented 10 months ago

The parameters are hidden because it's meant for a very specific use case. You can use them with the same parameters we will use in an upcoming version of verkko: --diploid-heuristic 21 31. For the parameter to work it requires that the graph is a diploid assembly graph, and the reads are from the exact same sample as the graph. Alignment to pangenome graphs will not be improved by the parameter.

The way the heuristic works is that it detects haplotype-specific k-mers (with k given by the parameters, in this case 21-mers and 31-mers), builds an index of the haplotype-specific k-mers and their nodes, matches the haplotype-specific k-mers to the reads, and then forbids alignments to nodes from the other, non-matching haplotype. So for the heuristic to work, each node in the graph must either fully match the read's haplotype or come from a different haplotype. This is the reason why it won't help with pangenome graphs, since this guarantee does not hold there.
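As a rough illustration of the steps described above (not GraphAligner's actual implementation), here is a toy Python sketch. It assumes a k-mer counts as haplotype specific when it occurs in exactly one node, that a hypothetical `other_haplotype` map pairing each node with its counterpart on the other haplotype is already known, and it ignores reverse complements. The names `build_index` and `forbidden_nodes` are made up for this example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(nodes, k):
    """Map each k-mer that occurs in exactly one node to that node.

    Assumption for this toy model: 'occurs in exactly one node' stands in
    for 'haplotype specific'.
    """
    owners = defaultdict(set)
    for name, seq in nodes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    return {kmer: next(iter(names))
            for kmer, names in owners.items() if len(names) == 1}


def forbidden_nodes(read, index, other_haplotype, k):
    """Nodes the read should not align to, given its matching k-mers."""
    matched = {index[kmer] for kmer in kmers(read, k) if kmer in index}
    blocked = {other_haplotype[n] for n in matched if n in other_haplotype}
    return blocked - matched


# Toy diploid bubble: the two nodes differ by a single base.
nodes = {"hap1_node": "ACGTACGGA", "hap2_node": "ACGTTCGGA"}
other_haplotype = {"hap1_node": "hap2_node", "hap2_node": "hap1_node"}

index = build_index(nodes, 4)
read = "TTACGTACGGAAA"  # carries the hap1 allele
print(forbidden_nodes(read, index, other_haplotype, 4))
```

With several k values (such as 21 and 31), one simple option in this toy model would be to build one index per k and take the union of the forbidden sets. Whether GraphAligner combines them that way is not stated in this thread.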

The cache parameter --diploid-heuristic-cache temporary_cache_file_name is for when you run GraphAligner multiple times on the same graph. In that case you would first run it with an empty read file to generate the cache, and once that has finished you can run multiple jobs in parallel using the same cache. This computes the haplotype specific k-mers only once instead of every time GraphAligner starts.
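The cache workflow could be scripted along these lines. This is only a sketch that builds the command lines without running them. The `-g`, `-f`, `-a` and `-x vg` options are taken from the basic example in the GraphAligner README and are an assumption here (verkko's real invocation may use different alignment settings), and all file names are placeholders.

```python
import shlex


def diploid_commands(graph, read_files, cache, empty_reads, ks=(21, 31)):
    """Build one cache-generating command plus one command per read file.

    File names are placeholders; "-x vg" is an assumed preset.
    """
    shared = ["GraphAligner", "-g", graph, "-x", "vg",
              "--diploid-heuristic", *[str(k) for k in ks],
              "--diploid-heuristic-cache", cache]
    # Step 1: run once with an empty read file to generate the cache.
    warmup = shared + ["-f", empty_reads, "-a", "warmup.gaf"]
    # Step 2: these can run in parallel, all reusing the same cache file.
    jobs = [shared + ["-f", reads, "-a", reads + ".gaf"]
            for reads in read_files]
    return warmup, jobs


warmup, jobs = diploid_commands(
    "graph.gfa", ["part1.fa", "part2.fa"], "kmers.cache", "empty.fa")
print(shlex.join(warmup))
for job in jobs:
    print(shlex.join(job))
```

The warm-up command has to finish before any of the per-file jobs are started, since they all read the same cache file.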

JingaJenga commented 10 months ago

Wow, thank you so much for the quick and thorough explanation! I didn't realize you were cross-developing between GraphAligner and verkko, but it totally makes sense.