Open AdmiralenOla opened 2 years ago
Upon closer reading of your paper I see that you are aware of this already:
Another limitation is for variants that differ only by isolated SNVs separated by long conserved genomic regions longer than the read length which may not be accurately inferred by CliqueSNV. While such situations usually do not occur for viruses, where mutations are typically densely concentrated in different genomic regions, we plan to address this limitation in the next version of CliqueSNV.
Is this still planned for an upcoming version?
Hello, @AdmiralenOla
Sorry for the late reply. Yes, we are aware of this behavior. It's not really clear what to do in such cases. Some samples may have plenty of such isolated SNVs. For example, if you look at coronavirus data, the genome is pretty long and mutations are spread over ranges longer than one read can cover. And we may have sites with 5-50% variant frequency.
Should we try to attach such "orphan" mutations? To which haplotype, then? Unfortunately, it is not clear where to get the information needed to make this decision.
Even if we have just two pairs of linked SNVs far from each other, it is not clear whether they come from the same haplotype or from different ones. So we report two haplotypes.
Those are shortcomings of the technology.
Thanks for your reply, @vtsyvina. I agree that this is a limitation of the technology.
However, you noted in your paper that you had a plan to address this limitation, and that got me curious. For example, some type of probabilistic framework that assumes the haplotypes have similar SNV frequency at all sites may in some cases be used to assign full-length haplotypes.
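To make the frequency idea concrete, here is a minimal sketch that groups distant sites whose variant frequencies are close to each other. It is only an illustration of the assumption that a haplotype shows a similar SNV frequency at all of its sites; it is not how CliqueSNV works. The function name `group_sites_by_frequency`, the tolerance `tol`, and the example positions and frequencies are all hypothetical.

```python
def group_sites_by_frequency(site_freqs, tol=0.03):
    """Greedy 1-D clustering of SNV sites by variant frequency.

    site_freqs: dict mapping genomic position -> variant allele frequency.
    tol: hypothetical tolerance; a site joins the current group when its
         frequency is within tol of the previous site's frequency.
    Returns a list of groups, each a list of positions.
    """
    # Sort sites by frequency so that similar frequencies are adjacent.
    items = sorted(site_freqs.items(), key=lambda kv: kv[1])
    groups = []
    for pos, freq in items:
        if groups and freq - groups[-1][-1][1] <= tol:
            groups[-1].append((pos, freq))
        else:
            groups.append([(pos, freq)])
    # Drop the frequencies and keep only the positions in each group.
    return [[pos for pos, _ in group] for group in groups]


# Made-up example: two sites near 45% and two sites near 10%.
example = {1200: 0.45, 9800: 0.44, 15000: 0.10, 21000: 0.11}
print(group_sites_by_frequency(example))
```

Sites that land in the same group would be candidates for the same haplotype. This breaks down when two haplotypes have nearly equal frequencies (for example close to 50/50), because frequency alone then cannot say which allele belongs to which haplotype.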
Dear CliqueSNV team,
I've been experimenting with your tool and think perhaps I have found a bug. If there is a single, isolated SNV with no other SNVs in linkage within the mapping reads, i.e. distance to other SNVs is greater than read length, that SNV is never called in any of the haplotypes.
I'm trying to understand the algorithm described in your paper, and I guess this makes sense, because these SNVs are not in cliques with any other SNVs(?) But for some types of data it will mean that common haplotypes will not be present in the results. In one of my examples, there is a clear 45/55% distribution between C and T at a particular site, and the total read depth is around 30,000.
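The condition described above (no other SNV within read length) can be checked with a short sketch like the one below. It assumes single-end reads of a fixed length and a plain list of SNV positions; the function name `isolated_snvs` and the example numbers are hypothetical and not part of CliqueSNV.

```python
def isolated_snvs(positions, read_length):
    """Return SNV positions that cannot share a read with any other SNV.

    Two SNVs can be covered by the same read only if their distance is
    smaller than read_length (assumption: single-end, fixed-length reads).
    """
    pos = sorted(positions)
    isolated = []
    for i, p in enumerate(pos):
        # Only the nearest neighbours on each side need to be checked.
        near_left = i > 0 and p - pos[i - 1] < read_length
        near_right = i < len(pos) - 1 and pos[i + 1] - p < read_length
        if not (near_left or near_right):
            isolated.append(p)
    return isolated


# Made-up example: only the SNV at 1000 has no neighbour within 250 bases.
print(isolated_snvs([100, 150, 1000, 5000, 5100], read_length=250))
```

Any position returned here has no linkage partner within a single read, so it cannot be part of a clique with other SNVs.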
Graphic presentation of my problem.
I can provide bam files for testing if you'd like.