twolinin / longphase

GNU General Public License v3.0

True insertion is phased but not the supporting reads #47

Closed f-ferraro closed 3 months ago

f-ferraro commented 6 months ago

Dear developers, Thank you for your work on longphase!

We used longphase to first phase various vcfs (generated by DeepVariant, and modcall, and various SV callers) and then used them to tag the reads in a bam file. We are generally satisfied with the performance! :)

However, we encountered a case of a single-base insertion (that we know to be true and heterozygous) that has been correctly phased in the VCF (GT:GQ:DP:AD:VAF:PL:PS 0|1:24:35:18,16:0.457143:24,0,30:48783330), but when we look at the bam file, the reads are not phased: the reads supporting the variant appear on both haplotypes. We looked at other indels in the same bam, and there the reads have been phased as we would expect from the information in the vcfs. Our guess is that the other variants have SNPs in their proximity that longphase haplotag uses, whereas the single-base insertion has no nearby SNP, so its reads are not phased.

Could you please shed some light on this? Thank you in advance! :)
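For readers who want to check the same thing in their own VCF, here is a minimal sketch that pairs the FORMAT keys with the sample values quoted above and tests whether the genotype is a phased heterozygote. The function names `parse_sample_field` and `is_phased_het` are hypothetical helpers written for this illustration; they are not part of longphase.

```python
def parse_sample_field(format_keys: str, sample_values: str) -> dict:
    """Pair colon-separated VCF FORMAT keys with the sample's values."""
    keys = format_keys.split(":")
    values = sample_values.split(":")
    return dict(zip(keys, values))


def is_phased_het(sample: dict) -> bool:
    """True if GT uses the phased separator '|' and the alleles differ."""
    gt = sample.get("GT", "")
    if "|" not in gt:
        return False
    alleles = gt.split("|")
    return "." not in alleles and len(set(alleles)) > 1


# Values copied from the record quoted in this issue.
record = parse_sample_field(
    "GT:GQ:DP:AD:VAF:PL:PS",
    "0|1:24:35:18,16:0.457143:24,0,30:48783330",
)
ref_depth, alt_depth = (int(x) for x in record["AD"].split(","))
print(record["GT"], record["PS"], ref_depth, alt_depth, is_phased_het(record))
```

A phased GT with a PS value only says the variant was placed in a phase block; it does not by itself guarantee that the reads carrying it received a haplotype tag in the BAM.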

ythuang0522 commented 6 months ago

@f-ferraro In the current version, the haplotag module tags reads using only SNPs (and SVs), not indels. This was intended to cope with the large number of indel errors in R9 flow cell data. Therefore, if a read carrying the indel covers no other SNPs, that indel-only read will not be tagged (although indels are still used by the phasing algorithm). That said, we have never seen indel-only reads before. It would be great to have a snapshot from IGV or a partial bam around the indel/SNPs, as there are other possible explanations (e.g., a read is tagged only if it has >60% similarity to one haplotype).

We have noticed the significant reduction of indel errors in R10.4/Q20 reads. The next release will support read tagging with indels, which should be able to tag your indel-only reads.
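The behaviour described above can be illustrated with a small sketch. It is not longphase's actual implementation: the function `tag_read`, its input layout, and the interpretation of ">60% similarity" as a fraction of matching informative variants are all assumptions made for illustration. It shows why a read that covers no usable variant ends up untagged.

```python
from typing import List, Optional, Tuple

# Each observation is (allele seen in the read, haplotype-1 allele,
# haplotype-2 allele) at one phased heterozygous variant the read covers.
Observation = Tuple[int, int, int]


def tag_read(observations: List[Observation],
             min_fraction: float = 0.6) -> Optional[int]:
    """Return 1 or 2 for the supported haplotype, or None if untagged.

    Assumption: a read is tagged only when more than `min_fraction` of its
    informative variants agree with a single haplotype.
    """
    hap1_votes = 0
    hap2_votes = 0
    for read_allele, hap1_allele, hap2_allele in observations:
        if read_allele == hap1_allele:
            hap1_votes += 1
        elif read_allele == hap2_allele:
            hap2_votes += 1
    total = hap1_votes + hap2_votes
    if total == 0:
        # No informative variant on the read (e.g. an indel-only read when
        # indels are ignored for tagging): the read stays untagged.
        return None
    if hap1_votes / total > min_fraction:
        return 1
    if hap2_votes / total > min_fraction:
        return 2
    return None


indel_only_read = tag_read([])
consistent_read = tag_read([(1, 0, 1), (0, 1, 0), (1, 0, 1)])
ambiguous_read = tag_read([(0, 0, 1), (1, 0, 1)])
print(indel_only_read, consistent_read, ambiguous_read)
```

Under these assumptions, the empty observation list models the reported case: once indels are excluded, a read over the insertion has nothing left to vote with.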

f-ferraro commented 6 months ago

Hi, @ythuang0522, thank you for the explanation!

Please find below a snapshot of the region in IGV (unfortunately I cannot share more). You can see the variant of interest (single base insertion) in the middle and around it homozygous variants on both sides (the rightmost is a hom single base deletion even if it's hard to see at this distance).

[Image: IGV snapshot of the region]

Do you perhaps have an idea of when the next longphase version will be released? Looking forward to it!

ythuang0522 commented 6 months ago

We may release a minor version that includes indel tagging in about a month. It would be helpful if you could mail us the indel-only bam/vcf region for development and testing, e.g. samtools view -b input.bam "Chr10:18000-45500" > output.bam

ythuang0522 commented 3 months ago

Hi @f-ferraro, v1.7 enables indel tagging when the provided VCF contains phased indels. Indel tagging is complicated by the presence of ONT indel errors. Without any ground truth for tagging, we simply reuse the heuristics from the phasing stage to predict whether a read contains an indel. Your feedback is welcome.

f-ferraro commented 3 months ago

Hi @ythuang0522! Thank you for working this out and for the heads-up. I tried phasing with v1.7 and it worked great.

Thank you!