twolinin / longphase

GNU General Public License v3.0

Question on supplementary alignment phasing and tagging #51

Open aysegokce opened 9 months ago

aysegokce commented 9 months ago

Hello,

I wonder how longphase handles supplementary alignments during phasing and haplotagging: are they treated as separate reads, or do they inherit the HP/PS tags of the primary alignment?

Thank you, Ayse

linsindian commented 9 months ago

Hi, @aysegokce,

We treat them as separate entities during processing, that is, as distinct reads, rather than having them inherit tags from the primary alignment.

In our phase module, the current approach involves calculating the number of reads passing between two SNPs to determine phasing. The reads passing through are not distinguished based on whether they are primary or supplementary alignments.
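As an illustration of the read-counting idea described above, here is a minimal sketch. It is not LongPhase's actual implementation: the function name `phase_pair`, the allele encoding (0 = ref, 1 = alt), and the simple majority vote are all assumptions made for the example. Primary and supplementary alignments are simply passed in as separate entries.

```python
from collections import Counter

def phase_pair(reads, snp_a, snp_b):
    """Decide the relative phase of two heterozygous SNPs from read support.

    reads: iterable of dicts mapping SNP position -> observed allele
           (0 = ref, 1 = alt); one dict per alignment, whether primary
           or supplementary (hypothetical encoding).
    Returns "cis", "trans", or None when there is no evidence or a tie.
    """
    votes = Counter()
    for alleles in reads:
        # Only alignments covering both SNPs are informative for this pair.
        if snp_a in alleles and snp_b in alleles:
            same = alleles[snp_a] == alleles[snp_b]
            votes["cis" if same else "trans"] += 1
    if votes["cis"] == votes["trans"]:
        return None
    return "cis" if votes["cis"] > votes["trans"] else "trans"

reads = [
    {100: 0, 250: 0},  # ref/ref on the same alignment
    {100: 1, 250: 1},  # alt/alt on the same alignment
    {100: 1, 250: 0},  # conflicting alignment (e.g. sequencing error)
    {250: 1},          # covers only one SNP, so it is ignored
    {100: 0, 250: 0},
]
result = phase_pair(reads, 100, 250)
print(result)
```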

In haplotag, we analyze the proportion of SNPs belonging to hp1 and hp2 on a single read to decide which haplotype to tag it with. As this is done independently, it's possible for alignments with the same read ID to belong to different haplotypes if the primary alignment resembles hp1 while the supplementary alignment resembles hp2.
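Because each alignment is tagged independently, it can be useful to list the read IDs whose alignments ended up on different haplotypes. The sketch below works on plain tuples; the tuple layout and the function name `discordant_read_ids` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def discordant_read_ids(alignments):
    """Return read IDs whose tagged alignments carry more than one HP value.

    alignments: iterable of (read_id, is_supplementary, hp) tuples, where
                hp is 1, 2, or None for an untagged alignment.
    """
    tags = defaultdict(set)
    for read_id, _is_supplementary, hp in alignments:
        if hp is not None:  # untagged alignments cannot conflict
            tags[read_id].add(hp)
    return sorted(rid for rid, hps in tags.items() if len(hps) > 1)

alignments = [
    ("r1", False, 1), ("r1", True, 2),     # primary HP1, supplementary HP2
    ("r2", False, 1), ("r2", True, 1),     # consistent
    ("r3", False, 2), ("r3", True, None),  # supplementary untagged
]
discordant = discordant_read_ids(alignments)
print(discordant)
```

With a real BAM file, the tuples could presumably be built from pysam's `AlignedSegment.query_name`, `is_supplementary`, and the `HP` tag of each record.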

We've noticed cases where primary and supplementary alignments are distant, not belonging to the same phase block. In such situations, we believe directly inheriting haplotype and phase block information from the primary alignment isn't ideal. Given your involvement in Severus's development process, we value any insights or observations you may have on this matter. Do you have any suggestions?

Additionally, I'm interested in the potential impact on Severus if primary and supplementary alignments are assigned to different haplotypes or phase sets. Could you explain how Severus handles this scenario and why it requires attention?

Looking forward to our continued communication.

Best regards, Sin-dian

aysegokce commented 9 months ago

Hi @linsindian, Thanks for the explanation!

We've also noticed the disparity in phasing caused by directly inheriting the haplotype from the primary alignment. I totally agree that this is not ideal, and we prefer them to be handled separately, as you have described.

I don't know how feasible it is, but a second iteration that matches the haplotypes of primary and supplementary alignments (when they are on the same chromosome) may help with the randomness in the haplotype assignment. If the primary alignment is HP1 and the supplementary is HP2, the haplotype assignment can be switched for the entire phase block of the supplementary alignment (0|1 to 1|0 and vice versa), so both become HP1.

In Severus, we are also actively working on improving how we assign a haplotype to an SV. We treat primary and supplementary alignments as separate reads for phasing purposes. We check the haplotypes of the reads supporting a breakpoint, and if the support for one haplotype is above a threshold, we assign that haplotype to the breakpoint. However, that approach was causing misclassifications, simply because inherited haplotypes in supplementary alignments do not necessarily match the primary alignments in the same region.
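The threshold-based assignment described above could look roughly like the following. This is a sketch only, not Severus's code: the function name, the 0.7 cut-off, and the choice to compute the fraction over phased support reads only are assumptions.

```python
from collections import Counter

def breakpoint_haplotype(support_hps, min_fraction=0.7):
    """Assign a haplotype to a breakpoint from its supporting alignments.

    support_hps:  HP values (1, 2, or None for untagged) of the alignments
                  supporting the breakpoint.
    min_fraction: hypothetical threshold on the share of phased support
                  that must agree on one haplotype.
    Returns 1, 2, or None if no haplotype reaches the threshold.
    """
    phased = [hp for hp in support_hps if hp is not None]
    if not phased:
        return None
    hp, count = Counter(phased).most_common(1)[0]
    return hp if count / len(phased) >= min_fraction else None

clear_case = breakpoint_haplotype([1, 1, 1, 2, None])  # 3 of 4 phased reads are HP1
mixed_case = breakpoint_haplotype([1, 2, 1, 2])        # evenly split support
print(clear_case, mixed_case)
```

If supplementary alignments inherit HP from a primary alignment in another block, the inherited values enter `support_hps` and can push a breakpoint toward the wrong haplotype, which is the misclassification mentioned above.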

We want to learn more about phasing tools and how they handle supplementary alignments so that we can develop a more generalized approach. Our current plan is to make using haplotags in supplementary alignment optional simply because, if inherited, it leads to more problems than it solves, especially in complex SV clustering.

I would like to hear if you have any suggestions.

Thank you Ayse

ythuang0522 commented 9 months ago

Hi @aysegokce, if both the primary and supplementary alignments are in the same block, your suggested switch (i.e., HP2 to HP1) is possible. But we had another idea: how about we tag "somatic breakpoint reads" as "HP3" instead? That is, reads tagged as HP3 span SV breakpoints that are specific to the tumor and not found in the normal. The motivation is that the primary and supplementary alignments of a breakpoint read may not be in the same block, although more statistics are required to confirm this. We could tag these somatic breakpoint reads as HP3 regardless of the underlying phased blocks. In addition, since these somatic haplotypes/reads originate from either the normal HP1 or HP2 haplotype, distinguishing somatic breakpoint reads as HP3 might be more biologically relevant. However, this would require Severus to change its implementation as well. What are your thoughts on us providing HP3?

aysegokce commented 9 months ago

Hello @ythuang0522, I agree that the altered haplotype differs from the normal and can be defined as a third haplotype, especially in complex events. From Severus' perspective, since we are not "phasing" SVs but instead using phased reads, I would like to know which haplotype is changed because it is an altered version of one of the haplotypes in normal. For our use, assigning HP3 to the somatic breakpoint reads would be similar to an unphased read.

We use the haplotype information for (i) haplotype-specific VAF calculation, which provides additional information on whether the event is clonal or subclonal, and (ii) the haplotype-aware breakpoint graph construction. This information becomes crucial when there are overlapping events in both haplotypes.

It is also important for the downstream analysis. We observed cases with multiple SVs in a gene, and knowing whether they are on the same haplotype helped us to decide if there is a wild-type copy of that gene.

From our tests, a small portion of the reads (compared to the total number of reads in a sample) have a supplementary alignment in a different phase block. But since that also means the affected genome size is larger, those events are most likely to be part of a complex and/or functionally important event.

Thank you Ayse

ythuang0522 commented 9 months ago

Hi @aysegokce

> I don't know how feasible it is, but a second iteration that matches the haplotypes of primary and supplementary alignments (when they are on the same chromosome) may help with the randomness in the haplotype assignment. If the primary alignment is HP1 and the supplementary is HP2, the haplotype assignment can be switched for the entire phase block of the supplementary alignment (0|1 to 1|0 and vice versa), so both become HP1.

If I understood correctly, when the primary/supplementary alignments are in different blocks, you prefer either combining the blocks into one or switching the HP1/HP2 tags so that the same read has a consistent haplotype in both blocks. Yes, this is possible, and we will see how to implement it.

> From Severus' perspective, since we are not "phasing" SVs but instead using phased reads, I would like to know which haplotype is changed because it is an altered version of one of the haplotypes in normal. For our use, assigning HP3 to the somatic breakpoint reads would be similar to an unphased read.

From the haplotagging perspective, I would not view somatic reads tagged as HP3 as the same as unphased reads (assuming you mean untagged reads). We tag a read if its haplotype similarity is above a threshold (>0.6 to either HP1 or HP2). Otherwise, the read is left untagged due to low confidence (e.g., too many sequencing errors). In addition, reads that span too few SNPs tend to be untagged. I suspect shorter supplementary alignments may suffer from this issue, though we haven't looked into it. I hope this helps clarify how the tagging works.
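To make the tagging rule concrete, here is a simplified sketch of a per-alignment decision using the >0.6 similarity threshold mentioned above. It is not LongPhase's code: the function name, the input dictionaries, and the `min_snps` parameter are hypothetical, and real haplotagging would also have to deal with base qualities, indels, and phase-set boundaries.

```python
def tag_alignment(observed, phased, threshold=0.6, min_snps=2):
    """Tag one alignment as HP1, HP2, or leave it untagged.

    observed:  dict position -> base seen on this alignment.
    phased:    dict position -> (hp1_base, hp2_base) from a phased VCF.
    threshold: similarity required to tag (>0.6, as stated in the thread).
    min_snps:  hypothetical minimum number of informative SNPs.
    Returns 1, 2, or None (untagged).
    """
    hp1 = hp2 = 0
    for pos, base in observed.items():
        if pos not in phased:
            continue
        hp1_base, hp2_base = phased[pos]
        if base == hp1_base:
            hp1 += 1
        elif base == hp2_base:
            hp2 += 1
        # Bases matching neither haplotype are treated as errors and skipped.
    informative = hp1 + hp2
    if informative == 0 or informative < min_snps:
        return None
    if hp1 / informative > threshold:
        return 1
    if hp2 / informative > threshold:
        return 2
    return None

phased = {10: ("A", "G"), 20: ("C", "T"), 30: ("G", "A"), 40: ("T", "C")}
primary = tag_alignment({10: "A", 20: "C", 30: "G", 40: "C"}, phased)  # 3 of 4 match HP1
supplementary = tag_alignment({30: "A", 40: "C"}, phased)              # 2 of 2 match HP2
short = tag_alignment({10: "A"}, phased)                               # too few SNPs
print(primary, supplementary, short)
```

The example also shows how a primary and a supplementary alignment of the same read can receive different tags when each is evaluated on its own SNPs.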

We have noticed that the percentage of tagged reads varies a lot across the genome. A few of the cases we spotted are related to CNVs. We once discussed whether it would be helpful to improve the tagging sensitivity in CNV regions. It would be great if you could give us some clues about how you utilize these untagged reads in Severus; then we would know which improvements may be helpful to you. We can also arrange a short online meeting if you think it's helpful.

Yao-Ting

aysegokce commented 8 months ago

Hello @ythuang0522,

> If I understood correctly, when the primary/supplementary alignments are in different blocks, you prefer either combining the blocks into one or switching the HP1/HP2 tags so that the same read has a consistent haplotype in both blocks. Yes, this is possible, and we will see how to implement it.

Suppose the primary alignment is in phase block 1 with HP1 and the supplementary alignment is in phase block 2 with HP2. Is it possible to swap the tags in phase block 2, so that all HP1 reads become HP2 and vice versa? This is partially similar to Hi-C phasing, to a lesser extent.

> From the haplotagging perspective, I would not view somatic reads tagged as HP3 as the same as unphased reads (assuming you mean untagged reads). We tag a read if its haplotype similarity is above a threshold (>0.6 to either HP1 or HP2). Otherwise, the read is left untagged due to low confidence (e.g., too many sequencing errors). In addition, reads that span too few SNPs tend to be untagged. I suspect shorter supplementary alignments may suffer from this issue, though we haven't looked into it. I hope this helps clarify how the tagging works.

Thank you for the explanation. My concern is when there are two SVs in a phase block, all the reads supporting these breakpoints will be assigned to HP3, if I understood correctly. And in this case, we cannot distinguish if both junctions are in the same haplotype or not. Our main purpose in using phased reads is to make this distinction.

> We have noticed that the percentage of tagged reads varies a lot across the genome. A few of the cases we spotted are related to CNVs.

Is this with tagging or phasing? We also observed allelic imbalance-related problems in phasing, especially in tumor-only runs. But I wasn't aware that could be a problem in haplotagging as well.

In Severus, for the phasing part, we only consider phased reads. For the complex SV clustering, we look at the nearby phased SVs (if there are any) and we choose the haplotype maximizing the consistency in local and segment-wise coverage.

I would definitely like to talk more. I learned a lot in this discussion.

Thank you Ayse

linsindian commented 8 months ago

Hello, I would like to confirm two matters:

Firstly: If primary and supplementary are in the same phase set, is it expected that they should be assigned the same haplotype?

We have observed instances in the current LongPhase version where different haplotypes are assigned within the same phase set. As shown in the figure, five yellow regions originate from the same read ID and are all within the same phase set. One is assigned to HP1, two to HP2, and two remain unphased. Would it be beneficial for Severus's subsequent use to unify them to be the same as primary, or to unify them into the more frequently observed haplotype?

[image: 1709798110452]

Secondly: If primary and supplementary are in different phase sets, is it expected to maintain separate haplotype determination rather than inheritance?

Based on the preceding discussion, the expectation is to conduct individual haplotype determination first and then use the tag information to help unify HP1 and HP2 across the two blocks. Under normal circumstances, HP1 and HP2 in different blocks are not guaranteed to be the same haplotype on the chromosome. However, if the primary in block 1 is HP1 and the supplementary in block 2 is HP2, then HP1 and HP2 of all alignments in block 2 can be swapped accordingly, aiming for as much haplotype consistency across blocks as possible.
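The cross-block unification described here can be sketched as a vote over reads that link two phase sets. This is only an illustration under stated assumptions: the tuple layout, the function names, and the simple majority rule are hypothetical, both blocks are assumed to lie on the same chromosome, and chains of more than two blocks or conflicting links are not resolved.

```python
from collections import Counter

def blocks_to_flip(links):
    """Find pairs of phase sets whose HP labels disagree across split reads.

    links: iterable of (ps_a, hp_a, ps_b, hp_b), one per read whose primary
           alignment is tagged hp_a in phase set ps_a and whose
           supplementary alignment is tagged hp_b in phase set ps_b.
    Returns a set of (first_ps, second_ps) pairs, meaning the HP labels of
    second_ps should be swapped relative to first_ps.
    """
    votes = {}
    for ps_a, hp_a, ps_b, hp_b in links:
        if ps_a == ps_b:
            continue  # same block: nothing to reconcile between blocks
        key = tuple(sorted((ps_a, ps_b)))
        counter = votes.setdefault(key, Counter())
        counter["same" if hp_a == hp_b else "swap"] += 1
    return {pair for pair, c in votes.items() if c["swap"] > c["same"]}

def flip_hp(hp):
    """Swap HP1 and HP2; leave untagged (None) or other values unchanged."""
    return {1: 2, 2: 1}.get(hp, hp)

links = [
    (1000, 1, 5000, 2),  # primary HP1 in block 1000, supplementary HP2 in block 5000
    (1000, 2, 5000, 1),
    (1000, 1, 5000, 1),  # one read disagrees with the majority
    (1000, 1, 9000, 1),  # blocks 1000 and 9000 already agree
]
to_flip = blocks_to_flip(links)
print(to_flip)
```

After the vote, every alignment in the second block of each returned pair would have `flip_hp` applied to its HP value.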

If my understanding of both matters is accurate, we will endeavor to incorporate these functionalities in subsequent versions. Thank you for your assistance in this discussion; it has been immensely helpful for the development of LongPhase.

Best, Sin-Dian

aysegokce commented 8 months ago

Hello!

> If primary and supplementary are in the same phase set, is it expected that they should be assigned the same haplotype?

That is a common assumption. A mismatch is possible, for example through a template switch in DNA repair before the rearrangement that leads to the split alignment, but I am unsure how common that would be. The position of the reads and haplotypes is a bit unexpected; the only alignment with HP1 is between two HP2s. But from the coverage profile, the region itself also looks suspicious. Is this a borderline case or something you have observed commonly?

> Would it be beneficial for Severus's subsequent use to unify them to be the same as primary, or to unify them into the more frequently observed haplotype?

Do you have any measures for quality/confidence for the haplotype? Would that be the primary alignment since it is the longest one in most cases?

> If primary and supplementary are in different phase sets, is it expected to maintain separate haplotype determination rather than inheritance?

Yes, definitely. In different phase blocks, inheritance is confusing. I would prefer them to be as you described, but if that is not possible, it is better to keep them separate.

Thank you for incorporating these functionalities, which would help Severus' performance as well! Ayse

Yijun-Tian commented 4 months ago

It should be fine to include secondary and supplementary alignments in the haplotag step. But for the phasing and variant calling steps, it's better to remove those alignments by FLAG, e.g. with -F 2308.
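For reference, the mask 2308 can be decoded against the SAM FLAG bits. The short sketch below (helper names are hypothetical) shows which alignments such a filter would drop: 2308 = 4 + 256 + 2048, i.e. unmapped, secondary, and supplementary records.

```python
# SAM FLAG bits as defined in the SAM format specification.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}

def decode(mask):
    """List the names of the FLAG bits set in mask."""
    return [name for bit, name in FLAG_BITS.items() if mask & bit]

def is_excluded(flag, mask=2308):
    """True if an alignment with this FLAG would be dropped by an exclusion mask."""
    return bool(flag & mask)

bits = decode(2308)
print(bits)
print(is_excluded(2064))  # supplementary + reverse strand
print(is_excluded(16))    # primary alignment on the reverse strand
```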

ythuang0522 commented 4 months ago

@Yijun-Tian Thank you for your comments. We throw away secondary alignments during phasing and tagging but still include supplementary alignments in the hope of improving the phasing range across SVs. My interpretation is, in your experience, that the removal of supplementary would improve the accuracy of variant calling and phasing. Is that true?

Yijun-Tian commented 4 months ago

> @Yijun-Tian Thank you for your comments. We throw away secondary alignments during phasing and tagging but still include supplementary alignments in the hope of improving the phasing range across SVs. My interpretation is, in your experience, that the removal of supplementary would improve the accuracy of variant calling and phasing. Is that true?

Sorry, I don't have an example to support my opinion. It's just a feeling that variant calling and phasing should use the reads that have the highest mapping accuracy.