Closed nhansen closed 1 year ago
Strandseq suggests two phase shifts at 75,508,263 and 75,514,059
Both sifts are almost not present in HiFi reads, but slightly present in ONT reads and present in EBV+Y (both ONT and HiFi)
It is interesting that position 75,508,263 is a change in sequence from ATAT to TGTG. It is difficult to call. I am inclining to call phase shift at least at this position
This appears to be a falsely flagged phase switch due to StrandSeq and extended paternal hapmers, but I am not confident.
Looking at ONT reads mapped to the squashed assembly, the maternal reads (group 1, match the assembly at the position of the rightmost StrandSeq match call) do not contain the supposed MAT_A and MAT_T switched SNP call, and instead continue to match the maternal assembly as expected. The paternal reads (group 2, do not match the assembly at the StrandSeq M, and share a blue C SNP) do contain support for the supposed switched SNP calls to be correctly phased on the paternal haplotype, also matching the haplotype alignment above. The nearest maternal hapmer is about 76 kb away. Maternal reads anchored there do not contain the MAT_T call, but paternal reads that disagree with the reference there do contain this T call, giving further support for a false positive phase switch.
I wonder if that short paternal sequence aligning had anything to do with StrandSeq issues related to short read mapping.
Interestingly, the first of these SNP calls lie at the boundary of an ATATAT --> GTGTGTG repeat.
An earlier called switched SNP call appears at the boundary of a ATATATAT --> AGAGAGAG repeat.
Even though the phasing appears correct: top group 1 matches maternal assembly on the left rather than containing T SNP and differs on the right rather than containing StrandSeq match, so are paternal. Reads on the bottom, group 2, match on the right, appearing maternal, but do not have much support for the MAT_T call. T is called with significant strand bias (9+, 21-), and only is called in ~25% of the reads, rather than an expected 50-50 if this were a true het SNP. Leading me to believe this call is a result of sequencing error, and not true difference between haplotypes.
However, the HiFi reads come closer to 50-50 (64-36 in favor of G over T), which makes it hard to dismiss, but should be taken with a grain of salt as we are seeing low quality reads in a very repetitive region.
Trying to phase from the nearest maternal hapmer also does not paint a very clear picture. Not many reads contain the T SNP calls, and of the ones that do some appear maternal and some appear paternal. The alignment of the paternal haplotype doesn't show that it exists on the other haplotype either.
But DeepVariant PEPPER calls do show this as a true variant, so think I believe it.
So the two variants at 75508358 and 75514087 look like strand-seq mistakes to me. Anchoring from the confirmed M downstream that has marker k-mers (e.g.unique k-mers in the asm) and looking in ONT reads clearly shows those reads which have the paternal haplotype also have the red T base: agreeing with the paternal haplotype of the asm. The second variant is not clear in ONT data, but we can link the two sites via hifi reads since they're close enough: here reads w/red base go to an insertion and a green base, exactly matching the paternal haplotype again. So together these convince me that strand-seq switching is wrong.
As for the upstream, there seems to be a missed variant that wasn't picked up at coordinate 75460936, the maternal hap matches here but there is no difference for the paternal haplotype when the variant looks real The actual mat_T call that has non 50/50 allele frequency looks real and a missed variant in the assembly (the maternal should have another T here). The reason for the weird allele frequency I think is the tandem repeat. There are some reads which insert/delete bases in the alignment in a way that they are equivalent to a T call but the alignment makes them miss it. There is also an insertion in the paternal haplotype nearby (extra TA) which I don't think is real.
Just attaching the example I was talking about on the call:
Note how the reads w/the T variant are much lower frequency (bin 2 w16 vs 29 reads) but there are a bunch of reads with an upstream 2 bp insertion of an AT followed by a deletion of the first GA:
sequence w/insertion followed by deletion: TGTATATATATATATATATA**T**AGAGAGAGA
reference sequence: TGTATATATATATATATATA**G**AGAGAGAGA
Note that it's identical to a T substitution in the lower frequency variant.
I agree, this change from TATATATAT to AGAGAGA has bothered me from the beginning. Often there could be a problem with dynamic programming tools. Recently we have had a separate discussion about the situation when I/D followed by matches and D/I could be due to wrongly chosen alignment paths. Therefore it is better to avoid situation when mismatch and gap open + first extend are equally penalized.
Also, I have found in several situations here that the ratio of variations at specific positions for ONT reads is 25/75
In the end, corrections were made at 75465679 and 75531715, and alignments look clean:
Assembly Region
chr6_MATERNAL:75508263-75514059
Assembly Version
v0.7