marbl / HG002-issues

HG002 human reference genome issue tracking and polishing
10 stars 0 forks source link

Issue: chr6_MATERNAL:75508263-75514059 #368

Closed nhansen closed 1 year ago

nhansen commented 1 year ago

Assembly Region

chr6_MATERNAL:75508263-75514059

Assembly Version

v0.7

msikic commented 1 year ago

Strandseq suggests two phase shifts at 75,508,263 and 75,514,059

Screenshot 2023-02-15 at 8 38 47 AM
msikic commented 1 year ago

Both sifts are almost not present in HiFi reads, but slightly present in ONT reads and present in EBV+Y (both ONT and HiFi)

Screenshot 2023-02-15 at 8 19 25 AM Screenshot 2023-02-15 at 8 16 03 AM
msikic commented 1 year ago

It is interesting that position 75,508,263 is a change in sequence from ATAT to TGTG. It is difficult to call. I am inclining to call phase shift at least at this position

steven-solar commented 1 year ago

This appears to be a falsely flagged phase switch due to StrandSeq and extended paternal hapmers, but I am not confident.

Looking at ONT reads mapped to the squashed assembly, the maternal reads (group 1, match the assembly at the position of the rightmost StrandSeq match call) do not contain the supposed MAT_A and MAT_T switched SNP call, and instead continue to match the maternal assembly as expected. The paternal reads (group 2, do not match the assembly at the StrandSeq M, and share a blue C SNP) do contain support for the supposed switched SNP calls to be correctly phased on the paternal haplotype, also matching the haplotype alignment above. The nearest maternal hapmer is about 76 kb away. Maternal reads anchored there do not contain the MAT_T call, but paternal reads that disagree with the reference there do contain this T call, giving further support for a false positive phase switch.

Screen Shot 2023-02-15 at 3 19 52 PM

Screen Shot 2023-02-15 at 3 19 45 PM

I wonder if that short paternal sequence aligning had anything to do with StrandSeq issues related to short read mapping.

Screen Shot 2023-02-15 at 1 21 02 PM

Screen Shot 2023-02-15 at 1 21 09 PM

Interestingly, the first of these SNP calls lie at the boundary of an ATATAT --> GTGTGTG repeat.

Screen Shot 2023-02-15 at 1 28 17 PM

An earlier called switched SNP call appears at the boundary of a ATATATAT --> AGAGAGAG repeat.

Screen Shot 2023-02-15 at 1 55 29 PM

Even though the phasing appears correct: top group 1 matches maternal assembly on the left rather than containing T SNP and differs on the right rather than containing StrandSeq match, so are paternal. Reads on the bottom, group 2, match on the right, appearing maternal, but do not have much support for the MAT_T call. T is called with significant strand bias (9+, 21-), and only is called in ~25% of the reads, rather than an expected 50-50 if this were a true het SNP. Leading me to believe this call is a result of sequencing error, and not true difference between haplotypes.

Screen Shot 2023-02-15 at 2 03 33 PM

Screen Shot 2023-02-15 at 2 03 16 PM

However, the HiFi reads come closer to 50-50 (64-36 in favor of G over T), which makes it hard to dismiss, but should be taken with a grain of salt as we are seeing low quality reads in a very repetitive region.

Screen Shot 2023-02-15 at 2 13 29 PM

Trying to phase from the nearest maternal hapmer also does not paint a very clear picture. Not many reads contain the T SNP calls, and of the ones that do some appear maternal and some appear paternal. The alignment of the paternal haplotype doesn't show that it exists on the other haplotype either.

Screen Shot 2023-02-15 at 3 23 22 PM

Screen Shot 2023-02-15 at 3 23 17 PM

But DeepVariant PEPPER calls do show this as a true variant, so think I believe it.

Screen Shot 2023-02-16 at 9 22 47 AM Screen Shot 2023-02-16 at 9 22 36 AM
skoren commented 1 year ago

So the two variants at 75508358 and 75514087 look like strand-seq mistakes to me. Anchoring from the confirmed M downstream that has marker k-mers (e.g.unique k-mers in the asm) and looking in ONT reads clearly shows those reads which have the paternal haplotype also have the red T base: Screen Shot 2023-02-16 at 2 27 39 PM agreeing with the paternal haplotype of the asm. The second variant is not clear in ONT data, but we can link the two sites via hifi reads since they're close enough: Screen Shot 2023-02-16 at 2 29 39 PM here reads w/red base go to an insertion and a green base, exactly matching the paternal haplotype again. So together these convince me that strand-seq switching is wrong.

As for the upstream, there seems to be a missed variant that wasn't picked up at coordinate 75460936, the maternal hap matches here but there is no difference for the paternal haplotype when the variant looks real Screen Shot 2023-02-16 at 2 42 39 PM The actual mat_T call that has non 50/50 allele frequency looks real and a missed variant in the assembly (the maternal should have another T here). The reason for the weird allele frequency I think is the tandem repeat. There are some reads which insert/delete bases in the alignment in a way that they are equivalent to a T call but the alignment makes them miss it. There is also an insertion in the paternal haplotype nearby (extra TA) which I don't think is real.

skoren commented 1 year ago

Just attaching the example I was talking about on the call: Screen Shot 2023-02-17 at 3 05 13 PM

Note how the reads w/the T variant are much lower frequency (bin 2 w16 vs 29 reads) but there are a bunch of reads with an upstream 2 bp insertion of an AT followed by a deletion of the first GA:

sequence w/insertion followed by deletion: TGTATATATATATATATATA**T**AGAGAGAGA
reference sequence:                        TGTATATATATATATATATA**G**AGAGAGAGA

Note that it's identical to a T substitution in the lower frequency variant.

msikic commented 1 year ago

I agree, this change from TATATATAT to AGAGAGA has bothered me from the beginning. Often there could be a problem with dynamic programming tools. Recently we have had a separate discussion about the situation when I/D followed by matches and D/I could be due to wrongly chosen alignment paths. Therefore it is better to avoid situation when mismatch and gap open + first extend are equally penalized.

Also, I have found in several situations here that the ratio of variations at specific positions for ONT reads is 25/75

nhansen commented 1 year ago

In the end, corrections were made at 75465679 and 75531715, and alignments look clean:

image