marbl / HG002-issues

HG002 human reference genome issue tracking and polishing

Issue: chr18_MATERNAL:18002949-18018878 #262

Closed nhansen closed 1 year ago

nhansen commented 1 year ago

Assembly Region

chr18_MATERNAL:18002949-18018878

Assembly Version

v0.7

ont_evidence subregions

18002949-18018878 (Low)

hifi_evidence subregions

18007869-18014634 (Low)

AndreaGuarracino commented 1 year ago

At least part of the region is annotated as alpha satellite sequence, so we can expect difficulties in aligning reads here. There is a little hole in HiFi coverage (highlighted by the vertical black line in the screenshot), but if we look at the HiFi reads aligned to MATERNAL+Y+ENV only, there are HiFi reads that align through the hole. However, those HiFi reads, which might have been used for assembling the PATERNAL haplotype, do not fully support the corresponding PATERNAL region (see the next screenshot for more detail).

image

For example, I can see the 342bp deletion on the right in both the HiFi reads covering the hole and the PATERNAL haplotype, but the same HiFi reads do not confirm the other deletions present in the PATERNAL haplotype.

image

Strangely, the ONT reads tell a completely different story. Moreover, in the ONT reads aligned against the diploid assembly, there are two little TT insertions on the right (1st screenshot) that are present in the PATERNAL haplotype but not in the MATERNAL. However, in the ONT reads aligned against MATERNAL+Y+ENV, almost all reads show those two little insertions:

image

It seems that both MATERNAL and PATERNAL might need cleaning here.

skoren commented 1 year ago

Flagger has a pretty large region flagged here and I think this is related to #266 on the other haplotype. Definitely will need to trace back the ONT resolution here to see how verkko resolved it.

seryrzu commented 1 year ago

VerityMap shows a solid-k-mer desert of length 5535bp at chr18_MATERNAL:18007740-18013274, meaning that this stretch contains no solid k-mers and is bounded by at least two flanking solid k-mers.

Having no k-mers here makes it theoretically impossible to map HiFi reads to this stretch (with the current parameters of VerityMap), and indeed we observe a coverage gap that is consistent with the HiFi alignments provided by Winnowmap.
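To illustrate what a solid-k-mer desert looks like, here is a minimal toy sketch. It is not VerityMap's implementation: it assumes a "solid" k-mer is simply one that occurs at most `max_count` times in the sequence, and the function names, the sequence, and the parameters are all hypothetical.

```python
from collections import Counter


def solid_kmer_positions(seq, k, max_count=1):
    """Start positions (0-based) of k-mers occurring at most max_count times.

    Assumption: this rarity rule stands in for a real "solid k-mer" definition.
    """
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    return [i for i in range(n) if counts[seq[i:i + k]] <= max_count]


def find_deserts(seq, k, min_len, max_count=1):
    """Return 0-based half-open (start, end) runs of k-mer start positions
    that carry no solid k-mer and are flanked by solid k-mers on both sides."""
    positions = solid_kmer_positions(seq, k, max_count)
    deserts = []
    for left, right in zip(positions, positions[1:]):
        gap = right - left - 1  # non-solid start positions between two solid ones
        if gap >= min_len:
            deserts.append((left + 1, right))
    return deserts


# Toy sequence: unique flanks around a homopolymer run (hypothetical data).
toy = "ACGT" + "A" * 20 + "TGCA"
print(find_deserts(toy, k=3, min_len=5))
```

A read that falls entirely inside such a run has no solid k-mer to anchor it, which is why a k-mer-based mapper would be expected to leave the stretch uncovered.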

What makes things even trickier is that VerityMap doesn't have many primary alignments here, only secondary ones, and these are typically less reliable.

Let's assume that there is no issue in the underlying assembly. Then the lack of coverage, even by secondary alignments, from both tools at chr18_MATERNAL:18007740-18013274 can be explained by either a HiFi drop-out or a solid-k-mer desert.

There is another k-mer desert just upstream, of length 8842bp, at chr18_MATERNAL:17986518-17995359. This region, despite being longer, has secondary alignments from both tools. This makes it less likely that the missing alignments are caused by the solid-k-mer desert alone.
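The lengths quoted for the two deserts can be checked directly from the coordinates. The sketch below assumes 1-based inclusive coordinates (which is consistent with the stated lengths) and that the hifi_evidence subregion from the issue description is on chr18_MATERNAL; the helper names are hypothetical.

```python
def parse_region(region):
    """Split 'contig:start-end' into (contig, start, end)."""
    contig, span = region.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    return contig, start, end


def region_length(region):
    """Length of a region, assuming 1-based inclusive coordinates."""
    _, start, end = parse_region(region)
    return end - start + 1


def overlap_length(a, b):
    """Number of bases shared by two regions (0 if on different contigs)."""
    contig_a, start_a, end_a = parse_region(a)
    contig_b, start_b, end_b = parse_region(b)
    if contig_a != contig_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


desert = "chr18_MATERNAL:18007740-18013274"
upstream_desert = "chr18_MATERNAL:17986518-17995359"
hifi_low = "chr18_MATERNAL:18007869-18014634"  # hifi_evidence subregion (Low)

print(region_length(desert))             # 5535
print(region_length(upstream_desert))    # 8842
print(overlap_length(desert, hifi_low))  # 5406
```

Under these assumptions, most of the 5535bp desert lies inside the low-confidence HiFi subregion reported in the issue.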

HiFi drop-outs are usually associated with some micro-satellite enrichment and there is no evidence of such.

ONT coverage from the Winnowmap alignments is also reduced here (although not to zero).

Together, the last two points suggest that a HiFi coverage drop-out is probably not the explanation.

That, in turn, suggests that there might be an issue in the underlying assembly.

Additional evidence in favor of this is that no HiFi read spans the desert chr18_MATERNAL:18007740-18013274 in either the VerityMap or the Winnowmap alignments (including secondary alignments).
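The "no spanning read" check can be sketched as follows. This is only an illustration: the read names and alignment coordinates are made up, and in practice the intervals would come from the VerityMap or Winnowmap alignment files. Coordinates are assumed to be 1-based inclusive.

```python
def spanning_reads(alignments, region_start, region_end):
    """Names of alignments that cover the whole region.

    alignments: iterable of (read_name, aln_start, aln_end) on the same contig.
    """
    return [
        name
        for name, aln_start, aln_end in alignments
        if aln_start <= region_start and aln_end >= region_end
    ]


# Desert coordinates from the thread.
DESERT_START, DESERT_END = 18007740, 18013274

# Hypothetical alignments, for illustration only.
example_alignments = [
    ("read_a", 17995000, 18008500),  # enters the desert from the left
    ("read_b", 18012000, 18020000),  # enters the desert from the right
    ("read_c", 18009000, 18011000),  # lies entirely inside the desert
]

print(spanning_reads(example_alignments, DESERT_START, DESERT_END))  # []
```

An empty result for both aligners, with secondary alignments included, corresponds to the observation above.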

I agree with @skoren that investigating the Verkko graph here would be helpful.

Screen Shot 2023-02-27 at 23 22 16

nhansen commented 1 year ago

The v0.8 assembly did a much better job with this region. We will patch v0.7 here.

image image