marbl / HG002-issues

HG002 human reference genome issue tracking and polishing

rDNA replaced by N's at chr13_MATERNAL:3641000-9777500 #649

Open nhansen opened 10 months ago

nhansen commented 10 months ago

Have you confirmed that this issue hasn't already been reported?

Issue location in assembly (use format chromosome:start-end, e.g., chr13_MATERNAL:3740148-9625296)

chr13_MATERNAL:3641000-9777500

Description of the issue

The assembly has N's in place of accurate rDNA array sequence, as well as suspect sequence on the edges of this region.

jzook commented 8 months ago

Based on DeepVariant (DV) calls in HiFi and ONT, there also appear to be some errors before and after this in the rDNA array, though there are more on the right side due to high coverage, so I'm less sure about those. Maybe we should exclude/fix from 3738000 through 9710000? @nhansen, maybe it makes sense to modify the coordinates of this issue rather than adding new issues?

[Screenshots attached: 2024-01-20 at 9:39:12 PM; 2024-01-20 at 10:58:55 PM]

nhansen commented 8 months ago

@jzook what were the locations/coverage values of the DeepVariant calls you're trusting here? Near the rDNAs, I would not trust any DeepVariant calls that are made using alignments which don't stretch into regions clearly outside the rDNA array, since so many reads are falsely aligned due to the missing rDNA copies in v1.0.1.

The consensus here was called by @Dmitry-Antipov using only alignments of reads aligning outside the rDNA region, and I curated each region to be sure the consensus is backed up by validly aligned ONT reads, so I'm hesitant to label these regions as suspect without looking more carefully. In addition, @steven-solar and I are working on filtering BAM files to include only the alignments that are unambiguously aligned to the correct region, so that errors will be more obvious.
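The criterion discussed here (trust only alignments that reach clearly outside the rDNA array) can be sketched as a simple interval test. This is only an illustration, not the filter being developed for this issue: the function name, the `min_anchor` threshold, and the example read coordinates are hypothetical, and the array bounds are just this issue's coordinates, assumed 1-based and inclusive.

```python
# Hypothetical sketch: keep alignments that are anchored outside the rDNA array.
RDNA_START, RDNA_END = 3_641_000, 9_777_500  # this issue's region on chr13_MATERNAL

def anchored_outside(aln_start, aln_end,
                     region_start=RDNA_START, region_end=RDNA_END,
                     min_anchor=10_000):
    """True if the alignment covers >= min_anchor bases outside the region on either side."""
    # Bases of the alignment lying to the left of the region
    left_flank = max(0, min(aln_end, region_start - 1) - aln_start + 1)
    # Bases of the alignment lying to the right of the region
    right_flank = max(0, aln_end - max(aln_start, region_end + 1) + 1)
    return left_flank >= min_anchor or right_flank >= min_anchor

# Made-up read alignments (start, end) for illustration only
reads = {
    "spans_left_edge": (3_600_000, 3_700_000),   # 41,000 bp of left flank
    "inside_array": (5_000_000, 5_050_000),      # no flank at all
    "barely_outside": (3_639_000, 3_700_000),    # only 2,000 bp of left flank
    "spans_right_edge": (9_700_000, 9_800_000),  # 22,500 bp of right flank
}
kept = [name for name, (s, e) in reads.items() if anchored_outside(s, e)]
print(kept)
```

With a real BAM file the same test would be applied to each alignment's reference start and end; how large the anchor needs to be for an alignment to count as unambiguous is a judgment call this sketch does not settle.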

Let me know what you think--thanks!

jzook commented 8 months ago

@nhansen ah, I'd forgotten you'd curated these regions. These regions do have high coverage so I don't trust most of the variant calls, and it's probably fine to keep them since you curated them.

However, it does look like there might be some small indel errors, which might be because you didn't correct these from ONT? E.g., there's a 1 bp insertion supported by practically all UL ONT reads (as well as many duplex and HiFi reads) at chr13_MATERNAL:3,739,869:

Total count: 173
A : 0
C : 0
G : 172 (99%, 90+, 82- )
T : 1 (1%, 0+, 1- )
N : 0
DEL: 0
INS: 169

and probably a 2 bp insertion here as well, though alignments are noisier due to a TR, at chr13_MATERNAL:3,739,647:

Total count: 178
A : 0
C : 1 (1%, 0+, 1- )
G : 0
T : 177 (99%, 96+, 81- )
N : 0
DEL: 1
INS: 127

Dmitry-Antipov commented 8 months ago

> using only alignments of reads aligning outside the rDNA region

To be more precise, I used ONT reads aligning outside the rDNA region to extract the correct path in the homopolymer-compressed graph constructed from HiFi reads, and then used for consensus only those HiFi reads that do not contradict this path (in homopolymer-compressed space).

So, if there are multiple copies of rDNA that differ only in homopolymer length, it's possible that I recruited HiFi reads from the "wrong" rDNA repeat copy too. I can also imagine a scenario where the HiFi read correction step erroneously "glues" two very slightly different rDNA copies together, and then "wrong" reads could also be used in Verkko's final consensus step.
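The homopolymer-compression concern can be illustrated with a toy example. This is not Verkko's code; the `hpc` helper and both sequences are made up for illustration. It shows two copies that differ only in homopolymer run lengths becoming indistinguishable once compressed.

```python
from itertools import groupby

def hpc(seq):
    """Homopolymer-compress a sequence: collapse each run of identical bases to one base."""
    return "".join(base for base, _ in groupby(seq))

# Two made-up repeat copies that differ only in homopolymer run lengths
copy_a = "ACGGGTTCA"
copy_b = "ACGGTTTCA"

print(hpc(copy_a))                 # ACGTCA
print(hpc(copy_b))                 # ACGTCA
print(copy_a == copy_b)            # False
print(hpc(copy_a) == hpc(copy_b))  # True
```

A read from either copy is compatible with the same compressed path, so a check done only in compressed space cannot tell which copy the read came from.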

nhansen commented 5 months ago

I've decided to widen these rDNA-associated issue regions so they won't be used for benchmarking assemblies.