Open nhansen opened 10 months ago
Based on DeepVariant (DV) calls from HiFi and ONT, there also appear to be some errors before and after this in the rDNA array, though the right side has more of them because of high coverage, so I'm less sure about those. Maybe we should exclude/fix 3738000 through 9710000? @nhansen maybe it makes sense to modify the coordinates of this issue rather than adding new issues?
@jzook what were the locations/coverage values of the DeepVariant calls you're trusting here? Near the rDNAs, I would not trust any DeepVariant calls that are made using alignments which don't stretch into regions clearly outside the rDNA array, since so many reads are falsely aligned due to the missing rDNA copies in v1.0.1.
The consensus here was called by @Dmitry-Antipov using only alignments of reads aligning outside the rDNA region, and I curated each region to be sure the consensus is backed up by validly-aligned ONT reads, so I'm hesitant to label these regions as suspect without looking more carefully. In addition, @steven-solar and I are working on filtering bam files to include only the alignments that are unambiguously aligned to the correct region, so that errors will be more obvious.
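As a rough illustration of the criterion described above (not the actual filtering being developed), the sketch below keeps an alignment only if it extends a minimum distance outside the rDNA array on at least one side. The `Alignment` tuple, the `anchored_outside` helper, and the 10 kb `min_anchor` threshold are all assumptions made for this example, and the array coordinates are simply the ones from the issue template example.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    # Hypothetical minimal record; a real filter would read these from a BAM file.
    read_name: str
    start: int  # 0-based reference start
    end: int    # reference end (exclusive)


def anchored_outside(aln: Alignment, region_start: int, region_end: int,
                     min_anchor: int = 10_000) -> bool:
    """Return True if the alignment covers at least min_anchor reference
    bases outside [region_start, region_end) on the left or the right."""
    left_flank = max(0, min(aln.end, region_start) - aln.start)
    right_flank = max(0, aln.end - max(aln.start, region_end))
    return max(left_flank, right_flank) >= min_anchor


# Illustrative rDNA array bounds (from the issue template example).
RDNA_START, RDNA_END = 3_740_148, 9_625_296

alignments = [
    Alignment("spans_left_edge", 3_700_000, 3_750_000),   # 40,148 bp outside
    Alignment("inside_array", 5_000_000, 5_050_000),      # 0 bp outside
    Alignment("barely_outside", 3_740_000, 3_760_000),    # 148 bp outside
]

kept = [a.read_name for a in alignments
        if anchored_outside(a, RDNA_START, RDNA_END)]
print(kept)
```

Only the read that stretches well past the array boundary is kept; reads placed entirely inside the array, or with only a token overhang, are dropped.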
Let me know what you think--thanks!
@nhansen ah, I'd forgotten you'd curated these regions. These regions do have high coverage so I don't trust most of the variant calls, and it's probably fine to keep them since you curated them.
DEL: 0 INS: 169
DEL: 1 INS: 127
"using only alignments of reads aligning outside the rDNA region"
To be more precise, I used ONT reads aligning outside the rDNA region to extract the correct path in the homopolymer-compressed graph constructed from HiFi reads, and then used for consensus only those HiFi reads that do not contradict this path (in homopolymer-compressed space).
So, if there are multiple rDNA copies that differ only in homopolymer run lengths, it's possible that I also recruited HiFi reads from the "wrong" rDNA repeat copy. I can also imagine a scenario in which the HiFi read correction step erroneously "glues" two very slightly different rDNA copies together, and then "wrong" reads could also be used in Verkko's final consensus step.
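To make the concern concrete, here is a minimal sketch of homopolymer compression showing how two copies that differ only in run lengths become indistinguishable. The `hpc` function and the toy sequences are made up for illustration and are not taken from Verkko.

```python
from itertools import groupby


def hpc(seq: str) -> str:
    """Homopolymer-compress a sequence: collapse each run of identical
    bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


# Two toy "rDNA copies" that differ only in homopolymer run lengths.
copy_a = "ACGGGTTTAC"
copy_b = "ACGGTTTTTAC"

print(copy_a == copy_b)            # the raw sequences differ
print(hpc(copy_a), hpc(copy_b))    # the compressed forms are identical
```

In compressed space both copies read `ACGTAC`, so a read from either copy is consistent with the same path and cannot be excluded on that basis alone.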
I've decided to widen these rDNA-associated issue regions so they won't be used for benchmarking assemblies.
Have you confirmed that this issue hasn't already been reported?
Issue location in assembly (use format chromosome:start-end, e.g., chr13_MATERNAL:3740148-9625296)
chr13_MATERNAL:3641000-9777500
Description of the issue
The assembly has N's in place of accurate rDNA array sequence, as well as suspect sequence on the edges of this region.
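For anyone checking coordinates against this issue, a small sketch of parsing the chromosome:start-end format and testing interval containment may help. `Region`, `parse_region`, and `contains` are hypothetical helpers written for this example, and placing the 3738000-9710000 range from the discussion on chr13_MATERNAL is an assumption.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse a 'chromosome:start-end' string into a Region."""
    m = re.fullmatch(r"(\S+):(\d+)-(\d+)", text.strip())
    if m is None:
        raise ValueError(f"not in chromosome:start-end format: {text!r}")
    chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    if start > end:
        raise ValueError(f"start is greater than end: {text!r}")
    return Region(chrom, start, end)


def contains(outer: Region, inner: Region) -> bool:
    """True if inner lies entirely within outer on the same chromosome."""
    return (outer.chrom == inner.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


widened = parse_region("chr13_MATERNAL:3641000-9777500")
# Range suggested earlier in the thread; chromosome assumed to be chr13_MATERNAL.
suggested = parse_region("chr13_MATERNAL:3738000-9710000")

print(contains(widened, suggested))
```

Under that assumption, the widened issue region fully covers the range suggested in the discussion.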