marbl / HG002-issues

HG002 human reference genome issue tracking and polishing

13kb false duplication at chr10_MATERNAL:43,216,000-43,245,000 #662

Open nhansen opened 10 months ago

nhansen commented 10 months ago

Have you confirmed that this issue hasn't already been reported?

Issue location in assembly (use format chromosome:start-end, e.g., chr13_MATERNAL:3740148-9625296)

chr10_MATERNAL:43,216,000-43,245,000

Description of the issue

Sniffles calls on v0.9 picked up a large deletion in this region, and long read data bear it out. Here's an IGV screenshot:

[image: IGV screenshot]

jzook commented 9 months ago

I just came across this as well when looking at DV calls in both HiFi and ONT, and it looks like it is annotated as a segdup between HSats in the browser. Some HG2 and HG4 HiFi reads align across it, so it should be possible to correct it based on HiFi sequence as well as ONT.

jzook commented 7 months ago

One thing I just noticed when looking at NateD's stratifications is that the 11kb region chr10_MATERNAL 43232301 43243056 is a pure C homopolymer! This made chr10 a huge outlier for long C homopolymers :)

nhansen commented 7 months ago

Wow--tagging @skoren so he can possibly figure out where all those C's came from! Just to be picky, though, it's not pure C's for the whole 11kb. There are actually non-C bases at about seven spots, giving eight very long mononucleotide runs!

[image]
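
For anyone who wants to check the run structure on an extracted sequence, here is a minimal sketch of splitting a sequence into mononucleotide runs and listing the interrupting bases. The function names, the toy sequence, and the `min_len` threshold are illustrative assumptions, not something taken from the assembly or from any tool used in this thread.

```python
from itertools import groupby


def base_runs(seq):
    """Return (base, start, length) for every maximal run of identical bases.

    Coordinates are 0-based offsets into `seq`, not assembly coordinates.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        runs.append((base, pos, length))
        pos += length
    return runs


def long_runs(seq, base="C", min_len=21):
    """Runs of `base` that are at least `min_len` long (threshold is an assumption)."""
    return [r for r in base_runs(seq) if r[0] == base and r[2] >= min_len]


def interruptions(seq, base="C"):
    """Runs of any base other than `base`, i.e. the spots that break up the homopolymer."""
    return [r for r in base_runs(seq) if r[0] != base]


# Toy sequence (made up): three long C runs broken by "A" and "GT".
toy = "C" * 30 + "A" + "C" * 25 + "GT" + "C" * 40
c_runs = long_runs(toy)
breaks = interruptions(toy)
```

On a real region, `seq` would be the sequence extracted for the interval of interest; adjacent non-C bases of different types (like the "GT" above) are reported as separate one-base runs.
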
nhansen commented 7 months ago

As you point out, Justin, it should be fairly easy to re-call consensus for this stretch to create a patch.

jzook commented 7 months ago

Ah, I forgot that we merged nearby perfect homopolymers to get this region, so it makes sense that there could be small (<10bp) interruptions between 21+bp perfect homopolymers.
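
The merging described above can be sketched roughly as follows. This is not the actual stratification code: the function names are hypothetical, and the choice to merge only runs of the same base is an assumption. The 21bp minimum and the under-10bp gap come from the comment above.

```python
import re


def perfect_homopolymers(seq, min_len=21):
    """Return (start, end, base) half-open intervals for single-base runs >= min_len."""
    # One base followed by at least (min_len - 1) repeats of that same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def merge_nearby(intervals, max_gap=9):
    """Merge neighbouring intervals of the same base separated by at most max_gap bases.

    Assumption: only runs of the same base are merged; max_gap=9 encodes "<10bp".
    """
    merged = []
    for start, end, base in sorted(intervals):
        if merged and merged[-1][2] == base and start - merged[-1][1] <= max_gap:
            # Extend the previous interval across the small interruption.
            merged[-1] = (merged[-1][0], end, base)
        else:
            merged.append((start, end, base))
    return merged


# Toy sequence (made up): three C runs with 1-2bp interruptions, then an A run, then a C run.
toy = "C" * 30 + "A" + "C" * 25 + "GT" + "C" * 40 + "A" * 50 + "C" * 22
perfect = perfect_homopolymers(toy)
merged = merge_nearby(perfect)
```

In the toy case, the three C runs with small interruptions collapse into one interval that still contains non-C bases, which is the same effect seen in the 11kb region.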