marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Confirm ChrX CHM1 gap is fixed #2

Closed skoren closed 2 years ago

skoren commented 3 years ago

ChrX in CHM1 is split by a 1.5 kb gap that is spanned by ONT reads. Confirm that the latest version fixes the gap.

skoren commented 2 years ago

Still present at approximately 34 Mbp in compressed space. Need to check in the corrected graph.

skoren commented 2 years ago

This still appears to be an issue in my latest run with corrected reads. I see reads mapping across a gap in the HiFi resolved graph, which indicates an overlap, but the nodes are not connected in the final graph. The run is on biowulf under /data/korens/test/mikko_cns/chm1/snakemake/bri_asm/. The details on the nodes are below:

Candidate joining reads:

17989a43-e3cf-44c8-aaa5-e71cda54fb55
4e7c0ea0-f6ba-4ab9-ab0a-a0d0d601975c
7f3ab207-1829-4b2a-b683-a134ba3efce0
8075751a-bdb3-491e-aa7e-937618a16634
c94415bf-0411-4287-8bd3-4929e1e58c50
cffbe44a-0f54-40c2-b934-8a3bd8371eae
d3c46a82-e680-4d7a-a222-7fd184a9d42f
e0f42659-f2ce-41d1-9082-611c509ff470

Mappings in the all.gaf file:

4e7c0ea0-f6ba-4ab9-ab0a-a0d0d601975c    26583   151     17652   +       >utig1-69059    1122934 1105404 1122934 17322   17638   60      NM:i:316        AS:f:15396.4    dv:f:0.0179159  id:f:0.982084
d3c46a82-e680-4d7a-a222-7fd184a9d42f    39635   10990   39628   +       <utig1-69059    1122934 0       28762   28195   29026   60      NM:i:831        AS:f:23103.5    dv:f:0.0286295  id:f:0.97137
d3c46a82-e680-4d7a-a222-7fd184a9d42f    39635   21      12523   +       <utig1-63255>utig1-63254>utig1-63254>utig1-63254        251460  238870  251454  12295   12706   60      NM:i:411        AS:f:9764.74    dv:f:0.0323469  id:f:0.967653
17989a43-e3cf-44c8-aaa5-e71cda54fb55    24149   24      16805   +       >utig1-69059    1122934 1106055 1122934 16538   17018   60      NM:i:480        AS:f:13584.2    dv:f:0.0282054  id:f:0.971795
cffbe44a-0f54-40c2-b934-8a3bd8371eae    50294   14521   50282   +       <utig1-63254<utig1-63254<utig1-63254>utig1-63255        251460  7       35853   35332   36114   60      NM:i:782        AS:f:30552.9    dv:f:0.0216537  id:f:0.978346
cffbe44a-0f54-40c2-b934-8a3bd8371eae    50294   25      16118   +       >utig1-69059    1122934 1106783 1122934 15875   16273   60      NM:i:398        AS:f:13442.3    dv:f:0.0244577  id:f:0.975542
c94415bf-0411-4287-8bd3-4929e1e58c50    60410   27393   60399   +       <utig1-63254<utig1-63254<utig1-63254>utig1-63255        251460  7       33067   32455   33416   60      NM:i:961        AS:f:26605.7    dv:f:0.0287587  id:f:0.971241
c94415bf-0411-4287-8bd3-4929e1e58c50    60410   24      28966   +       >utig1-69059    1122934 1093899 1122927 28539   29261   60      NM:i:722        AS:f:24133.5    dv:f:0.0246745  id:f:0.975326
e0f42659-f2ce-41d1-9082-611c509ff470    43701   27      27822   +       <utig1-63255>utig1-63254>utig1-63254    251438  223521  251420  27366   28157   60      NM:i:791        AS:f:22526.9    dv:f:0.0280925  id:f:0.971908
e0f42659-f2ce-41d1-9082-611c509ff470    43701   26283   43701   +       <utig1-69059    1122934 0       17552   17104   17714   60      NM:i:610        AS:f:13355.4    dv:f:0.034436   id:f:0.965564
7f3ab207-1829-4b2a-b683-a134ba3efce0    94860   25242   94860   +       <utig1-63254<utig1-63254<utig1-63254>utig1-63255        251460  7       69967   68899   70386   60      NM:i:1487       AS:f:59714.6    dv:f:0.0211264  id:f:0.978874
7f3ab207-1829-4b2a-b683-a134ba3efce0    94860   22      26801   +       >utig1-69059    1122934 1096091 1122916 26549   26960   60      NM:i:411        AS:f:24041.7    dv:f:0.0152448  id:f:0.984755
8075751a-bdb3-491e-aa7e-937618a16634    30537   31      20780   +       >utig1-69059    1122934 1102068 1122934 20427   21039   60      NM:i:612        AS:f:16673.1    dv:f:0.0290888  id:f:0.970911
8075751a-bdb3-491e-aa7e-937618a16634    30537   19213   30537   +       <utig1-63254<utig1-63254<utig1-63254>utig1-63255        251460  7       11333   11006   11526   60      NM:i:520        AS:f:7860.8     dv:f:0.0451154  id:f:0.954885
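For anyone retracing this, a minimal sketch of how the candidate joining reads could be pulled out of a GAF file: group alignments by read and keep the reads whose paths touch both nodes of interest. The function names (`parse_gaf_line`, `bridging_reads`) are made up for illustration and are not part of verkko; the column layout assumed is the standard GAF one (read name, length, start, end, strand, path, ...). The example lines are copied from the mappings above with the trailing tags trimmed.

```python
import re
from collections import defaultdict

def parse_gaf_line(line):
    """Split one GAF record into the fields needed here (whitespace or tab separated)."""
    f = line.split()
    return {
        "read": f[0],
        "read_len": int(f[1]),
        "read_start": int(f[2]),
        "read_end": int(f[3]),
        # Path is a chain of oriented node names, e.g. "<utig1-63255>utig1-63254"
        "nodes": re.findall(r"[<>]([^<>]+)", f[5]),
    }

def bridging_reads(lines, node_a, node_b):
    """Return reads with at least one alignment touching node_a and one touching node_b."""
    nodes_by_read = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        rec = parse_gaf_line(line)
        nodes_by_read[rec["read"]].update(rec["nodes"])
    return sorted(r for r, nodes in nodes_by_read.items()
                  if node_a in nodes and node_b in nodes)

# Three records from the listing above (tags after mapq trimmed).
example_gaf = [
    "4e7c0ea0-f6ba-4ab9-ab0a-a0d0d601975c 26583 151 17652 + >utig1-69059 1122934 1105404 1122934 17322 17638 60",
    "e0f42659-f2ce-41d1-9082-611c509ff470 43701 27 27822 + <utig1-63255>utig1-63254>utig1-63254 251438 223521 251420 27366 28157 60",
    "e0f42659-f2ce-41d1-9082-611c509ff470 43701 26283 43701 + <utig1-69059 1122934 0 17552 17104 17714 60",
]

joining = bridging_reads(example_gaf, "utig1-69059", "utig1-63254")
print(joining)
```

On a full run the same function could be fed the lines of the all.gaf file instead of the inline list.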

Node utig1-63254 ends up as part of utig4-4815, while node utig1-69059 ends up as part of utig4-5641. Mapping to the chrX reference with mashmap shows these are adjacent with a small overlap (consistent with the mappings in the GAF):

utig4-5641 34416084 34390000 34416083 - chrY 44220853 719 29866 99.2108
utig4-5641 34416084 32690000 34389999 - chrY 44220853 54721 1757405 99.4759
utig4-5641 34416084 31920000 33469999 - chrX 107944777 936940 2484527 99.686
utig4-5641 34416084 31450000 31929999 - chrX 107944777 2451309 2930944 99.745
utig4-5641 34416084 29130000 31449999 - chrX 107944777 3157132 5471779 99.8401
utig4-5641 34416084 1110000 29119999 - chrX 107944777 5477993 33496347 99.8846
utig4-5641 34416084 1080000 1129999 + chrX 107944777 33477123 33525432 98.774
utig4-5641 34416084 430000 1099999 - chrX 107944777 33506048 34183720 99.9217
utig4-5641 34416084 410000 429999 - chrX 107944777 34170381 34189293 100
utig4-5641 34416084 340000 389999 - chrX 107944777 34206235 34265858 99.9638
utig4-5641 34416084 0 329999 - chrX 107944777 34277704 34607548 99.9739

utig4-4815 251394 0 251393 - chrX 107944777 34605476 34856848 99.9717
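As a rough check of the adjacency, the reference intervals of the two rows that meet can be compared directly. This is only a sketch: the helper names are hypothetical, the column order assumed is the one shown above (query, length, start, end, strand, reference, length, start, end, identity), and the result is approximate because it does not account for whether the reported end coordinates are inclusive.

```python
def parse_mashmap_row(line):
    """Pick the query name and reference interval out of one mashmap row."""
    f = line.split()
    return {"query": f[0], "ref": f[5], "ref_start": int(f[7]), "ref_end": int(f[8])}

def ref_overlap(row_a, row_b):
    """Positive value: the intervals overlap by that many bases.
    Negative value: they are separated by a gap of that size.
    None: the rows are on different reference sequences."""
    if row_a["ref"] != row_b["ref"]:
        return None
    return (min(row_a["ref_end"], row_b["ref_end"])
            - max(row_a["ref_start"], row_b["ref_start"]))

# Last chrX row of utig4-5641 and the single row of utig4-4815, from the listing above.
row_5641 = parse_mashmap_row("utig4-5641 34416084 0 329999 - chrX 107944777 34277704 34607548 99.9739")
row_4815 = parse_mashmap_row("utig4-4815 251394 0 251393 - chrX 107944777 34605476 34856848 99.9717")

overlap = ref_overlap(row_5641, row_4815)
print(overlap)
```

With these two rows the intervals overlap by roughly 2 kb on chrX.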

However, these nodes are in two disconnected components in the graph. Would be good to trace why these nodes aren't being joined/resolved by the ONT step.

skoren commented 2 years ago

There is also a second gap 3 Mb away in the latest corrected assembly that was not present in my (admittedly old) uncorrected assembly (/data/rautiainenma/CHM1_test_20210930). The gap is in the HiFi graph between nodes utig1-25789 and utig1-34476. I only see one ONT read joining them in the alignments (db77a086-4d9d-48dd-9454-e80301a870bf). However, in the old run there is another read (55419a7f-5227-4546-8345-d6de25abe0b5) that connects the two dead ends. In the latest run, as far as I can see, half of that second read is simply unmapped. Not sure why the mapping is lost. Here are the mappings in the old run:

db77a086-4d9d-48dd-9454-e80301a870bf  121561  36800   121552  +       >19541  561967  0       84780   83039   85875   60      NM:i:2836       AS:f:75308.1    dv:f:0.0330247        id:f:0.966975
db77a086-4d9d-48dd-9454-e80301a870bf  121561  66      33671   +       <70316  110851  77220   110851  32893   34066   60      NM:i:1173       AS:f:29698.9    dv:f:0.0344332        id:f:0.965567

55419a7f-5227-4546-8345-d6de25abe0b5  82816   39465   82800   +       >19541  561967  0       43295   42255   43775   60      NM:i:1520       AS:f:38273.4    dv:f:0.034723 id:f:0.965277
55419a7f-5227-4546-8345-d6de25abe0b5  82816   61      36342   +       <70316  110851  74488   110851  35563   36739   60      NM:i:1176       AS:f:32364.9    dv:f:0.0320096        id:f:0.96799

and in the new run:

db77a086-4d9d-48dd-9454-e80301a870bf    121561  36800   121553  +       <utig1-34476    561967  0       84781   83138   85834   60      NM:i:2696    AS:f:66797.6     dv:f:0.0314095  id:f:0.968591
db77a086-4d9d-48dd-9454-e80301a870bf    121561  66      33671   +       >utig1-25789    110849  77220   110849  32946   34043   60      NM:i:1097    AS:f:26299       dv:f:0.032224   id:f:0.967776

55419a7f-5227-4546-8345-d6de25abe0b5    82816   61      36342   +       >utig1-25789    110849  74488   110849  35703   36711   60      NM:i:1008    AS:f:29567.7     dv:f:0.0274577  id:f:0.972542
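
Since node names differ between the two runs, one way to spot reads like this is to compare, per read, the fraction of the read covered by alignments in each run. The sketch below does that under the same GAF column assumptions as the listings above; `read_coverage` and `reads_with_lost_coverage` are hypothetical helper names and the 0.25 threshold is arbitrary.

```python
from collections import defaultdict

def read_coverage(lines):
    """Fraction of each read covered by the union of its aligned intervals."""
    spans = defaultdict(list)
    lengths = {}
    for line in lines:
        f = line.split()
        if len(f) < 4:
            continue
        lengths[f[0]] = int(f[1])
        spans[f[0]].append((int(f[2]), int(f[3])))
    coverage = {}
    for read, intervals in spans.items():
        covered, cur_end = 0, 0
        for start, end in sorted(intervals):
            start = max(start, cur_end)  # skip the part already counted
            if end > start:
                covered += end - start
                cur_end = end
        coverage[read] = covered / lengths[read]
    return coverage

def reads_with_lost_coverage(old_lines, new_lines, min_drop=0.25):
    """Reads whose covered fraction fell by at least min_drop between runs."""
    old_cov, new_cov = read_coverage(old_lines), read_coverage(new_lines)
    return sorted(r for r, c in old_cov.items()
                  if c - new_cov.get(r, 0.0) >= min_drop)

# Read 55419a7f in the old and new runs (tags after mapq trimmed).
old_run = [
    "55419a7f-5227-4546-8345-d6de25abe0b5 82816 39465 82800 + >19541 561967 0 43295 42255 43775 60",
    "55419a7f-5227-4546-8345-d6de25abe0b5 82816 61 36342 + <70316 110851 74488 110851 35563 36739 60",
]
new_run = [
    "55419a7f-5227-4546-8345-d6de25abe0b5 82816 61 36342 + >utig1-25789 110849 74488 110849 35703 36711 60",
]

lost = reads_with_lost_coverage(old_run, new_run)
print(lost)
```

For this read the covered fraction drops from about 96% in the old run to about 44% in the new one, which matches the missing second alignment.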

skoren commented 2 years ago

Resolved in the beta version.