marbl / verkko


processGraph "assert edge[0] not in node_cuts" #230

Closed grinning-bat closed 4 months ago

grinning-bat commented 4 months ago

Hi! I've probably met #147 again.

processGraph failed several times (the first failures were because the environment was not being activated, so they are unrelated). When rerun manually, it prints lines such as "can't fix >62706 >62705 due to overlap (wanted 112, overlap 1686)" interspersed with lines such as "mend <37326 <45168"

then:

Traceback (most recent call last):
  File "[redacted]/verkko/env/lib/verkko/scripts/fix_haplogaps.py", line 231, in 
    assert edge[0] not in node_cuts

We are trying to assemble a repeat-rich genome of unknown size. Judging from the ONT-only assembly size, PacBio coverage should be about 10x. ONT is also around 10x. buildGraph.err lists:

estimated average coverage 11.7946
assembly size 1071615368 bp, N50 15182

The size is surprisingly 3-4x smaller than expected.

Versions:

Launching bioconda verkko bioconda 2.0
Using snakemake 7.32.4.

MBG reported by conda: 1.0.16 hdcf5f25_0
MBG was installed manually through conda. I can't tell the actual version, as running MBG --version gives:

MBG Branch  commit
Version: Branch  commit

skoren commented 4 months ago

Can you share your 1-buildGraph and 2-processGraph folders (https://canu.readthedocs.io/en/latest/faq.html#how-can-i-send-data-to-you)?

grinning-bat commented 4 months ago

Done:

63e8a64b010d71d081489783f01d0782  build_graph_1426.tgz
24a723bf5e324064d2d12103aed28414  process_graph_1426.tgz

skoren commented 4 months ago

@maickrau it seems there's a node in the graph that's getting cut (59317) but there is still an edge to it in extra_nocut_edges. So first the edge is added to extra_nocut_edges, and then the node is cut on a subsequent iteration. The graph seems reasonable if you just skip adding the edge (e.g. continue rather than assert). I posted the data on globus under issue230, can you take a look?
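
To make the suggested workaround concrete, here is a minimal sketch of "continue rather than assert". It is not the actual code from fix_haplogaps.py: only the names node_cuts and extra_nocut_edges come from this thread, while the function edges_to_add, the tuple layout of an edge, and the example node IDs other than 59317 are assumptions for illustration.

```python
# Hypothetical sketch, not the real fix_haplogaps.py logic.
# Only node_cuts and extra_nocut_edges are names taken from the thread.

def edges_to_add(extra_nocut_edges, node_cuts):
    """Return the edges whose first node was not cut in a later iteration."""
    kept = []
    for edge in extra_nocut_edges:
        if edge[0] in node_cuts:
            # Previously: assert edge[0] not in node_cuts
            # Skip the stale edge instead of crashing.
            continue
        kept.append(edge)
    return kept


# Made-up example data: node 59317 was cut after its edge was recorded.
node_cuts = {"59317"}
extra_nocut_edges = [("59317", "10001"), ("20002", "30003")]
result = edges_to_add(extra_nocut_edges, node_cuts)
print(result)
```

Skipping silently hides the inconsistency, so this is only a workaround until the conflicting fixes are handled at the source.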

maickrau commented 4 months ago

The haplotype gap fixing had reads supporting two different fixes at the het nodes and the fixes were not compatible, so it crashed when it tried to apply both fixes. Commit 91070e8 now checks if it tries to include conflicting fixes and only includes one of them.
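
As a rough illustration of "only includes one of them", the sketch below accepts fixes greedily and drops any fix that conflicts with one already accepted. This is not the code from commit 91070e8: the function name select_compatible_fixes, the representation of a fix as a name plus the set of nodes it touches, and the definition of a conflict as "shares a node" are all assumptions.

```python
# Hypothetical sketch of choosing one fix among conflicting ones.
# A "fix" here is (name, set_of_nodes_it_modifies); two fixes are assumed
# to conflict when they touch at least one common node.

def select_compatible_fixes(fixes):
    used_nodes = set()
    accepted = []
    for name, nodes in fixes:
        if used_nodes & nodes:
            # Conflicts with an already accepted fix: keep only the first.
            continue
        accepted.append(name)
        used_nodes |= nodes
    return accepted


# Made-up example: fix_a and fix_b both touch node n2, fix_c is independent.
fixes = [
    ("fix_a", {"n1", "n2"}),
    ("fix_b", {"n2", "n3"}),
    ("fix_c", {"n4", "n5"}),
]
chosen = select_compatible_fixes(fixes)
print(chosen)
```

With this ordering the first fix seen wins; a real implementation might instead prefer the fix with more read support.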

Also in general the graph looks very fragmented, probably due to low HiFi coverage. It seems like the average coverage per haplotype might be around 5x or even lower, so getting around 3 times more HiFi data would probably improve the assembly a lot.
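
The per-haplotype estimate can be sanity-checked with simple arithmetic. The sketch below assumes that the 11.7946 figure from buildGraph.err is the total HiFi coverage summed over haplotypes and that the genome is diploid; neither is confirmed in this thread, and the tripling factor is just one reading of "around 3 times more".

```python
# Back-of-the-envelope coverage check under stated assumptions.
estimated_coverage = 11.7946  # reported in buildGraph.err
ploidy = 2                    # assumption: diploid genome

per_haplotype = estimated_coverage / ploidy
print(f"per-haplotype coverage: {per_haplotype:.2f}x")

# Assumption: total HiFi data is tripled.
scale = 3
per_haplotype_after = per_haplotype * scale
print(f"per-haplotype coverage after scaling: {per_haplotype_after:.2f}x")
```

If reads from contaminants take up part of the data, the effective per-haplotype coverage of the target genome would be lower than this estimate.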

grinning-bat commented 4 months ago

Tried it, processGraph works now. Great! Thanks a lot! We aimed at 10x HiFi and 10x ONT, but intracellular parasites steal too much coverage. Probably we will do some additional sequencing indeed.