maickrau / GraphAligner

MIT License

Overlap between nodes is too big #45

Closed fjruizruano closed 3 years ago

fjruizruano commented 3 years ago

Hi,

I am trying to map 4 ONT reads against a graph generated with hifiasm, using this command:

$ GraphAligner -g hifiasm.hic.p_utg.gfa -f nanopore_4reads.fastq -a map.gaf -x vg

However, it stops after a few seconds with this message:

GraphAligner Branch master commit 02c8e2628bba16425dc58cdf67199319f0a7a304 2021-04-23 09:06:36 -0400
GraphAligner Branch master commit 02c8e2628bba16425dc58cdf67199319f0a7a304 2021-04-23 09:06:36 -0400
Load graph from hifiasm.hic.p_utg.gfa
Error in the graph: Overlap between nodes utg011657l and utg010083l is too big. Fix the overlap to be smaller than both nodes
Error in the graph: Overlap between nodes utg011657l and utg010083l is too big. Fix the overlap to be smaller than both nodes

Any help to solve this issue?

Thanks in advance, Francisco.

maickrau commented 3 years ago

Hi, this happens when there is an edge between two nodes whose overlap is as big as (or bigger than) one of the nodes, so the graph is not valid. You can find the lengths of the nodes with:

grep -P '^S\tutg011657l\t' < hifiasm.hic.p_utg.gfa | cut -f 3 | wc -c
grep -P '^S\tutg010083l\t' < hifiasm.hic.p_utg.gfa | cut -f 3 | wc -c

and the overlap with:

grep -P '^L' < hifiasm.hic.p_utg.gfa | grep -P '\tutg010083l\t' | grep -P '\tutg011657l\t'

The overlap (sixth column) should be smaller than the length of both nodes. You can try to manually fix the edge overlap to be smaller than both nodes (e.g., one less than the length of the smaller node). This might bias the alignment slightly in that region.
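
The same comparison can be run over every edge in the graph at once. The following is a minimal sketch, not part of GraphAligner: `check_overlaps` is a hypothetical helper name, and it assumes a tab-separated GFA1 file in which each S record carries its sequence in the third column and each overlap is written as a single `<n>M` CIGAR. Records that do not fit those assumptions (for example sequences given as `*`) would need separate handling.

```shell
# Hypothetical helper (not a GraphAligner command): list every edge whose
# overlap is not strictly smaller than both of the nodes it connects.
# Assumes tab-separated GFA1 with overlaps written as a single "<n>M" CIGAR.
check_overlaps() {
  # The file is passed twice so that node lengths are known before edges are checked.
  awk -F'\t' '
    # First pass: record the sequence length of every S (segment) record.
    NR == FNR { if ($1 == "S") len[$2] = length($3); next }
    # Second pass: compare each L (link) overlap with the shorter of its two nodes.
    $1 == "L" && $6 ~ /^[0-9]+M$/ {
      ov = substr($6, 1, length($6) - 1) + 0
      min = (len[$2] < len[$4]) ? len[$2] : len[$4]
      if (ov >= min) print $2, $4, ov, min
    }
  ' "$1" "$1"
}
```

Calling `check_overlaps hifiasm.hic.p_utg.gfa` prints one line per offending edge: the two node names, the overlap, and the length of the shorter node. An empty output means no edge failed the check under the assumptions above.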

fjruizruano commented 3 years ago

Thanks a lot! I will try it. Best.

dcopetti commented 1 year ago

Hello, I am getting the same error, but the overlap is smaller than the length of both nodes. I start from a Hifiasm *p_utg_edit.gfa file, then run

GraphAligner -t 100 -g assembly.p_utg.gfa -f subreads.fa.gz -a aln_reads.gaf -x dbg

and I get:

Error in the graph: Overlap between nodes utg007275l and utg023862l fully contains one of the nodes. Fix the overlap to be strictly smaller than both nodes

so I do:

$ grep  'utg007275l' < Cjam_HiC_Hifi_CCS-combined-hifiasm-l3.hic.p_utg_edit.gfa | cut -f 3 | wc -c
113841
$ grep  'utg023862l' < Cjam_HiC_Hifi_CCS-combined-hifiasm-l3.hic.p_utg_edit.gfa | cut -f 3 | wc -c
15631
$ grep -P '^L' <Cjam_HiC_Hifi_CCS-combined-hifiasm-l3.hic.p_utg_edit.gfa| grep -P 'utg023862l' | grep -P 'utg007275l'
L       utg007275l      +       utg023862l      +       15613M  L1:i:98036
L       utg023862l      -       utg007275l      -       15609M  L1:i:3

Is that correct? In both lines the 6th column has an M value that is lower than the length of the shortest node.
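
One thing worth checking in the commands above: `grep 'utg007275l'` without the `^S` anchor matches every record that mentions the node name, including the L lines shown in the output, so `cut -f 3 | wc -c` also counts the `+`/`-` field of those lines. `wc -c` counts newline characters as well. The reported numbers may therefore be larger than the real node lengths. The following is a minimal sketch of a stricter length check; `node_length` is a hypothetical helper name, and it assumes a tab-separated GFA1 file with the sequence in the third column of each S record.

```shell
# Hypothetical helper: print the sequence length of a single node.
# $1 = GFA file, $2 = node name. Only the S record whose name field matches
# exactly is used, so L or other records naming the node are ignored,
# and no trailing newline is counted.
node_length() {
  awk -F'\t' -v name="$2" '$1 == "S" && $2 == name { print length($3) }' "$1"
}
```

For example, `node_length Cjam_HiC_Hifi_CCS-combined-hifiasm-l3.hic.p_utg_edit.gfa utg023862l` would print the length of that node alone, which can then be compared with the overlaps in the sixth column of the L lines.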

Also, I am aligning PacBio CLR reads (around 90% accurate, 20-100 kb in length). Do I need to change any parameters for the lower accuracy? Thanks!

maickrau commented 1 year ago

Could you share the graph?

vpymerel commented 1 year ago

@dcopetti if you found a solution, please share it: I am also trying to use GraphAligner on hifiasm output and encountering similar problems.

dcopetti commented 1 year ago

Hello all, apologies, I had to focus on other projects. After hearing the feedback on this and #81, I will go for the Verkko option, as suggested.