maickrau / MBG

potential off-by-one in v1.0.3 #1

Closed (ptrebert closed 3 years ago)

ptrebert commented 3 years ago

Hi Mikko,

I created a GFA using MBG v1.0.3 (via bioconda), and GraphAligner aborts the subsequent alignment immediately with the following message:

Error in the graph: Overlap between nodes 916045 and 916046 is too big. Fix the overlap to be smaller than both nodes

The dataset is quite large, but maybe it isn't needed if this is a simple off-by-one error?

>NODE_916046+_length_2557_cov_48.5456
>NODE_916045+_length_1449_cov_13
L       916045  -       916046  +       1450M   ec:i:13
L       916046  -       916045  +       1440M   ec:i:13

Best, Peter

maickrau commented 3 years ago

This happens sometimes when the homopolymer length consensus picks different lengths for the two sides of an edge. For example, if a short node contains just one minimizer and the consensus picks a short homopolymer length in that node but a longer length at the end of a neighboring node, the overlap ends up longer than the short node. This is also why the overlap lengths are not identical between the two edge lines.
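The condition GraphAligner rejects can be checked before alignment. The following is a minimal sketch, not code from MBG or GraphAligner: the function names `parse_gfa` and `find_oversized_overlaps` are hypothetical, and it assumes tab-separated GFA1 input whose link overlaps have the simple form `<n>M` (other CIGAR forms are skipped).

```python
import re

# Only plain "<n>M" overlaps are handled; this is an assumption of the sketch.
OVERLAP_RE = re.compile(r"^(\d+)M$")


def parse_gfa(lines):
    """Collect segment lengths and (from, to, overlap) links from GFA1 lines."""
    seg_len = {}
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            seq = fields[2]
            if seq != "*":
                seg_len[fields[1]] = len(seq)
            else:
                # Sequence omitted: fall back to the optional LN:i: tag.
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        seg_len[fields[1]] = int(tag[5:])
        elif fields[0] == "L" and len(fields) >= 6:
            match = OVERLAP_RE.match(fields[5])
            if match:
                links.append((fields[1], fields[3], int(match.group(1))))
    return seg_len, links


def find_oversized_overlaps(seg_len, links):
    """Return links whose overlap is not strictly smaller than both nodes."""
    bad = []
    for src, dst, overlap in links:
        if src not in seg_len or dst not in seg_len:
            continue  # link refers to a segment we have no length for
        if overlap >= seg_len[src] or overlap >= seg_len[dst]:
            bad.append((src, dst, overlap))
    return bad


if __name__ == "__main__":
    # Node lengths and overlaps taken from the report above; sequences are dummies.
    example = [
        "S\t916045\t" + "A" * 1449,
        "S\t916046\t" + "C" * 2557,
        "L\t916045\t-\t916046\t+\t1450M\tec:i:13",
        "L\t916046\t-\t916045\t+\t1440M\tec:i:13",
    ]
    lengths, edges = parse_gfa(example)
    print(find_oversized_overlaps(lengths, edges))
```

With the numbers from the report, only the first link is flagged: its 1450 bp overlap is not smaller than the 1449 bp node, while the 1440 bp overlap on the second link is.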

You can sidestep this with the --blunt parameter, which creates a graph without edge overlaps. After that, you should clean the graph with vg. If you're building from contigs instead of reads, you might also try --no-hpc to disable homopolymer consensus.

ptrebert commented 3 years ago

Thanks for the explanation. Do you happen to have any empirical recommendations or best practices for cleaning a blunt graph built from HiFi reads?

maickrau commented 3 years ago

I've used vg:

vg view -Fv graph.gfa | vg mod -n -U 100 - | vg view - > blunt-graph.gfa

It will produce graphs with reasonable topologies but it will also remove coverage information from the nodes.
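To confirm that a cleaned graph really has no edge overlaps left, a quick check along these lines may help. It is only a sketch: `non_blunt_links` is a hypothetical helper, and it assumes a blunt link carries `0M` or `*` in the overlap field of a tab-separated GFA1 `L` line.

```python
def non_blunt_links(lines):
    """Return the L lines whose overlap field is neither 0M nor *."""
    offenders = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Field 6 of a GFA1 link line holds the overlap.
        if fields[0] == "L" and len(fields) >= 6 and fields[5] not in ("0M", "*"):
            offenders.append(line.rstrip("\n"))
    return offenders


if __name__ == "__main__":
    import sys

    # Usage: python check_blunt.py blunt-graph.gfa
    with open(sys.argv[1]) as handle:
        for link in non_blunt_links(handle):
            print(link)
```

An empty result means every link in the file is blunt under that assumption.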

ptrebert commented 3 years ago

Thanks

maickrau commented 3 years ago

This is fixed in MBG v1.0.4