rrwick / Bandage

a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
http://rrwick.github.io/Bandage/
GNU General Public License v3.0

Overlaps counted twice in "Total length (no overlaps)" calculation? #111

Open jflot opened 2 years ago

jflot commented 2 years ago

I am wondering how Bandage calculates the "Total length (no overlaps)" statistic. For example, here is a toy GFA with only four contigs and two links:

    H	VN:Z:1.0
    S	contig1	AAAAAAAAAA
    S	contig2	AAAAACCCCC
    S	contig3	GGGGGGGGGG
    S	contig4	GGGGGTTTTT
    L	contig1	+	contig2	+	5M
    L	contig3	+	contig4	+	5M

Each contig is 10 bp long, so the total length without overlaps should (in my opinion) be 30 bp, but Bandage reports 20 bp, i.e. it seems that each overlap is counted twice. After using the "Merge all possible nodes" tool, however, the total length becomes 30 bp as expected.

Another example (with one extra contig and one extra link):

    H	VN:Z:1.0
    S	contig1	AAAAAAAAAA
    S	contig2	AAAAACCCCC
    S	contig3	GGGGGGGGGG
    S	contig4	GGGGGTTTTT
    S	contig5	GGGGGGGGGT
    L	contig1	+	contig2	+	5M
    L	contig3	+	contig4	+	5M
    L	contig3	+	contig5	+	9M

Here the total length (no overlaps) returned by Bandage is 17 bp... Any clue?
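For reference, the 30 bp expectation for the first example corresponds to summing all segment lengths and subtracting each link's overlap once. Here is a minimal Python sketch of that reading. The names `parse_gfa` and `total_length_counting_each_link_once` are hypothetical helpers written for this illustration, not Bandage functions, and the parser assumes every overlap is a single `<n>M` CIGAR operation, as in the toy graphs.

```python
import re

# Toy graphs from this issue. Real GFA is tab-delimited; str.split()
# accepts both tabs and spaces, so spaces are used here for readability.
EXAMPLE_1 = """
H VN:Z:1.0
S contig1 AAAAAAAAAA
S contig2 AAAAACCCCC
S contig3 GGGGGGGGGG
S contig4 GGGGGTTTTT
L contig1 + contig2 + 5M
L contig3 + contig4 + 5M
"""

EXAMPLE_2 = EXAMPLE_1 + """
S contig5 GGGGGGGGGT
L contig3 + contig5 + 9M
"""


def parse_gfa(text):
    """Return ({segment name: length}, [(from, to, overlap), ...])."""
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S":
            segments[fields[1]] = len(fields[2])
        elif fields[0] == "L":
            # Assumption: the overlap is a single match operation such as "5M".
            match = re.fullmatch(r"(\d+)M", fields[5])
            if match is None:
                raise ValueError(f"unsupported CIGAR: {fields[5]}")
            links.append((fields[1], fields[3], int(match.group(1))))
    return segments, links


def total_length_counting_each_link_once(text):
    """Sum of segment lengths minus the overlap of every link, once per link."""
    segments, links = parse_gfa(text)
    return sum(segments.values()) - sum(overlap for _, _, overlap in links)


print(total_length_counting_each_link_once(EXAMPLE_1))  # 40 - 10 = 30
print(total_length_counting_each_link_once(EXAMPLE_2))  # 50 - 19 = 31
```

Under this reading the first graph gives 30 bp, which matches the value obtained after "Merge all possible nodes", whereas Bandage reports 20 bp and 17 bp for the unmerged graphs.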

odethier-ulb commented 2 years ago

[Image: overlap]

I think that the length is correct. For instance, if we take the second example, the total length without overlap is computed by summing the 'white' part of each sequence (the coloured parts are the overlaps), which gives 17.
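One per-node rule that reproduces both numbers reported in this thread (20 and 17) is to subtract, from each segment, the largest overlap among the links that touch it. The Python sketch below illustrates that rule. It is an assumption that fits the two examples, not code taken from Bandage's source, and the helper names are hypothetical. It also assumes every overlap is a single `<n>M` CIGAR operation.

```python
import re

# Toy graphs from this issue as (segments, links) in GFA-like text.
# Real GFA is tab-delimited; str.split() handles tabs and spaces alike.
EXAMPLE_1 = """
S contig1 AAAAAAAAAA
S contig2 AAAAACCCCC
S contig3 GGGGGGGGGG
S contig4 GGGGGTTTTT
L contig1 + contig2 + 5M
L contig3 + contig4 + 5M
"""

EXAMPLE_2 = EXAMPLE_1 + """
S contig5 GGGGGGGGGT
L contig3 + contig5 + 9M
"""


def white_lengths(text):
    """Return {segment: length minus the largest overlap of any link touching it}."""
    lengths, max_overlap = {}, {}
    link_lines = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S":
            lengths[fields[1]] = len(fields[2])
            max_overlap[fields[1]] = 0
        elif fields[0] == "L":
            link_lines.append(fields)
    for fields in link_lines:
        # Assumption: the overlap is a single match operation such as "5M".
        overlap = int(re.fullmatch(r"(\d+)M", fields[5]).group(1))
        for name in (fields[1], fields[3]):
            max_overlap[name] = max(max_overlap[name], overlap)
    return {name: lengths[name] - max_overlap[name] for name in lengths}


def total_white_length(text):
    """Sum of the non-overlapping ('white') part of every segment."""
    return sum(white_lengths(text).values())


print(white_lengths(EXAMPLE_2))
# {'contig1': 5, 'contig2': 5, 'contig3': 1, 'contig4': 5, 'contig5': 1}
print(total_white_length(EXAMPLE_1))  # 20
print(total_white_length(EXAMPLE_2))  # 17
```

With this rule an overlap shared by two segments is removed from both of them, which is why the result is lower than the 30 bp obtained after merging.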