rrwick / Bandage

a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
http://rrwick.github.io/Bandage/
GNU General Public License v3.0
579 stars 96 forks source link

Read Coverage vs. Read Count, when nodes exporting to .gfa from LastgGraph. #21

Closed Omig12 closed 8 years ago

Omig12 commented 8 years ago

Hi Ryan,

First of all many I'd like to thanks you for this awesome tool. Now to the issue, while working on a LastGraph to GFA parser, when utilizing Bandage 7.1 to corroborate the specification of the connected components reported by the parser, we noticed that when exporting selected nodes from LastGraph file to GFA, the output file generated by Bandage reports appears to be an incorrect value in the GFA sequence's read count optional header.

From what I could gather the GFA sequence's read count (RC:i:###) is being retrieved using LastGraph's _$COVSHORT1 value from the NODE header, instead of the _$NUMBER_OF_SHORTREADS value from the NR header.

LastGraph NODE $NODE_ID $COV_SHORT1 $O_COV_SHORT1 $COV_SHORT2 $O_COV_SHORT2

NR $NODE_ID $NUMBER_OF_SHORT_READS

GFA S RC i # reads assembled into the segment

L/S RC i # reads that support the segment/link

I'm an undergrad student currently working on my first semester of Bioinformatics research, so I might be mistaking the terms used in the documentation for the file formats. Anyways, I hope this is a useful contribution.

rrwick commented 8 years ago

You're definitely right - Bandage has been a bit screwy regarding node depths and GFA.

Side note 1: I have found 'coverage' to be a confusing term, as in other bioinformatics contexts it means 'how much of a sequence was found' or something similar. So in Bandage I had previously used 'read depth' instead, even though assemblers often refer to the same thing as 'coverage'.

When loading a Velvet LastGraph file, Bandage doesn't actually read the NR lines because they aren't always present (I think you need to use -read_trkg yes to get those). So the actual read count is ignored by Bandage - it only uses the$COV_SHORT1 value in the node lines. I think that when Velvet and SPAdes (the two assemblers I'm most familiar with) report a depth (or coverage or whatever you call it), they are really reporting a k-mer depth, not a read depth.

Side note 2: I'm finalising a new release of Bandage now, and this reminded me of the confusion surrounding 'coverage' vs 'read depth' vs 'depth' etc. So I renamed all instances of 'Read depth' in Bandage to just plain 'Depth'. That's more general and hopefully less confusing.

So to improve the issue, I have tried to make Bandage more intelligent regarding how it saves depth in GFA files. It can use RC, KC or FC tags or nothing at all. For a Velvet/SPAdes graph, it will use the KC tag (instead of RC) as I think that's most accurate. If you loaded a GFA graph, Bandage will remember which tag was used in the original file and use the same tag when saving to GFA. And if the file you loaded didn't have any depth info (e.g. when loading a Trinity graph), it won't save any depth info in the GFA.

These changes are now in the development branch of Bandage, so you can build Bandage from source if you'd like access to them. But I'm hoping to have the next release finalised very soon (within a day or two if I don't run into any problems), so it should be available in the master branch and in precompiled binaries soon.

Thanks so much for bringing it up! It reminded me that there were some rough edges I had to smooth out!

rrwick commented 8 years ago

Bandage v0.8.0 is now available and includes these changes to depths in GFA files.