rrwick / Bandage

a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
http://rrwick.github.io/Bandage/
GNU General Public License v3.0

non-spades FASTG is not supported #10

Open d-cameron opened 9 years ago

d-cameron commented 9 years ago

The FASTG parsing is currently tightly coupled to SPAdes output. Simple nodes with no intra-sequence branching fail to load. This appears to be due to SPAdes-specific parsing of node names and explicit checks for + and - in assemblygraph.cpp

#FASTG:begin:version=1.0;
>ACCCTAACCCTAACCCTAACCCTAA_10020:CCCTAACCCTAACCCTAACCCTAAC_10021:start=10020,end=10020,weight=38,reference=1;
ACCCTAACCCTAACCCTAACCCTAA
>CCCTAACCCTAACCCTAACCCTAAC_10021:start=10021,end=10021,weight=38,reference=1;
CCCTAACCCTAACCCTAACCCTAAC
#FASTG:end;

Are there any plans to support additional file formats (such as the paired .fasta/.dot files used by ABySS)? What format would you recommend new assemblers export to?

Cheers, Daniel

rrwick commented 9 years ago

Daniel,

That's correct - the FASTG output accepted by Bandage is really the SPAdes flavour of FASTG, which is conveniently also used by the MEGAHIT assembler. It uses node names which look like this: NODE_1_length_6070_cov_43.3434. Bandage chops up the name using the underscores and expects to find the node number in the second position and the node coverage in the sixth. So yes, that's why your example file won't load.
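For illustration, the underscore-splitting described above can be sketched as follows. This is a minimal Python sketch, not Bandage's actual C++ code: the function name `parse_spades_node_name` is hypothetical, and the `NODE` prefix check and the reading of the length from the fourth field are assumptions based only on the example name shown above.

```python
def parse_spades_node_name(name):
    """Split a SPAdes-style node name on underscores.

    Returns (node_number, length, coverage), or raises ValueError
    if the name does not follow the NODE_<n>_length_<l>_cov_<c> layout.
    """
    parts = name.split("_")
    # Assumption: a valid name has at least six fields and starts with NODE.
    if len(parts) < 6 or parts[0] != "NODE":
        raise ValueError(f"not a SPAdes-style node name: {name!r}")
    node_number = int(parts[1])  # second field: node number
    length = int(parts[3])       # fourth field: length (assumed)
    coverage = float(parts[5])   # sixth field: coverage
    return node_number, length, coverage
```

A name such as `ACCCTAACCCTAACCCTAACCCTAA_10020` from the example file splits into only two fields, so a parser of this shape rejects it.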

I am definitely interested in expanding Bandage to work with more formats. I could look into non-SPAdes FASTG files, though they don't appear to be widely used. ABySS has also been on my list of formats to check out, but I haven't yet. I am also interested in GFA, though the format doesn't appear to be finalised. And so I'm afraid I don't have a solid recommendation for you. You could either mimic an existing format that Bandage supports (like SPAdes-flavoured FASTG) or use a different one. Adding Bandage support isn't hard, as long as these features are clear:

Ryan

fxia22 commented 9 years ago

@rrwick Hi Ryan, I am interested in implementing support for the GFA format and for a barcoded FASTG format (something I am working on), and I am willing to contribute.

I think supporting GFA would be great, as we could then visualize the results of string graph-based assemblers. What do you think?

Fei

rrwick commented 9 years ago

Fei,

Bandage now has GFA support! Give it a try and let me know if you have any issues. GFA does seem to be the most-used graph format out there, so I'm happy to encourage others to use it as well: https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md
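For an assembler that wants to export GFA, a minimal Python sketch of a writer might look like this. It assumes the GFA 1 record layout (tab-separated `H`, `S` and `L` lines with a CIGAR-style overlap); the function `to_gfa` and the node names `10020` and `10021` are hypothetical, and the 24-base overlap is simply what the two 25-base sequences from the example at the top of this issue share.

```python
def to_gfa(segments, links):
    """Build GFA 1 text from segments and links.

    segments: iterable of (name, sequence)
    links: iterable of (from_name, from_orient, to_name, to_orient, overlap)
    """
    lines = ["H\tVN:Z:1.0"]  # header line with the format version
    for name, sequence in segments:
        lines.append(f"S\t{name}\t{sequence}")  # one segment per node
    for from_name, from_orient, to_name, to_orient, overlap in links:
        # Overlap is written as a CIGAR string of matches, e.g. 24M.
        lines.append(
            f"L\t{from_name}\t{from_orient}\t{to_name}\t{to_orient}\t{overlap}M"
        )
    return "\n".join(lines) + "\n"


# The two nodes from the FASTG example, with hypothetical names.
segments = [
    ("10020", "ACCCTAACCCTAACCCTAACCCTAA"),
    ("10021", "CCCTAACCCTAACCCTAACCCTAAC"),
]
links = [("10020", "+", "10021", "+", 24)]
gfa_text = to_gfa(segments, links)
```

Writing `gfa_text` to a `.gfa` file gives a four-line graph with two segments joined by one link.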

What about barcoded FASTG? Could you explain that format to me?

Ryan

fxia22 commented 9 years ago

Ryan,

I just checked that out. The GFA support looks great! That was my main concern. Barcoded FASTG is for my own project, and the specification is not finalized yet. The idea is that 10x Genomics sequencing attaches a barcode to each read, and we want to show those barcodes on the assembly graph. (That could help untangle the graph, clean spurious edges, scaffold contigs, etc.) I do have a demo.

However, I guess that is a minority need, so don't bother adding it to the main branch.

Fei

rrwick commented 9 years ago

That looks cool! Keep me posted as you develop further - I'd like to have a play with it.