maickrau / GraphAligner

MIT License
256 stars 30 forks source link

Support GFA with sequences from a FASTA file [feature request] #12

Closed mrvollger closed 4 years ago

mrvollger commented 4 years ago

I think it would be great if the sequences of the GFA file could be contained in a fasta file with the same prefix.

Basically if the GFA has * in place of the sequence tag look for a .fasta or .fa with the same prefix and load in the sequence from that.

This is basically the same request as described here: https://github.com/rrwick/Bandage/issues/23

Thanks! Mitchell

ekg commented 4 years ago

I think this would warrant a script to out the FASTA sequences in the GFA. What's the advantage of keeping them separate?

On Fri, Dec 6, 2019, 22:37 Mitchell Robert Vollger notifications@github.com wrote:

I think it would be great if the sequences of the GFA file could be contained in a fasta file with the same prefix.

Basically if the GFA has * in place of the sequence tag look for a .fasta or .fa with the same prefix and load in the sequence from that.

This is basically the same request as described here: rrwick/Bandage#23 https://github.com/rrwick/Bandage/issues/23

Thanks! Mitchell

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/maickrau/GraphAligner/issues/12?email_source=notifications&email_token=AABDQEKOCDZS3Y5CHITWUGLQXLA3PA5CNFSM4JXB2K52YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4H6YJVOQ, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJYPMMPPKMLBZU3KDTQXLA3PANCNFSM4JXB2K5Q .

mrvollger commented 4 years ago

Hi Erik,

I was thinking for the input graph not any output. For example when Canu generates a GFA it's "S" lines just have "*" for seqs and there is an matching fasta file that contains the sequences. I think it would be useful to be able to directly use Canu's output without having to run an additional script to make a GFA that has the seq fields populated.

Does that make sense?

Best, Mitchell

ekg commented 4 years ago

In my experience, Canu's output isn't even correct. It's overlaps aren't accurate and must be recalculated.

I think a preprocessing step is much nicer than putting this into GraphAligner.

Alternatively, Canu could produce complete and correct GFA output. Have you asked them about this?

On Sat, Dec 7, 2019, 15:06 Mitchell Robert Vollger notifications@github.com wrote:

Hi Erik,

I was thinking for the input graph not any output. For example when Canu generates a GFA it's "S" lines just have "*" for seqs and there is an matching fasta file that contains the sequences. I think it would be useful to be able to directly use Canu's output without having to run an additional script to make a GFA that has the seq fields populated.

Does that make sense?

Best, Mitchell

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maickrau/GraphAligner/issues/12?email_source=notifications&email_token=AABDQEIMGEEI7V4BWDITKFLQXOUX7A5CNFSM4JXB2K52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGGHR3I#issuecomment-562854125, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIDIOXSBUQDBZQMC23QXOUX7ANCNFSM4JXB2K5Q .

mrvollger commented 4 years ago

Fair enough. However, given that multiple assemblers (Canu, Abyss) output GFAs with * and that is valid by the GFA spec, I think there should be some error throw when the input GFA is missing the seq field. The current output is:

GraphAligner bioconda 1.0.10-
GraphAligner bioconda 1.0.10-
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 3)
Aborted (core dumped)

and it took me a while to track down the input issue. Would you agree with that?

Also out of curiosity, do you know if the issues you just mentioned with the Canu GFA also extend to the {prefix}.unitigs.gfa file? I knew that {prefix}.contigs.gfa had issues and it was removed in 1.9 but not that unitigs might have issues.

maickrau commented 4 years ago

Hi Mitchell,

The crash is definitely a bug and it should output something more helpful. I'd prefer a preprocessing approach to handle the * nodes. Do you know if any tools other than Canu & Abyss output this kind of graph?

maickrau commented 4 years ago

Commit 0b38cec prints a more helpful error message for nodes with *

mrvollger commented 4 years ago

Thanks! I am not aware of any other assemblers that use the * but there could be.

mrvollger commented 4 years ago

FYI, I have found that FALCON also uses a "*" in place of contigs in asm.gfa, but I think the error message is a good solution so I will close this.