GFA Paths - Githubissues

fbemm commented 7 years ago

I just had a look at a pairwise alignment of two genomes. Reveal is able to identify shared segments, but what I am actually missing in the GFA files are paths ("Path an ordered list of oriented segments, where each consecutive pair of oriented segments are supported by a link record." from https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md). For most downstream tools this is the most important information. But maybe I misunderstood reveal here. Any ideas?

jasperlinthorst commented 7 years ago

Indeed, I don't output paths at the moment. Instead, for every node (or segment) I add some attributes that specify the genomes in which the segment was observed, together with the location/offset within the genome/sequence (have a look at the ORI and OFFSETS tags for each segment).

For example, consider the following piece of GFA:

H ORI:Z:g1.fasta;g2.fasta;g3.fasta;g4.fasta
S 10 ATCTGATTGGTAC ORI:Z:0;3 OFFSETS:Z:325307;325307

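This means that segment 10 is observed in g1.fasta at position 325307 and in g4.fasta at position 325307: the values in the segment's ORI tag (0 and 3) are zero-based indices into the genome list in the header, and the OFFSETS tag gives the matching positions in the same order.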
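To make the tag layout concrete, here is a minimal Python sketch that decodes the example above. It assumes the fields are tab-separated (as in GFA files) and that the ORI and OFFSETS tags always look like they do in this example. The function names are made up for illustration and are not part of reveal.

```python
def parse_header_genomes(header_line):
    """Return the genome names listed in the ORI tag of an H line."""
    for field in header_line.rstrip("\n").split("\t")[1:]:
        if field.startswith("ORI:Z:"):
            return field[len("ORI:Z:"):].split(";")
    return []


def parse_segment(segment_line, genomes):
    """Return (segment id, sequence, {genome name: offset}) for an S line."""
    fields = segment_line.rstrip("\n").split("\t")
    seg_id, sequence = fields[1], fields[2]
    # Optional tags have the form NAME:TYPE:VALUE.
    tags = {}
    for field in fields[3:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value
    # Assumption: ORI holds zero-based indices into the header genome list,
    # and OFFSETS holds one position per ORI entry, in the same order.
    indices = [int(i) for i in tags["ORI"].split(";")]
    offsets = [int(o) for o in tags["OFFSETS"].split(";")]
    positions = {genomes[i]: off for i, off in zip(indices, offsets)}
    return seg_id, sequence, positions


header = "H\tORI:Z:g1.fasta;g2.fasta;g3.fasta;g4.fasta"
segment = "S\t10\tATCTGATTGGTAC\tORI:Z:0;3\tOFFSETS:Z:325307;325307"

genomes = parse_header_genomes(header)
seg_id, sequence, positions = parse_segment(segment, genomes)
print(seg_id, positions)
```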

Given that reveal will always output a directed acyclic graph, paths should be trivial to extract from the segment annotations (for example, "reveal extract" traverses the graph to reconstruct the input sequence).
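One simple way to recover paths from these annotations, without traversing the graph at all, is to group segments per genome and sort them by offset. The sketch below is only an illustration under stated assumptions: each segment occurs at most once per genome, every segment is in forward orientation, and the segment data is a made-up toy example. It is not how "reveal extract" is implemented.

```python
from collections import defaultdict


def paths_from_annotations(segments):
    """Build {genome: [segment ids in genome order]}.

    `segments` is an iterable of (segment id, {genome name: offset}).
    """
    per_genome = defaultdict(list)
    for seg_id, positions in segments:
        for genome, offset in positions.items():
            per_genome[genome].append((offset, seg_id))
    # Sorting by offset gives the order in which the genome visits segments.
    return {genome: [seg_id for _, seg_id in sorted(entries)]
            for genome, entries in per_genome.items()}


def format_path_line(name, seg_ids):
    """Format a GFA 1.0 P line, assuming forward orientation throughout."""
    steps = ",".join(f"{seg_id}+" for seg_id in seg_ids)
    return f"P\t{name}\t{steps}\t*"


# Hypothetical toy graph: two genomes share segments 1 and 2, then diverge.
toy_segments = [
    ("2", {"g1.fasta": 4, "g2.fasta": 4}),
    ("1", {"g1.fasta": 0, "g2.fasta": 0}),
    ("3", {"g1.fasta": 9}),
    ("4", {"g2.fasta": 9}),
]

toy_paths = paths_from_annotations(toy_segments)
for genome_name in sorted(toy_paths):
    print(format_path_line(genome_name, toy_paths[genome_name]))
```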

However, I can imagine that it could be easier for downstream analysis, so if I find some time I will consider adding the paths to the GFA file as well.

Hope this helps.

Cheers, Jasper

jasperlinthorst commented 7 years ago

BTW, you should now be able to index larger (or more) genomes by specifying the --64 flag ("reveal --64 align ..."). Reveal will then use 64-bit suffix arrays; however, this comes at the cost of even higher memory usage, of course...

fbemm commented 7 years ago

Hey Jasper,

I think it would be useful if you output everything in GFA 2.0 format right away (https://github.com/GFA-spec/GFA-spec/blob/master/GFA2.md). Anyway, GFA 1.0 with paths would be greatly appreciated as well!

Thanks Felix

jasperlinthorst commented 7 years ago

Hi Felix, check the latest commit. If you specify "reveal align --paths ...", the GFA 1.0 output should contain the paths (named by the original input file) corresponding to all genomes/sequences that are encoded in the graph. On my test cases this seemed to work, but I haven't tested it thoroughly yet... Let me know if this is what you needed.
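For anyone consuming the resulting file, a GFA 1.0 path record can be read with a few lines of Python. This is a minimal sketch based on the P line layout in the GFA 1.0 spec (record type, path name, comma-separated oriented segments, overlaps). The example line is invented for illustration and is not actual reveal output.

```python
def parse_path_line(line):
    """Return (path name, [(segment id, orientation), ...]) for a P line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "P":
        raise ValueError("not a path line")
    name = fields[1]
    # Each step is a segment id followed by '+' or '-'.
    steps = [(step[:-1], step[-1]) for step in fields[2].split(",")]
    return name, steps


# Hypothetical P line, named after an input file as described above.
example_line = "P\tg1.fasta\t1+,2+,10+\t*"
path_name, path_steps = parse_path_line(example_line)
print(path_name, path_steps)
```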

jasperlinthorst / reveal

GFA Paths #8