Open ekg opened 6 years ago
This is an issue with the paths in GFAKluge. If we restrict the same GFA to only header/segment/link lines, the same parser reads the file in 3 seconds.
grep "^S\|^L\|^H" MHC.fa.gfa > hsl_lines.gfa
time ./gfak stats hsl_lines.gfa
Number of nodes: 224873
Number of edges: 643978
Number of links: 643978
Number of containments: 0
Total graph length in basepairs: 9764108
N50: 26444
N90: 123
L50: 58
L90: 7371
real 0m3.992s
user 0m3.594s
sys 0m0.353s
Otherwise it exceeds the length of time I'm willing to wait for it to finish. I'm hunting down where this goes wrong but I think it's in trying to harmonize the GFA1/GFA2 representations.
This is due to the way I did path/walk harmonization in the paths_as_walks
and walks_as_paths
functions. The right answer is to just rip these out and store Walks as path_elems behind the curtain. I should probably refactor this further to use group_elems for paths, and deprecate the path_elem struct entirely (or have it just shadow the group_elem struct).
The quick fix is to deprecate the W
line type (which only vg has ever used). @ekg how do you feel about this?
Deprecate the W type of that's the quickest fix. No reason to use it if it is only vg that ever implemented. Bring on the 500M strings.
On Fri, Aug 3, 2018, 15:49 Eric T. Dawson notifications@github.com wrote:
This is due to the way I did path/walk harmonization in the paths_as_walks and walks_as_paths functions. The right answer is to just rip these out and store Walks as path_elems behind the curtain. I should probably refactor this further to use group_elems for paths, and deprecate the path_elem struct entirely (or have it just shadow the group_elem struct).
The quick fix is to deprecate the W line type (which only vg has ever used). @ekg https://github.com/ekg how do you feel about this?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1809#issuecomment-410252821, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EZrUPos-4_P7NI7b2EEO22gfmvknks5uNFUBgaJpZM4VtdW9 .
I also don't quite understand why either P or W would cause problems in the parsing. Are we checking that the path is specified only once? Even then it should be a linear scan through the input to convert it to vg format.
I was trying to convert a GFA-formatted graph produced by seqwish to .vg format.
I used
zcat MHC.fa.gfa.gz | vg view -Fv - >MHC.fa.vg
It took almost 10 hours. That's 4 times longer than it took to construct the graph with seqwish!
Could we implement a streaming conversion?
It's not clear to me why reading the graph in as GFA is so much more expensive than reading in a .vg format one. It would seem that we have some kind of quadratic penalty in the GFA parsing.