vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

GFA input takes a very long time #1809

Open ekg opened 6 years ago

ekg commented 6 years ago

I was trying to convert a GFA-formatted graph produced by seqwish to .vg format.

I used zcat MHC.fa.gfa.gz | vg view -Fv - >MHC.fa.vg

It took almost 10 hours. That's 4 times longer than it took to construct the graph with seqwish!

vg view -Fv -  25861.28s user 7.82s system 74% cpu 9:39:11.80 total                                                                                                                                                  

Could we implement a streaming conversion?

It's not clear to me why reading the graph in as GFA is so much more expensive than reading in a .vg format one. It would seem that we have some kind of quadratic penalty in the GFA parsing.

edawson commented 6 years ago

This is an issue with the paths in GFAKluge. If we restrict the same GFA to only header/segment/link lines, the same parser reads the file in 3 seconds.

  grep "^S\|^L\|^H" MHC.fa.gfa > hsl_lines.gfa
  time ./gfak stats hsl_lines.gfa  

   Number of nodes: 224873
   Number of edges: 643978
   Number of links: 643978
   Number of containments: 0
   Total graph length in basepairs: 9764108
   N50: 26444
   N90: 123
   L50: 58
   L90: 7371
  real  0m3.992s
  user  0m3.594s
  sys   0m0.353s

Otherwise it exceeds the length of time I'm willing to wait for it to finish. I'm hunting down where this goes wrong but I think it's in trying to harmonize the GFA1/GFA2 representations.

edawson commented 6 years ago

This is due to the way I did path/walk harmonization in the paths_as_walks and walks_as_paths functions. The right answer is to just rip these out and store Walks as path_elems behind the curtain. I should probably refactor this further to use group_elems for paths, and deprecate the path_elem struct entirely (or have it just shadow the group_elem struct).

The quick fix is to deprecate the W line type (which only vg has ever used). @ekg how do you feel about this?

ekg commented 6 years ago

Deprecate the W type of that's the quickest fix. No reason to use it if it is only vg that ever implemented. Bring on the 500M strings.

On Fri, Aug 3, 2018, 15:49 Eric T. Dawson notifications@github.com wrote:

This is due to the way I did path/walk harmonization in the paths_as_walks and walks_as_paths functions. The right answer is to just rip these out and store Walks as path_elems behind the curtain. I should probably refactor this further to use group_elems for paths, and deprecate the path_elem struct entirely (or have it just shadow the group_elem struct).

The quick fix is to deprecate the W line type (which only vg has ever used). @ekg https://github.com/ekg how do you feel about this?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1809#issuecomment-410252821, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EZrUPos-4_P7NI7b2EEO22gfmvknks5uNFUBgaJpZM4VtdW9 .

ekg commented 6 years ago

I also don't quite understand why either P or W would cause problems in the parsing. Are we checking that the path is specified only once? Even then it should be a linear scan through the input to convert it to vg format.