vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

vg view produces invalid .vg file when converting GFA #1326

Closed koadman closed 6 years ago

koadman commented 6 years ago

I have an assembly of the NA12878 human reference sample in GFA format. Running vg view to convert to a .vg file runs to completion (albeit using >300GB RAM) but the resulting file appears to be invalid. No further subcommands can use the file, in particular vg mod -X. The program just exits immediately without error. I have posted the problematic vg file to google drive: https://drive.google.com/open?id=1h2D5GhXzcZT5P7dIdR5Q-GsBI_4L6Oo-

I can also provide the GFA if that would help track down the problem.

ekg commented 6 years ago

The GFA would help as well. You can feed in subsets of the GFA as well. Have you tried any subsets to see if there is a part of it for which it works? By subset I mean something like head -10000 NA12878.gfa | vg view -Fv - >NA12878.vg.

On Fri, Dec 15, 2017, 08:13 Aaron Darling notifications@github.com wrote:

I have an assembly of the NA12878 human reference sample in GFA format. Running vg view to convert to a .vg file runs to completion (albeit using

300GB RAM) but the resulting file appears to be invalid. No further subcommands can use the file, in particular vg mod -X. The program just exits immediately without error. I have posted the problematic vg file to google drive: https://drive.google.com/open?id=1h2D5GhXzcZT5P7dIdR5Q-GsBI_4L6Oo-

I can also provide the GFA if that would help track down the problem.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1326, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EfAh5nb5Orz4DrqmHRNNEIcO1MFHks5tAhwQgaJpZM4RDGQ0 .

ekg commented 6 years ago

How did you make the GFA?

On Fri, Dec 15, 2017, 17:30 Erik Garrison erik.garrison@gmail.com wrote:

The GFA would help as well. You can feed in subsets of the GFA as well. Have you tried any subsets to see if there is a part of it for which it works? By subset I mean something like head -10000 NA12878.gfa | vg view -Fv - >NA12878.vg.

On Fri, Dec 15, 2017, 08:13 Aaron Darling notifications@github.com wrote:

I have an assembly of the NA12878 human reference sample in GFA format. Running vg view to convert to a .vg file runs to completion (albeit using >300GB RAM) but the resulting file appears to be invalid. No further subcommands can use the file, in particular vg mod -X. The program just exits immediately without error. I have posted the problematic vg file to google drive: https://drive.google.com/open?id=1h2D5GhXzcZT5P7dIdR5Q-GsBI_4L6Oo-

I can also provide the GFA if that would help track down the problem.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1326, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EfAh5nb5Orz4DrqmHRNNEIcO1MFHks5tAhwQgaJpZM4RDGQ0 .

edawson commented 6 years ago

The mem usage is a bit frightening; your GFA would be helpful if figuring out whether its the parser or the graph itself that's so large.

If you've got a subest of the GFA containing at least one line from each of the GFA line types that would be useful in case its a parsing thing.

koadman commented 6 years ago

Thanks for the quick replies! I uploaded the gzip'd GFA here: https://drive.google.com/open?id=1T88qYFboENkgX-biUYvQqwx4Th4g6jSl

I haven't tried subsets of the GFA yet, didn't even know that was possible but will give it a go and report back. The GFA was made with the minia assembler from NovaSeq data. @edawson yes the mem usage is high. I'm sure graph complexity is part of the problem because I ran the assembly with conservative graph simplification parameters, but minia uses around 10GB or less to make the assembly so surely there is room to improve on what vg is currently doing.

koadman commented 6 years ago

Reporting back on trying to convert a subset, it seems to fail because there are link references to nodes outside the subset. For example:

zcat NA12878_refasm.gfa.gz | head -n 10000 | vg view -Fv - > NA12878_partial.vg
terminate called after throwing an instance of 'std::runtime_error'
  what():  No node 9522167 in graph
Aborted

Node 9522167 is referenced in the first link statement at the beginning of the file:

H       VN:Z:1.0        kk:i:51
S       0       TGAGCCACTGCGCCTGGCCCCTATATAGTCTTTATAATTATTGTTTAATTCTTTTGAGAAGAACCTATAGATCTTTATAATTATTGTTTCTTTTGAGAAGAACCTTGTTAAGAAAAAAATCACCAAGACTAAGGGAGAACAC
L       0       +       9522167 -       50M
L       0       +       32311492        +       50M
L       0       -       18766258        -       50M
L       0       -       31057768        -       50M
S       1       GGAGGTTGCAGTGAGCCAAGATCGCACCATTGCACCCCAGCCTGGGCAACAGAGCGAGGCTCCGTCTCAAAAAATAAAATAAAATTAGATTAAATTAAATTAAATTCCCATACATATACTTGCCAGGATATACCTTGAAAAA
L       1       +       24158645        -       50M
L       1       -       3600299 -       50M
L       1       -       3600367 -       50M

I guess it is only possible to convert subsets that correspond to disjoint connected components of the graph?

adamnovak commented 6 years ago

OK, the problem here is that the 2.9 GB "vg" file you have generated is actually a GFA file. Did you include the -v option in the vg view command you used to generate it?

koadman commented 6 years ago

Thanks Adam, you are right, happily this is a simple case of user error. I seem to have left out the v option this time and naively assumed that because the file no longer looked like an ascii GFA it had been converted. Rerunning with -v produced a .vg file that I was able to split with vg mod -X 1024 and an xg index is now running.

Some closing thoughts: some kind of error message could help avoid user confusion, and the format conversion from GFA to vg continues to be a real memory hog. Possibly these could go into new github issues.

Thanks for your help guys.