vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Make all the VG commands using VGSet know how to read a file of filenames as input #234

Open cartoonist opened 8 years ago

cartoonist commented 8 years ago

I'm trying to construct and index a whole-genome variation graph of a relatively small genome containing ~17200 short regions. I constructed a variation graph for each region separately and generated a joint ID space across all the graphs using vg ids. When I try to create the xg index, I get this error message:

$ vg index -x wg.wx *.vg
[vg::map] could not concatenate graphs

In addition, when I try to list the variation graph file names explicitly, the command line exceeds the ARG_MAX limit and this error message appears:

Argument list too long
edawson commented 8 years ago

Hey Ali - that is a lot of files. Have you tried a subset of them as a smoke test?

vg index -x test.xg 1.vg 2.vg

Peeking at the source code, it looks like the [vg::map] could not concatenate graphs error is probably triggered by the same ARG_MAX limit. To concatenate the graphs we just do a basic cat and pass each argument as a temporary graph, so it will still get tripped up with this many files. Perhaps try doing them in batches by manually catting subsets of your graphs:

## Write a file that lists all of the input files
ls | grep "\.vg$" > files.txt

## Batch-process the files, 100 at a time. You may need more or fewer files per batch.
for i in $(seq 100 100 17200); do
    j=$(expr $i - 99)
    # Take lines j..i of files.txt, join them onto one line, and cat those graphs.
    cat $(sed -n "${j},${i}p" files.txt | sed ':a;N;$!ba;s/\n/ /g') > ${j}_${i}.vg
done

## Now cat the batched files. There will be far fewer of them.
for i in $(ls | grep -o "[0-9]*_[0-9]*\.vg"); do
    cat $i >> merged.vg
done

## Finally, index this (now gigantic) graph.
vg index -x merged.xg merged.vg

I'm not sure this will work, but I suspect it will be a step in the right direction. Hopefully @ekg or @adamnovak can chime in when they get some free time at their conference.

Magic sed line: http://unix.stackexchange.com/questions/114943/can-sed-replace-new-line-characters
Grab a specific range of lines from a file: http://stackoverflow.com/questions/191364/quick-unix-command-to-display-specific-lines-in-the-middle-of-a-file
seq: man seq

ekg commented 8 years ago

Just concatenate those graphs together and try it again. You can loop over cat 1.vg >>combined.vg and it should work.
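
For example, a minimal version of that loop (the glob in a for loop is expanded inside the shell itself, so it never hits the exec() argument limit):

## Concatenate all per-region graphs into one file, one at a time.
for f in *.vg; do
    cat "$f" >> combined.vg
done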

adamnovak commented 8 years ago

The *.vg notation will not work around your argument list length problem, by the way. The * is expanded by the shell, so vg gets the list of all the matching files. If that list is too long, I don't know exactly what will happen, but it won't work correctly. The shell might just cancel the expansion and pass along the literal "*.vg". If that happens, or if you otherwise get the shell not to expand it (for example by using quotes), vg will see "*.vg" literally, which is not a vg file that it can open, and so it won't work.
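
For reference, the limit the expanded command line is running into can be checked with getconf:

# Print the maximum combined size (in bytes) of arguments plus environment
# that exec() will accept on this system.
getconf ARG_MAX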

cartoonist commented 8 years ago

Thanks @ekg, @edawson, and @adamnovak. It's good to know that vg files can be merged by simply concatenating them with cat. I didn't know that, and I think it will solve my problem. I'll try it.

that is a lot of files. Have you tried a subset of them as a smoke test?

Yes, I tried, and it worked for a smaller number of vg files without any problem.

I'm not sure about the details, but it seems that the wildcard expansion is done successfully. The E2BIG error (which produces the "Argument list too long" message and is defined in <sys/errno.h>) occurs in the exec() system call. So when I run vg (or other commands) with the names of the vg files specified explicitly, I get this error message (for example, here I used vg ids to create a joint ID space across the graphs):

$ find . -iname "*.vg" | tr '\n' ' ' | xargs -0I {} -- vg ids -j {}
xargs: Argument list too long

But when I use wildcard, it works fine:

$ vg ids -j *.vg

But for indexing, when I try the same trick (using the wildcard) as a workaround for this issue:

$ vg index -x wg.xg *.vg
[vg::map] could not concatenate graphs

I don't get the E2BIG error message, but vg fails to index. That's why I think there is some internal problem here: maybe vg internally executes external commands whose length exceeds the ARG_MAX limit.

ekg commented 8 years ago

Your assessment is right. vg index runs a concatenation command internally to put all the files together. This must be exceeding the command length limit.

I think the only solution for large numbers of files is to concatenate them externally. The ID space resolution will be a pain, but it can be scripted by taking the files in order: for each file, increment its IDs by the maximum ID seen so far, then record that graph's maximum ID as the new running maximum. Looping through the files this way ensures the ID space has no collisions. I do not think the right thing will happen if you concatenate the files before doing this. Once the IDs are resolved, concatenation will work and produce a valid graph.
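
A rough shell sketch of that loop, reusing the files.txt list from above. It assumes vg ids -i N (shift node IDs up by N) writes the modified graph to stdout and that vg view -j dumps the graph as JSON (check both against your vg version), and it needs jq installed; the shifted_* file names are just illustrative:

## Sketch only: unify the ID space externally, then concatenate and index.
offset=0
while read f; do
    # Shift this graph's IDs past the largest ID seen so far.
    vg ids -i $offset $f > shifted_$f
    # Record the largest node ID in the shifted graph as the new running maximum.
    offset=$(vg view -j shifted_$f | jq '[.node[].id] | max')
    cat shifted_$f >> merged.vg
done < files.txt

vg index -x merged.xg merged.vg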

This is all pretty annoying and should be streamlined. We could implement file lists (a file with one .vg file per line) as a way to do this. I think this is somewhere between a feature request and a bug. Thoughts?

cartoonist commented 8 years ago

As a rough idea, vg merge would be a useful command that merges all given .vg files into one big .vg file with a collision-free ID space. The input .vg file names could be provided either as command-line arguments or as a file list. One could then use the resulting file with any vg command that accepts multiple .vg files.

ekg commented 8 years ago

It might be easier to teach vg ids to read a file list. Then you can do the merge as an ID space unification followed by cat. That said, I won't stop anyone from making vg merge!

ekg commented 8 years ago

@cartoonist Have you managed to resolve this (even in a hacky way as described here)?

The issue is open because this shouldn't need to be scripted out.

ekg commented 8 years ago

@cartoonist have you tried building the graph directly from a reference FASTA made of the 17200 contigs? It seems like it might just work, since the genome is small. The tutorial focused on the problem of building the graph for a very large genome.
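
Something along these lines might do it, assuming a single FASTA containing all 17200 contigs and, if you have variants, a matching bgzipped VCF; the file names here are placeholders:

## Build one graph directly from the multi-contig reference (plus variants),
## then index it, skipping the per-region graphs entirely.
vg construct -r all_contigs.fa -v variants.vcf.gz > wg.vg
vg index -x wg.xg wg.vg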

cartoonist commented 8 years ago

Hi @ekg, I was on vacation. Sorry for the late reply. I will check and let you know in a few days how I managed to create the graph.

ekg commented 8 years ago

I've been testing with the current HEAD and things are going pretty well. There are still some things that need to be documented for handling large graphs; I'll be updating the wiki to explain.

adamnovak commented 7 years ago

Is this still a problem? And is the fact that people may want to operate on more graphs than they can fit on a command line still in scope for vg?

Do we want to change the issue to something like "Make all the VG commands using VGSet know how to read a file of filenames as input"?

ekg commented 7 years ago

Makes sense to me.
