Closed willgdjones closed 8 years ago
It looks like you're stuck on a region with a lot of kmers. This can happen where there are a lot of variants close together. We're not yet using the haplotypes from the VCF in the indexing, so we have to prune these regions out using a filter that removes edges from the graph which induce more than some number of bifurcations in our kmer size.
Here's how you do it:
vg mod -pl 16 -t 16 -e 3 x.vg \
| vg mod -S -l 32 - \
| vg kmers -gB -k 16 -F -H 1000000000 -T 1000000001 -t 16 - \
> x.graph
This generates a pruned deBruijn graph. We are cutting out edges that would cause 3 bifurcations in 16 bp. This effectively masks out ultra-variable regions of the graph. The masking should be minimal and you can check by looking at the size of the graph after the pruning (before vg kmers).
Now index the graph with GCSA2.
vg index -i x.graph -g x.gcsa
Let me know how you get on with this.
Great thanks Erik I'm pruning now - taking a little while but I'll keep you updated.
Amazingly the prune operation is still running!
Try it on a smaller graph. Does it work?
On Fri, Apr 1, 2016, 15:12 William Jones notifications@github.com wrote:
Amazingly the prune operation is still running!
— You are receiving this because you commented.
Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/286#issuecomment-204411589
Are you missing a "-" argument to vg kmers?
Yes I am! I am editing the example to match.
Sorry about that Will. An important debugging technique: its always worth dropping into htop or top to check that the process is running and not waiting.
That said maybe something else is happening here.
@willgdjones how's it going? Did you resolve this?
@willgdjones ping; how'd it go?
Edit that's not what's happening.
Looking just at chr1: vg mod -pl 16 -t 16 -e 3 doesn't change anything, either there is something strange with the VG I'm using or maybe there is something buggy in VG mod.
Also indexing just on chr1 hasn't completed yet - neither on the farm nor my local machine :/.
What happens? How do I reproduce the problem?
On Wed, Apr 6, 2016 at 12:17 PM William Jones notifications@github.com wrote:
Still unable to run vg mod -pl 16 -t 16 -e 3 on chr1, either there is something strange with the VG I'm using or maybe there is something buggy in VG mod..
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/286#issuecomment-206290159
@willgdjones break the problem down into pieces by making a bunch of small graphs for regions containing a given number of variants.
You can use the vcfevenregions
tool in vcflib to do this. It's on our servers in /lustre/scratch113/projects/graphs/bin/vcfevenregions.
Run the construct and mod cycle for each region. Use time
to see how long it takes. You'll find any offending bits as the ones which take a long time.
As for vg version, you can use a known static build, such as /lustre/scratch113/projects/graphs/bin/vg-81b5a2cb.
Another solution may be to just increase the pruning rate. For instance, use -pl 16 -e 4 -t 16
.
Looking just at chr1: vg mod -pl 16 -t 16 -e 3 doesn't change anything, either there is something strange with the VG I'm using or maybe there is something buggy in VG mod.
It won't change anything unless you have a cluster of 3 SNPs in 16bp.
Do you get the same graph before and after? (Use md5sum to check.)
So! It turns out that I've finished indexing it! Sorry for the silence I was just making sure it was working out. Here are the things I've learned:
vg kmer
command that caused the problem, just using vg mod -pl 16 -t 16 -e 3 x.vg | vg mod -S -l 32 -
and indexing the resulting works How do you know that the whole genome pruning didn't work? What happened? Could we try to reproduce the problem?
So many questions.
In summary, congrats @willgdjones. You've got your pan-genome :)
If you want to start mapping against the index, I'd suggest using 81b5a2cb with parameters like -GX 0.9 -A 5
. I'm considering making these default because they're giving a nice balance between performance and alignment quality.
Now now trying to redo full genome pruning just to make sure it really wasn't working last time, not just taking longer than expected.
And thanks! However I'm still getting some funny behaviour with vg find
, even though I've built all the indexes. Surprisingly I can vg map
with no problem but:
vg find -r 1:10 -x output/chromosomes/chr1/yeastgraphchr1pruned.xg -g output/chromosomes/chr1/yeastgraphchr1pruned.gcsa output/chromosomes/chr1/yeastgraphchr1pruned.vg
gives me
terminate called after throwing an instance of 'vg::indexOpenException' what(): unable to open variant graph index Abort trap: 6
What are you trying to do with vg find?
The error is because it's looking for a rocksdb index with the same base name as the .vg. We should probably axe this behavior.
Anyway, you don't need to pass it the graph--- the graph is fully encoded in the .xg index and the paths through it. But, vg find -x output/chromosomes/chr1/yeastgraphchr1pruned.xg -g output/chromosomes/chr1/yeastgraphchr1pruned.gcsa
will do nothing.
Also, vg map
doesn't require the .vg file.
.vg files are like an interchange format for the graph. But, they don't support any of the operations we need when we're querying or mapping against the graph.
I see! Leaving out the .vg file did the trick and I'm playing around with visualising things nicely
Hey Eric - any way to grab a subset of a graph? vg find -r N:M
returns the nodes between N and M but doesn't seem to display any edges. Would be great to display the subgraph generated between the nodes N:M
@willgdjones you expand to the neighborhood around these using -c
.
Try -c 10
.
The help text should read -c, --context STEPS expand the context of the subgraph this many steps
. I'll fix it in the next push.
@willgdjones @edawson and I have figured out the problem.
vg index -x
appears to hang because the graph has been constructed with alternate allele paths. As a result, there are huge numbers of paths, which the indexer uses by default as expensive (expanded) positional paths. These are organized in xg in such a way as to make it easy to figure out where in the graph corresponds to a certain offset in a given reference sequence, and also to go from a part of the graph to positions in the sequence. These are made O(1) operations by an extremely memory inefficient layout of the paths.
For paths which we just want to store and decompress, or which we'd like to use for haplotype frequency counting, there is gPBWT. That's a work in progress. Presumably, when it lands, we will want to say which paths to use for positional indexing and which we want to use in gPBWT.
So the solution is to not use the -a
flag in vg construct
. Then, indexing the graph with vg index -x
takes about 15 seconds.
I'm having difficulties building the index for my yeast variation graph.
I constructed the graph from a multi-sample vcf with a max node size of 100.
./vg construct -r data/SGD_2010.fasta -v data/SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz -m 100 -a > graphs/yeastgraph.vg
This constructs just fine.
However building the index:
vg index -x yeastgraph.xg -g yeastgraph.gcsa -k 16 -p -t 4 yeastgraph.vg loading graph [================================]100.0% processing kmers of yeastgraph [= ] 2.9%
Gets stuck on this. Also, it also doesn't complete when I ran it on Sanger farm overnight.
Might be something wrong with the graph, here's the head:
H VN:Z:1.0 S 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACAC P 1 chr1 1 + 57M L 1 + 2 + 0M L 1 + 3 + 0M S 2 A P 2 chr1 2 + 1M L 2 + 4 + 0M S 3 C L 3 + 4 + 0M S 4 CACA P 4 chr1 3 + 4M L 4 + 5 + 0M L 4 + 6 + 0M S 5 T P 5 chr1 4 + 1M L 5 + 7 + 0M S 6 C L 6 + 7 + 0M S 7 CCTAACACTACCCTAACAC