vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

Unable to create Yeast graph index #286

Closed willgdjones closed 8 years ago

willgdjones commented 8 years ago

I'm having difficulties building the index for my yeast variation graph.

I constructed the graph from a multi-sample vcf with a max node size of 100.

./vg construct -r data/SGD_2010.fasta -v data/SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz -m 100 -a > graphs/yeastgraph.vg

This constructs just fine.

However building the index:

vg index -x yeastgraph.xg -g yeastgraph.gcsa -k 16 -p -t 4 yeastgraph.vg loading graph [================================]100.0% processing kmers of yeastgraph [= ] 2.9%

Gets stuck on this. Also, it also doesn't complete when I ran it on Sanger farm overnight.

Might be something wrong with the graph, here's the head:

H VN:Z:1.0 S 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACAC P 1 chr1 1 + 57M L 1 + 2 + 0M L 1 + 3 + 0M S 2 A P 2 chr1 2 + 1M L 2 + 4 + 0M S 3 C L 3 + 4 + 0M S 4 CACA P 4 chr1 3 + 4M L 4 + 5 + 0M L 4 + 6 + 0M S 5 T P 5 chr1 4 + 1M L 5 + 7 + 0M S 6 C L 6 + 7 + 0M S 7 CCTAACACTACCCTAACAC

ekg commented 8 years ago

It looks like you're stuck on a region with a lot of kmers. This can happen where there are a lot of variants close together. We're not yet using the haplotypes from the VCF in the indexing, so we have to prune these regions out using a filter that removes edges from the graph which induce more than some number of bifurcations in our kmer size.

Here's how you do it:

vg mod -pl 16 -t 16 -e 3 x.vg \
    | vg mod -S -l 32 - \
    | vg kmers -gB -k 16 -F -H 1000000000 -T 1000000001 -t 16 - \
        > x.graph

This generates a pruned deBruijn graph. We are cutting out edges that would cause 3 bifurcations in 16 bp. This effectively masks out ultra-variable regions of the graph. The masking should be minimal and you can check by looking at the size of the graph after the pruning (before vg kmers).

Now index the graph with GCSA2.

vg index -i x.graph -g x.gcsa

Let me know how you get on with this.

willgdjones commented 8 years ago

Great thanks Erik I'm pruning now - taking a little while but I'll keep you updated.

willgdjones commented 8 years ago

Amazingly the prune operation is still running!

ekg commented 8 years ago

Try it on a smaller graph. Does it work?

On Fri, Apr 1, 2016, 15:12 William Jones notifications@github.com wrote:

Amazingly the prune operation is still running!

— You are receiving this because you commented.

Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/286#issuecomment-204411589

adamnovak commented 8 years ago

Are you missing a "-" argument to vg kmers?

ekg commented 8 years ago

Yes I am! I am editing the example to match.

Sorry about that Will. An important debugging technique: its always worth dropping into htop or top to check that the process is running and not waiting.

That said maybe something else is happening here.

ekg commented 8 years ago

@willgdjones how's it going? Did you resolve this?

ekg commented 8 years ago

@willgdjones ping; how'd it go?

willgdjones commented 8 years ago

Edit that's not what's happening.

Looking just at chr1: vg mod -pl 16 -t 16 -e 3 doesn't change anything, either there is something strange with the VG I'm using or maybe there is something buggy in VG mod.

Also indexing just on chr1 hasn't completed yet - neither on the farm nor my local machine :/.

ekg commented 8 years ago

What happens? How do I reproduce the problem?

On Wed, Apr 6, 2016 at 12:17 PM William Jones notifications@github.com wrote:

Still unable to run vg mod -pl 16 -t 16 -e 3 on chr1, either there is something strange with the VG I'm using or maybe there is something buggy in VG mod..

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/286#issuecomment-206290159

ekg commented 8 years ago

@willgdjones break the problem down into pieces by making a bunch of small graphs for regions containing a given number of variants.

You can use the vcfevenregions tool in vcflib to do this. It's on our servers in /lustre/scratch113/projects/graphs/bin/vcfevenregions.

Run the construct and mod cycle for each region. Use time to see how long it takes. You'll find any offending bits as the ones which take a long time.

As for vg version, you can use a known static build, such as /lustre/scratch113/projects/graphs/bin/vg-81b5a2cb.

Another solution may be to just increase the pruning rate. For instance, use -pl 16 -e 4 -t 16.

ekg commented 8 years ago

Looking just at chr1: vg mod -pl 16 -t 16 -e 3 doesn't change anything, either there is something strange with the VG I'm using or maybe there is something buggy in VG mod.

It won't change anything unless you have a cluster of 3 SNPs in 16bp.

Do you get the same graph before and after? (Use md5sum to check.)

willgdjones commented 8 years ago

So! It turns out that I've finished indexing it! Sorry for the silence I was just making sure it was working out. Here are the things I've learned:

ekg commented 8 years ago

How do you know that the whole genome pruning didn't work? What happened? Could we try to reproduce the problem?

ekg commented 8 years ago

So many questions.

In summary, congrats @willgdjones. You've got your pan-genome :)

If you want to start mapping against the index, I'd suggest using 81b5a2cb with parameters like -GX 0.9 -A 5. I'm considering making these default because they're giving a nice balance between performance and alignment quality.

willgdjones commented 8 years ago

Now now trying to redo full genome pruning just to make sure it really wasn't working last time, not just taking longer than expected.

And thanks! However I'm still getting some funny behaviour with vg find, even though I've built all the indexes. Surprisingly I can vg map with no problem but:

vg find -r 1:10 -x output/chromosomes/chr1/yeastgraphchr1pruned.xg -g output/chromosomes/chr1/yeastgraphchr1pruned.gcsa output/chromosomes/chr1/yeastgraphchr1pruned.vg

gives me

terminate called after throwing an instance of 'vg::indexOpenException' what(): unable to open variant graph index Abort trap: 6

ekg commented 8 years ago

What are you trying to do with vg find?

The error is because it's looking for a rocksdb index with the same base name as the .vg. We should probably axe this behavior.

Anyway, you don't need to pass it the graph--- the graph is fully encoded in the .xg index and the paths through it. But, vg find -x output/chromosomes/chr1/yeastgraphchr1pruned.xg -g output/chromosomes/chr1/yeastgraphchr1pruned.gcsa will do nothing.

ekg commented 8 years ago

Also, vg map doesn't require the .vg file.

.vg files are like an interchange format for the graph. But, they don't support any of the operations we need when we're querying or mapping against the graph.

willgdjones commented 8 years ago

I see! Leaving out the .vg file did the trick and I'm playing around with visualising things nicely

willgdjones commented 8 years ago

Hey Eric - any way to grab a subset of a graph? vg find -r N:M returns the nodes between N and M but doesn't seem to display any edges. Would be great to display the subgraph generated between the nodes N:M

ekg commented 8 years ago

@willgdjones you expand to the neighborhood around these using -c.

Try -c 10.

The help text should read -c, --context STEPS expand the context of the subgraph this many steps. I'll fix it in the next push.

ekg commented 8 years ago

@willgdjones @edawson and I have figured out the problem.

vg index -x appears to hang because the graph has been constructed with alternate allele paths. As a result, there are huge numbers of paths, which the indexer uses by default as expensive (expanded) positional paths. These are organized in xg in such a way as to make it easy to figure out where in the graph corresponds to a certain offset in a given reference sequence, and also to go from a part of the graph to positions in the sequence. These are made O(1) operations by an extremely memory inefficient layout of the paths.

For paths which we just want to store and decompress, or which we'd like to use for haplotype frequency counting, there is gPBWT. That's a work in progress. Presumably, when it lands, we will want to say which paths to use for positional indexing and which we want to use in gPBWT.

So the solution is to not use the -a flag in vg construct. Then, indexing the graph with vg index -x takes about 15 seconds.