vgteam / vg

tools for working with genome variation graphs

vg index failure #1760

Open TanyaDvorkina opened 6 years ago

TanyaDvorkina commented 6 years ago

Hello!

We are trying to build an index for a C. elegans assembly graph and are getting an error:

>> ./vg view -vF celegans_graph.gfa > celegans_graph.vg
>> ./vg mod -X 1024 celegans_graph.vg > celegans_graph.split.vg
>> ./vg index -p -t 16 celegans_graph.split.vg -x celegans_graph.split.vg.xg -g celegans_graph.split.vg.gcsa -k 15
Built base XG index
Saving XG index to disk...
Generating kmer files...
 loading graph                  [======================================================================================================]100.0%
DiskIO::write(): Write failed

We are using the vg binary from the latest release. The GFA file is here: https://drive.google.com/open?id=14oVyiT4ISvUvD6CXbOOcWPBXFdjoowZW

glennhickey commented 6 years ago

vg index -g can take up to 2 TB of disk space for temporary files. Perhaps you're running out of disk?
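A quick way to check this before starting a long indexing job is to look at the free space on the filesystem that holds the temporary directory. This is only a sketch: the helper name `free_kb` is made up for illustration, and it assumes temporary files go under `$TMPDIR` (falling back to `/tmp`), which may not match your setup.

```shell
# Print the free space (in KiB) on the filesystem that holds a directory.
# free_kb is a hypothetical helper, not part of vg.
free_kb() {
    # -P forces one line per filesystem; -k reports sizes in KiB.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Assumption: temporary files are written under $TMPDIR, or /tmp if unset.
tmp="${TMPDIR:-/tmp}"
echo "free space in $tmp: $(free_kb "$tmp") KiB"
```

If the reported number is far below the expected temporary file size, pointing the temporary directory at a larger filesystem is worth trying before anything else.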

Pruning the graph with vg prune before gcsa indexing is sometimes necessary, and can be something to try. There's an example here https://github.com/vgteam/vg/wiki/working-with-a-whole-genome-variation-graph

jltsiren commented 6 years ago

This is a challenging graph for VG. The original failure was caused by lack of disk space in the temporary directory. When I tried to rerun the commands on a system with 180-200 GB of free memory, I ran out of memory during kmer generation. The kmer file was 43 GB at that point. I tried pruning the graph, but because pruning uses the same kmer traversal code, it also ran out of memory.

I tried replacing the -k 15 with -k 10 in the vg index command. This would have built a 160-mer index instead of a 240-mer index. Memory usage was under control now, but the kmer file just kept growing. I killed the job when the kmer file was 254 GB, which is more than when indexing 1000GP human graphs.
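The 240 and 160 figures are consistent with the initial kmer length being doubled four times during GCSA construction. The number of doubling steps is an assumption inferred from those two numbers, not something stated in this thread, and `final_k` is a hypothetical helper:

```shell
# Final kmer length after N doubling steps: k * 2^N.
# final_k is a hypothetical helper; N = 4 is inferred from the numbers above.
final_k() {
    echo $(( $1 << $2 ))
}

final_k 15 4   # initial length from -k 15
final_k 10 4   # initial length from -k 10
```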

Then I tried pruning the graph with -k 12 -e 2, allowing 2 edge choices in a 12-mer:

vg prune -k 12 -e 2 -t 16 -p celegans_graph.split.vg >celegans_graph.pruned.vg

This simplified the graph from 209076 nodes, 241139 edges, and 93032100 bases to 187878 nodes, 194012 edges, and 92981820 bases. I then tried building a GCSA index for the pruned graph:

vg index -p -t 16 celegans_graph.pruned.vg -g celegans_graph.split.vg.gcsa -k 15

And this worked.

The situation is essentially the same as the "complex graph without a reference or haplotypes" example in the wiki. You build an XG index for celegans_graph.split.vg, prune the graph, build a GCSA index for the pruned graph, and delete the pruned graph. Because our kmer traversal code apparently has issues with complex graphs, you have to use a shorter kmer length (and hence allow fewer edge choices) when pruning.
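Put together, the steps above can be sketched as a single shell function. The commands and flags are the ones used earlier in this thread. The `VG` variable and the `build_indexes` wrapper are illustrative additions, and the function assumes it is run in the directory containing celegans_graph.split.vg.

```shell
# Sketch of the XG -> prune -> GCSA workflow described above.
# VG and build_indexes are hypothetical names; the vg commands and flags
# are copied from this thread.
VG="${VG:-./vg}"
GRAPH=celegans_graph.split.vg
PRUNED=celegans_graph.pruned.vg

build_indexes() {
    # 1. XG index from the full (unpruned) split graph.
    "$VG" index -p -t 16 "$GRAPH" -x "$GRAPH.xg" &&
    # 2. Prune: allow at most 2 edge choices in any 12-mer.
    "$VG" prune -k 12 -e 2 -t 16 -p "$GRAPH" > "$PRUNED" &&
    # 3. GCSA index from the pruned graph, named after the split graph.
    "$VG" index -p -t 16 "$PRUNED" -g "$GRAPH.gcsa" -k 15 &&
    # 4. The pruned graph is only needed for GCSA construction.
    rm -f "$PRUNED"
}
```

Calling `build_indexes` runs the four steps in order and stops at the first failure, so a failed prune does not go on to build a GCSA index from an incomplete file.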

TanyaDvorkina commented 6 years ago

Thank you for a quick response and great explanation! Everything worked.

TanyaDvorkina commented 6 years ago

Hello,

Is it possible to align PacBio reads to this graph? I took the parameters from https://github.com/vgteam/vg/issues/770 and ran vg map to align 100 PacBio reads generated by PBSIM, but got an error.

> ./vg map -t 1 -d celegans_graph.split.vg -x celegans_graph.split.vg.xg  -g celegans_graph.split.vg.gcsa -f  sim_pacbio.fastq -M 1 -w 2048 -W 64 > sim_pacbio.gam

src/central_freelist.cc:333] tcmalloc: allocation failed 49152
error:[gssw] Could not allocate memory required for alignment traceback matrixes.

File with reads is here https://drive.google.com/file/d/1BG1Rxqb5EwZcIHqquBLyKdvwh01ZTURh/view?usp=sharing

Thank you!

ekg commented 6 years ago

It is possible to align PacBio reads against assembly graphs.

You will likely need to prune the graph with vg mod -M, setting the maximum degree to around 8, to remove nodes with high degree. Then re-index.

I would also suggest using a lower -w parameter, such as the default (256).
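As a rough sketch of that suggestion, reusing the indexing steps from earlier in this thread: the file names `celegans_graph.capped.vg` and `celegans_graph.capped.pruned.vg`, the `VG` variable, and the `remap_pacbio` wrapper are made up for illustration. Whether the kmer pruning step is still needed after the degree cut is an assumption, and -w is simply omitted so that vg map uses its default.

```shell
# Sketch: cap node degree, re-index, then map with the default -w.
# VG, remap_pacbio, and the "capped" file names are hypothetical.
VG="${VG:-./vg}"
IN=celegans_graph.split.vg
CAPPED=celegans_graph.capped.vg
CAPPED_PRUNED=celegans_graph.capped.pruned.vg

remap_pacbio() {
    # Degree-based pruning with vg mod -M, using a value around 8.
    "$VG" mod -M 8 "$IN" > "$CAPPED" &&
    # Re-index: XG from the capped graph, GCSA from a kmer-pruned copy
    # (assumption: the same prune settings as before still apply).
    "$VG" index -p -t 16 "$CAPPED" -x "$CAPPED.xg" &&
    "$VG" prune -k 12 -e 2 -t 16 -p "$CAPPED" > "$CAPPED_PRUNED" &&
    "$VG" index -p -t 16 "$CAPPED_PRUNED" -g "$CAPPED.gcsa" -k 15 &&
    rm -f "$CAPPED_PRUNED" &&
    # Same map call as in the question, without the -w 2048 override.
    "$VG" map -t 1 -x "$CAPPED.xg" -g "$CAPPED.gcsa" \
        -f sim_pacbio.fastq -M 1 -W 64 > sim_pacbio.gam
}
```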

TanyaDvorkina commented 6 years ago

Thank you! I'll try it.