vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg index - error: [write_gcsa_kmers()] size limit exceeded #3411

Open SimonaSecomandi opened 3 years ago

SimonaSecomandi commented 3 years ago

1. What were you trying to do?

Call variants with vg. I'm trying to align paired-end Illumina reads from different bird individuals to a vg graph generated with the Cactus Pangenome Pipeline (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md). The vg file comprises 10 high-quality assemblies (primary and alternate) of 5 individuals of the same species.

2. What did you want to happen?

Create an index with 16-mers for vg map.

3. What actually happened?

error: [write_gcsa_kmers()] size limit exceeded

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

No stacktrace.

5. What data and command can the vg dev team use to make the problem happen?

vg index -t 32 -x pangenome_k16.xg -g pangenome_k16.gcsa -k 16 -b pangenome.vg

6. What does running vg version say?

vg version v1.30.0 "Carentino"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon

Many thanks!

adamnovak commented 3 years ago

@glennhickey What's the current recommended pruning workflow for squashing a generic Cactus graph into something simple enough to make a GCSA index out of?

glennhickey commented 3 years ago

I will punt this to @jmonlong, who's just gotten this working on our latest graphs. I think -k 64 is one of the key things.

jmonlong commented 3 years ago

I've had some success using a higher kmer size during the pruning. I didn't change the default kmer size used for the GCSA indexing. I ran something like:

vg prune -k 45 graph.vg > graph.pruned.vg
mkdir -p tmp_gcsa
vg index --temp-dir tmp_gcsa -p -g graph.gcsa graph.pruned.vg

In practice, I first decomposed the graph into components with vg chunk, pruned each component separately, and then fed all the pruned graphs to vg index to minimize memory usage.

Pruning with kmer sizes 45 and 64 reduced the memory usage enough that I could GCSA-index a human-size Cactus graph with 100-200 GB of memory.
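In case it's useful, the per-component approach described above can be sketched roughly like this (untested sketch; the `vg chunk` flag names and the `component` file prefix are assumptions here, so check `vg chunk --help` for your version):

```shell
# Split the graph into connected components, one .vg file per component;
# -C selects component mode and -b sets the output file prefix (assumed flags).
vg chunk -x graph.xg -C -b component

# Prune each component separately to keep peak memory down.
for c in component*.vg; do
    vg prune -k 45 "$c" > "pruned_$(basename "$c")"
done

# Build a single GCSA index from all the pruned components.
mkdir -p tmp_gcsa
vg index --temp-dir tmp_gcsa -p -g graph.gcsa pruned_component*.vg
```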

I hope this helps.

SimonaSecomandi commented 3 years ago

Thank you for your support! I generated the index with these commands:

vg prune -t 32 -k 45 pangenome.vg >pangenome_pruned.vg
vg mod -t 32 -X 256 pangenome_pruned.vg > pangenome_pruned_mod.vg
vg index -t 32 -b /tmp -p -x pangenome_pruned_mod_mod.xg -g pangenome_pruned_mod.gcsa pangenome_pruned_mod.vg

glennhickey commented 3 years ago

You only want to use prune for the gcsa index. Your commands should look like:

vg mod -X 256 pangenome.vg > pangenome_chopped.vg
vg index pangenome_chopped.vg -x pangenome_chopped.xg
vg prune -t 32 -k 45 pangenome_chopped.vg > pangenome_pruned.vg
vg index -t 32 -b /tmp -p -g pangenome_pruned.gcsa pangenome_pruned.vg

SimonaSecomandi commented 3 years ago

Hello, I managed to index, prune, and align my Illumina WGS data against the variation graph:

vg mod -t 32 -X 256 pangenome.vg > pangenome_chopped.vg
vg index -t 32 -x pangenome_chopped.xg pangenome_chopped.vg
vg prune -t 32 -k 45 pangenome_chopped.vg > pangenome_chopped_pruned.vg
vg index -t 32 -b /tmp -p -g pangenome_chopped_pruned.gcsa pangenome_chopped_pruned.vg
vg map -t 32 -f WGS_forward.fastq.gz -f WGS_reverse.fastq.gz -x pangenome_chopped.xg -g pangenome_chopped_pruned.gcsa > WGS_aln.gam

The next step would be the augmentation step, right? Which of the generated vg files should I use? The original, the chopped, or the pruned?

e.g. vg augment pangenome.vg WGS_aln.gam -A WGS_aug.gam > WGS_aug.vg

I'm struggling to figure out the next steps for variant calling in general.
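For context, the usual downstream sequence following the vg augment-then-call workflow looks roughly like the sketch below (flag choices like `-Q 5` are assumptions here; check them against your vg version). One note: the GAM stores node IDs from the graph it was mapped against, so the augmentation should use the chopped graph, not the original or the pruned one:

```shell
# Augment the graph the reads were mapped to, embedding the read variation
# (-A writes the reads re-expressed against the augmented graph).
vg augment pangenome_chopped.vg WGS_aln.gam -A WGS_aug.gam > WGS_aug.vg

# Re-index the augmented graph for the calling steps.
vg index -t 32 -x WGS_aug.xg WGS_aug.vg

# Compute read support over the augmented graph (-Q 5 drops low-MAPQ reads).
vg pack -t 32 -x WGS_aug.xg -g WGS_aug.gam -Q 5 -o WGS_aln.pack

# Genotype the augmented graph from the read support.
vg call WGS_aug.xg -k WGS_aln.pack > WGS_calls.vcf
```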