Memory limit exceeded during vg autoindex for GCSA/LCP indexing

HuangXZhuo commented 5 hours ago

Hello,

I am encountering an issue when running vg autoindex to construct a graph from an HG002 reference FASTA and VCF file. The command I am using is as follows:

vg autoindex --workflow map --threads 24 --prefix /public1/home/sc30852/HG002/vg/graph --ref-fasta ../../hg002.mat.fasta --vcf ../mat.vcf.gz

Here is part of the log output:

[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[IndexRegistry]: Constructing VG graph from FASTA and VCF input.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 5 of kmer paths (503.81 GB) exceeds memory limit (503.781 GB).

It seems like the memory consumption during the GCSA indexing step exceeds the available memory (around 504 GB). Do you have any suggestions on how I can reduce memory usage, or is there a way to chunk the input differently to avoid this issue?

Any help would be appreciated!

Thank you!

jltsiren commented 4 hours ago

What do you have in the VCF file? Usually when GCSA construction runs out of memory, it happens because the graph is too complex, there are too many variants in repetitive regions, or there are too many nondeterministic locations (where the first/last bases of reference and alternative nodes are identical).
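One rough way to see whether variants pile up in particular regions is to count VCF records per fixed-size window. The sketch below uses only standard awk and sort. The function name `variant_density` and the window size are illustrative assumptions, not part of vg, and a high count only hints at a complex region. It does not measure graph complexity directly.

```shell
# Hypothetical helper (not a vg command): count VCF records per window.
# Usage: variant_density WINDOW_BP < input.vcf
# Output: CHROM <tab> 0-based window start <tab> record count, sorted by position.
variant_density() {
  awk -v w="$1" '
    /^#/ { next }                          # skip VCF header lines
    {
      start = int(($2 - 1) / w) * w        # window containing 1-based POS
      n[$1 "\t" start]++
    }
    END { for (k in n) print k "\t" n[k] }
  ' | sort -k1,1 -k2,2n
}

# Example (assumes the compressed VCF from this thread):
#   gzip -dc ../mat.vcf.gz | variant_density 1000000 | sort -k3,3nr | head
```

Sorting the output by the third column in descending order, as in the commented example, lists the densest windows first.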

HuangXZhuo commented 4 hours ago

To align the maternal and paternal genomes of HG002 for diploid mapping, I used MUMmer with the following commands:

nucmer -t 24 ../hg002.mat.fasta ../hg002.pat.fasta
delta-filter -i 99 -l 100000 out.delta > out.delta.filter

Could it be that the VCF file generated from this process makes the graph too complex?

jltsiren commented 4 hours ago

I'm not very familiar with MUMmer, but I think the problem is that you have too much duplicated sequence in the graph. GCSA construction does not like that, because it can't collapse identical k-mers if they start from different positions in the graph.
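As a toy illustration of that point (a simplification on a linear string, not how GCSA works internally), the sketch below counts k-mer start positions separately from distinct k-mer strings. The helper name `kmer_counts` is hypothetical. When a sequence is duplicated, the number of start positions grows while the number of distinct strings does not, and an index that keys k-mers by start position has to keep every copy.

```shell
# Hypothetical helper: compare k-mer start positions with distinct k-mer strings.
# Usage: kmer_counts K SEQUENCE
# Output: "<number of start positions> <number of distinct k-mers>"
kmer_counts() {
  printf '%s\n' "$2" | awk -v k="$1" '
    {
      for (i = 1; i + k - 1 <= length($0); i++) {
        kmer = substr($0, i, k)
        total++                            # every start position counts
        if (!(kmer in seen)) {             # a given string counts only once
          seen[kmer] = 1
          distinct++
        }
      }
    }
    END { print total + 0, distinct + 0 }
  '
}

kmer_counts 3 ACGTACGT   # prints "6 4": ACG and CGT each start at two positions
```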

If you want to build a graph based on two aligned haplotypes, Minigraph-Cactus should be a better choice. You can then map reads using Giraffe, which is faster than vg map. I'm not sure if the default clipped graph or the full unclipped graph is a better choice, so you should probably try both.

HuangXZhuo commented 4 hours ago

Thank you so much for your advice! I will definitely try using Minigraph-Cactus to build the graph. I appreciate your help and will let you know how it goes after testing. Thanks again!

ld9866 commented 3 hours ago

Dear developers,

I generated a VCF file from 29 genomes using the Minigraph-Cactus pipeline. I now want to do pan-transcriptome analysis, so I need to generate the required files.

There is still enough storage, but the task terminates on its own after running for one day. How can I solve this?

The only files I did not obtain are sample.trans.spliced.gcsa and sample.trans.spliced.gcsa.lcp; the other files are fine.

Best, Dong

Code:

vg autoindex --threads 32 --workflow mpmap --workflow rpvg --prefix sample.trans --ref-fasta reference.fa --vcf sample.result.vcf.gz --tx-gff Duroc.111.chr1-18.gtf --tmp-dir /home/test/nvmedata2/02.Pantrans/TMP -M 850G

Error:

warning:[vg::Constructor] Skipping duplicate variant with hash c5757e8eca1e42a9bafd6bf1aed0bacad2826367 at 1:274146418
[IndexRegistry]: Constructing GBWT from spliced VG graph and phased VCF input.
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Stripping allele paths from spliced VG.
[IndexRegistry]: Constructing haplotype-transcript GBWT and finishing spliced VG.
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.002 GB) exceeds memory limit (850 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.045 GB) exceeds memory limit (850 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.017 GB) exceeds memory limit (850 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.06 GB) exceeds memory limit (850 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.039 GB) exceeds memory limit (850 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (850.082 GB) exceeds memory limit (850 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
[IndexRegistry]: Constructing GCSA/LCP indexes.
DiskIO::write(): Write failed
DiskIO::write(): You may have run out of temporary disk space at /home/test/nvmedata2/02.Pantrans/TMP
[IndexRegistry]: Unrecoverable error in GCSA2 indexing.
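For long logs like the one above, it can help to tabulate the memory-limit messages. The sketch below is a hypothetical helper built on standard grep and awk (the name `kmer_path_limits` is an assumption, not part of vg). It extracts every "Memory use of file ... exceeds memory limit" message, even when several appear on one line, and prints the file index, the reported use, the limit, and the difference in GB. The reported use only shows where the check fired, so by itself it does not say how much memory would have been enough.

```shell
# Hypothetical helper: summarize GCSA memory-limit messages from an autoindex log.
# Usage: kmer_path_limits < autoindex.log
# Output per message: "<file index> <reported GB> <limit GB> <difference GB>"
kmer_path_limits() {
  grep -o 'Memory use of file [0-9]* of kmer paths ([0-9.]* GB) exceeds memory limit ([0-9.]* GB)' |
    awk '{
      sub(/^[(]/, "", $9)     # reported use, e.g. "(850.002" -> "850.002"
      sub(/^[(]/, "", $14)    # limit, e.g. "(850" -> "850"
      printf "%s %s %s %.3f\n", $5, $9, $14, $9 - $14
    }'
}
```

In the log above, every retry stops within about 0.1 GB of the 850 GB limit, which is consistent with the limit being hit again after each round of more aggressive pruning.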

vgteam / vg

Memory limit exceeded during vg autoindex for GCSA/LCP indexing #4404