vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Error in indexing a .vg file #3295

Open userzxyz opened 3 years ago

userzxyz commented 3 years ago

Hello,

I have used pggb to build a graph genome for each chromosome separately and have all the related output files. I am now trying to convert the .gfa output from pggb to .vg for further work:

vg view -F -v chr1.gfa > output_chr1.vg
vg mod -X 256 output_chr1.vg > output_chr1_1.vg  
vg index -x output_chr1_1.xg -g output_chr1_1.gcsa -k 16 output_chr1_1.vg

I get an error at the last command above and the job gets killed. Can you please help with this? Thank you!

glennhickey commented 3 years ago

I think your best bet is to try vg autoindex directly on the gfa.
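
For example, something along these lines (--workflow map builds the indexes that vg map expects):

vg autoindex --workflow map --prefix chr1 --gfa chr1.gfa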

jeizenga commented 3 years ago

I would also recommend vg autoindex, but if the problem persists it would help us to see the error you're getting.

userzxyz commented 3 years ago

Thank you! I tried: vg autoindex --workflow map --prefix /path/chr1 --gfa chr1.gfa

However, it gives an error:

error:[vg] command autoindex not found
vg: variation graph tool, version v1.20.0 "Ginestra"
glennhickey commented 3 years ago

https://github.com/vgteam/vg/releases/tag/v1.32.0

userzxyz commented 3 years ago

Thank you! By using: vg autoindex --workflow map --prefix /path/chr1 --gfa chr1.gfa

I now got this error: Found kmer with offset >= 1024. GCSA2 cannot handle nodes greater than 1024 bases long. To enable indexing, modify your graph using vg mod -X 256 x.vg > y.vg. TATAGGACGGACCATC 177596:102177596:1040

glennhickey commented 3 years ago

That's not very auto. You'll have to run vg mod -X 256 chr1.gfa > chr1_chop.gfa first, it seems.

userzxyz commented 3 years ago

Thank you! I tried:

vg mod -X 256 chr1.gfa > chr1_chop.gfa
vg autoindex --workflow map --prefix /path/chr1 --gfa chr1_chop.gfa

and got the same error as with the .vg file:

[vg autoindex] Excecuting command: vg autoindex --workflow map --prefix /path/chr1 --gfa chr1_chop.gfa
[IndexRegistry]: Constructing VG graph from GFA input.
[IndexRegistry]: Constructing XG graph from GFA input.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
Killed
jeizenga commented 3 years ago

Hmm, yeah, that looks like a bug. I'll start an issue for it.

jeizenga commented 3 years ago

Oh, actually, would you mind also pasting whatever you see with vg version?

glennhickey commented 3 years ago

I think the underlying issue is that graphs from pggb can be too complex for GCSA, even after the standard pruning is run. I know that @jmonlong spent some time trying to get (early versions of) pggb graphs indexed this way, and it was a struggle. The good news is that these graphs can be mapped to with Giraffe, whose indexes you can make with --workflow giraffe.
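
For example, something like:

vg autoindex --workflow giraffe --prefix chr1 --gfa chr1.gfa

and then map the reads with vg giraffe instead of vg map.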

userzxyz commented 3 years ago

@jeizenga, vg version:

vg version v1.32.0 "Sedlo"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon
jeizenga commented 3 years ago

Are you building the indexes in an environment with limited tmp disk storage? GCSA2 indexing writes some rather large temporary files to disk during the indexing process.
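
If limited temporary space turns out to be the problem, you can point the temporary files at a larger scratch area; the /scratch path below is just a placeholder for whatever your cluster provides, and vg index can also be given a temp directory directly with -b:

export TMPDIR=/scratch/$USER/vg_tmp   # placeholder path; use your cluster's scratch space
mkdir -p "$TMPDIR"
vg index -b "$TMPDIR" -x output_chr1_1.xg -g output_chr1_1.gcsa -k 16 output_chr1_1.vg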

userzxyz commented 3 years ago

@glennhickey, As per your suggestion, I tried:

vg autoindex --workflow giraffe --prefix /path/chr1 --gfa chr1_chop.gfa

It ran without any error:

[vg autoindex] Excecuting command: vg autoindex --workflow giraffe --prefix /path/chr1 --gfa chr1_chop.gfa
[IndexRegistry]: Constructing XG graph from GFA input.
[IndexRegistry]: Finding snarls in graph.
[IndexRegistry]: Constructing distance index.
[IndexRegistry]: Constructing a greedy path cover GBWT
[IndexRegistry]: Constructing GBWTGraph.
[IndexRegistry]: Constructing minimizer index.

But it did not produce any output file. I am not sure if I messed something up.

userzxyz commented 3 years ago

@jeizenga, I am running it in a terminal on a cluster. I have limited storage on the cluster, but I did not specify any tmp disk in particular.

jeizenga commented 3 years ago

That command would have put the output in a directory path inside the root directory. Did you mean to do that? If you wanted to put the output in your current working directory, the prefix would just be chr1.
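
For example:

vg autoindex --workflow giraffe --prefix /path/chr1 --gfa chr1_chop.gfa   # writes /path/chr1.*
vg autoindex --workflow giraffe --prefix chr1 --gfa chr1_chop.gfa         # writes chr1.* into the current directory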

userzxyz commented 3 years ago

Thank you! I think that was the problem. Using vg autoindex --workflow giraffe --prefix chr1 --gfa chr1_chop.gfa has worked. However, previously after indexing I used the following steps:

vg map -f sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz -x output_chr1.xg -g output_chr1.gcsa > aln_chr1.gam
vg surject -x output_chr1.xg -b aln_chr1.gam > aln_chr1.bam
vg pack -x output_chr1.xg -g aln_chr1.gam -Q 5 -o aln_1.pack
vg call output_chr1.xg -k aln_1.pack > graph_chr1_calls.vcf

But I could not figure out how to proceed to the mapping step using the output files from vg autoindex --workflow giraffe --prefix chr1 --gfa chr1_chop.gfa. Can you please help me find the documentation or the right command to use? Thank you!

ekg commented 3 years ago

I'll take current pggb graphs through indexing. If we are running out of memory in GCSA2, then it seems that the automatic pruning is failing to remove all of the complex regions. In general, these graphs are only locally complex.

What species is this?

userzxyz commented 3 years ago

Hi @ekg, it's Oryza sativa. Thank you!

userzxyz commented 3 years ago

Hi, I tried indexing the pggb graphs directly using

vg view -F -v smooth.gfa > output_chr1.vg
vg mod -X 256 output_chr1.vg > output_chr1_1.vg  
vg index -x output_chr1_1.xg -g output_chr1_1.gcsa -k 16 output_chr1_1.vg

I get the error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  sd_vector_builder: requested capacity is larger than vector size.

Thank you!

ekg commented 3 years ago

You definitely want to use vg autoindex

userzxyz commented 3 years ago

Hi, I created the index using:

vg mod -X 256 smooth.gfa > chr1_chop.gfa 
vg autoindex --workflow giraffe --prefix vg_chr1 --gfa chr1_chop.gfa

It output the following files: vg_chr1.dist, vg_chr1.gg, vg_chr1.giraffe.gbwt, vg_chr1.min

I then used:

vg map -f sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz -x chr1_chop.gfa -g vg_chr1.giraffe.gbwt > aln_chr1.gam

and got an error: error[VPKG::load_one]: Correct input type not found while loading gcsa::GCSA

Can you please help me figure out which files to use for vg map? Thank you!

jltsiren commented 3 years ago

I kind of have an idea where the

terminate called after throwing an instance of 'std::runtime_error'
  what():  sd_vector_builder: requested capacity is larger than vector size.

error may come from. The issue will be fixed in the next vg release. Meanwhile, you can probably avoid the issue by choosing a smaller value than 256 bp for the maximum node length in the vg mod command.

Some time ago, I replaced the index GCSA construction uses for finding the first occurrence of each value. The old index had two sd_vectors, but the new index uses a single vector with multiset semantics. If the number of values is larger than the range of potential values, the sanity check gets triggered, even though it's unnecessary with multiset semantics. This is fixed in the latest master of vgteam SDSL, but it will take some time to propagate to a vg release.

Most of the graphs we index have low average node length (typically 10-15 bp). Prefix-doubling rarely expands them more than 1.5x, so the indexes GCSA construction uses tend to be below 2% of the maximum capacity. If you have chopped nodes to 256 bp and most of them are near the maximum, there is only room for 4x expansion during prefix-doubling. That's higher than usual but still somewhat reasonable, so using -X 128 or -X 64 may help.
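
That is, the same pipeline as before, only with a smaller maximum node length (the intermediate file names here just follow your earlier examples):

vg view -F -v smooth.gfa > output_chr1.vg
vg mod -X 128 output_chr1.vg > output_chr1_1.vg
vg index -x output_chr1_1.xg -g output_chr1_1.gcsa -k 16 output_chr1_1.vg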

userzxyz commented 3 years ago

@jltsiren, thank you! But even after using -X 64 or -X 128 for vg mod, the job gets killed in the index step. However, autoindex with the giraffe workflow works and creates the output files. I just have not been able to find out how to use the autoindex giraffe output files for mapping. Thank you!

jltsiren commented 3 years ago

If the vg index job gets killed without any error message, you have probably run out of memory or disk space because the graph is too complex. If you want to check whether this is the case, the progress information from vg index option -p may help. You can try to simplify the graph with vg prune, but vg autoindex should run it automatically.
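
A sketch of that manual route, with progress reporting turned on (file names follow your earlier examples; the pruned graph is only used for the GCSA index, while the XG still comes from the full graph):

vg index -p -x output_chr1_1.xg output_chr1_1.vg
vg prune -r output_chr1_1.vg > output_chr1_1.pruned.vg
vg index -p -g output_chr1_1.gcsa -k 16 output_chr1_1.pruned.vg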

If you want to run vg map, you need to use option --workflow map in vg autoindex to get the right indexes. Option --workflow giraffe builds indexes for vg giraffe.
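
For example, something like this should leave you with chr1.xg and chr1.gcsa (plus the .lcp file), which plug into the -x and -g options of vg map:

vg autoindex --workflow map --prefix chr1 --gfa chr1_chop.gfa
vg map -x chr1.xg -g chr1.gcsa -f sample_R1_paired.fastq.gz -f sample_R2_paired.fastq.gz > aln_chr1.gam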

userzxyz commented 3 years ago

Thank you! I have chromosome-wise graphs and GFA files. After using autoindex with the giraffe workflow, I am trying to use vg giraffe for alignment because I am not able to generate index files to use with vg map. Can you suggest how to proceed? Thank you!

jltsiren commented 3 years ago

The basic Giraffe command line is

vg giraffe -G graph.gg -m graph.min -H graph.gbwt -d graph.dist -f reads.fq -i > output.gam

for interleaved fastq and

vg giraffe -G graph.gg -m graph.min -H graph.gbwt -d graph.dist -f reads1.fq -f reads2.fq > output.gam

for two fastq files.

You can limit the number of mapping threads with -t N, switch the output to GAF with -o GAF, and print some progress information with -p.
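
With the file names from your autoindex run above, that would be something like (adjust -t to your cluster allocation):

vg giraffe -G vg_chr1.gg -m vg_chr1.min -H vg_chr1.giraffe.gbwt -d vg_chr1.dist -f sample_R1_paired.fastq.gz -f sample_R2_paired.fastq.gz -t 8 -p > aln_chr1.gam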