Open userzxyz opened 3 years ago
I think your best bet is to try vg autoindex
directly on the gfa.
I would also recommend vg autoindex
, but if the problem persists it would help us to see the error you're getting.
Thank you! I tried:
vg autoindex --workflow map --prefix /path/chr1 --gfa chr1.gfa
However, it gives one error:
error:[vg] command autoindex not found
vg: variation graph tool, version v1.20.0 "Ginestra"
Thank you! By using:
vg autoindex --workflow map --prefix /path/chr1 --gfa chr1.gfa
I now got this error:
Found kmer with offset >= 1024. GCSA2 cannot handle nodes greater than 1024 bases long. To enable indexing, modify your graph using
vg mod -X 256 x.vg >y.vg. TATAGGACGGACCATC 177596:102177596:1040
That's not very auto. You'll have to run vg mod -X 256 chr1.gfa > chr1_chop.gfa
first, it seems.
Thank you! I tried:
vg mod -X 256 chr1.gfa > chr1_chop.gfa
vg autoindex --workflow map --prefix /path/chr1 --gfa chr1_chop.gfa
and got the same error as I got using vg file:
[vg autoindex] Excecuting command: vg autoindex --workflow map --prefix /path/chr1 --gfa chr1_chop.gfa
[IndexRegistry]: Constructing VG graph from GFA input.
[IndexRegistry]: Constructing XG graph from GFA input.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
Killed
Hmm, yeah, that looks like a bug. I'll start an issue for it.
Oh, actually, would you mind also pasting whatever you see with vg version
?
I think the underlying issue is that graphs from pggb can be too complex for gcsa, even after the standard pruning is run. I know that @jmonlong spent some time trying to get (early versions of) pggb graphed indexed this way, and it was a struggle. The good news is that these graphs could be mapped to with giraffe, whose indexes you can make with --workflow giraffe
@jeizenga, vg version:
vg version v1.32.0 "Sedlo"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon
Are you building the indexes in an environment with limited tmp
disk storage? GCSA2 indexing writes some rather large temporary files to disk during the indexing process.
@glennhickey, As per your suggestion, I tried:
vg autoindex --workflow giraffe --prefix /path/chr1 --gfa chr1_chop.gfa
It ran without any error:
[vg autoindex] Excecuting command: vg autoindex --workflow giraffe --prefix /path/chr1 --gfa chr1_chop.gfa
[IndexRegistry]: Constructing XG graph from GFA input.
[IndexRegistry]: Finding snarls in graph.
[IndexRegistry]: Constructing distance index.
[IndexRegistry]: Constructing a greedy path cover GBWT
[IndexRegistry]: Constructing GBWTGraph.
[IndexRegistry]: Constructing minimizer index.
But did not give any output file. I am not sure if I messed up something.
@jeizenga, I am running in it terminal using a cluster
. I have limited storage on cluster but I did not specify any tmp
disk in particular.
That command would have put the output in a directory path
inside the root directory. Did you mean to do that? If you wanted to put the output in your current working directory, the prefix would just be chr1
.
Thank you! I think that was the problem. Using, vg autoindex --workflow giraffe --prefix chr1 --gfa chr1_chop.gfa
has worked. However, previously after indexing I used following steps:
vg map -f sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz -x output_chr1.xg -g output_chr1.gcsa > aln_chr1.gam
vg surject -x output_chr1.xg -b aln_chr1.gam > aln_chr1.bam
vg pack -x output_chr1.xg -g aln_chr1.gam -Q 5 -o aln_1.pack
vg call output_chr1.xg -k aln_1.pack > graph_chr1_calls.vcf
But I could find how to proceed to the mapping step using output files from vg autoindex --workflow giraffe --prefix chr1 --gfa chr1_chop.gfa
. Can you please help me find the documentation or right command to use. Thank you!
I'll take current pggb graphs through indexing. If we are running out of memory in GCSA2, then it seems that the automatic pruning is failing to remove all of the complex regions. In general, these graphs are only locally complex.
What species is this?
Hi @ekg, its Oryza sativa. Thank you!
Hi, I tried indexing pggb
graphs directly using
vg view -F -v smooth.gfa > output_chr1.vg
vg mod -X 256 output_chr1.vg > output_chr1_1.vg
vg index -x output_chr1_1.xg -g output_chr1_1.gcsa -k 16 output_chr1_1.vg
I get the error:
terminate called after throwing an instance of 'std::runtime_error'
what(): sd_vector_builder: requested capacity is larger than vector size.
Thank you!
You definitely want to use vg autoindex
On Thu, May 27, 2021, 22:56 userzxyz @.***> wrote:
Hi, I tried indexing pggb graphs directly using
vg view -F -v smooth.gfa > output_chr1.vg vg mod -X 256 output_chr1.vg > output_chr1_1.vg vg index -x output_chr1_1.xg -g output_chr1_1.gcsa -k 16 output_chr1_1.vg
I get the error:
terminate called after throwing an instance of 'std::runtime_error' what(): sd_vector_builder: requested capacity is larger than vector size.
Thank you!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/3295#issuecomment-849936186, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEKRS7DJGJV5MTUBPJTTP2WZTANCNFSM45FCAG5Q .
Hi, I created the index
using:
vg mod -X 256 smooth.gfa > chr1_chop.gfa
vg autoindex --workflow giraffe --prefix vg_chr1 --gfa chr1_chop.gfa
It output following files: vg_chr1.dist, vg_chr1.gg, vg_chr1.giraffe.gbwt, vg_chr1.min
Then used:
vg map -f sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz -x chr1_chop.gfa -g vg_chr1.giraffe.gbwt > aln_chr1.gam
and got an error:
error[VPKG::load_one]: Correct input type not found while loading gcsa::GCSA
Can you please help on which file to use for vg map
. Thank you!
I kind of have an idea where the
terminate called after throwing an instance of 'std::runtime_error'
what(): sd_vector_builder: requested capacity is larger than vector size.
error may come from. The issue will be fixed in the next vg release. Meanwhile, you can probably avoid the issue by choosing a smaller value than 256 bp for the maximum node length in the vg mod
command.
Some time ago, I replaced the index GCSA construction uses for finding the first occurrence of each value. The old index had two sd_vector
s, but the new index uses a single vector with multiset semantics. If the number of values is larger than the range of potential values, the sanity check gets triggered, even though it's unnecessary with multiset semantics. This is fixed in the latest master of vgteam SDSL, but it will take some time to propagate to a vg release.
Most of the graphs we index have low average node length (typically 10-15 bp). Prefix-doubling rarely expands them more than 1.5x, so the indexes GCSA construction uses tend to be below 2% of the maximum capacity. If you have chopped nodes to 256 bp and most of them are near the maximum, there is only room for 4x expansion during prefix-doubling. That's higher than usual but still somewhat reasonable, so using -X 128
or -X 64
may help.
@jltsiren, thank you! But even after using -64
or -128
for vg mod
, the job gets killed
in index
step. However, auto index giraffe
works and creates the output file. I am just not been able to find how to use auto index giraffe
output files for mapping
purposes. Thank you!
If the vg index
job gets killed without any error message, you have probably run out of memory or disk space because the graph is too complex. If you want to check whether this is the case, the progress information from vg index
option -p
may help. You can try to simplify the graph vg prune
, but vg autoindex
should run it automatically.
If you want to run vg map
, you need to use option --workflow map
in vg autoindex
to get the right indexes. Option --workflow giraffe
builds indexes for vg giraffe
.
Thank you! I have chromosome wise graphs and gfa files . After using autoindex giraffe
, I am trying to use vg giraffe
for alignment because I am not able to generate an index file to use for vg map
. Can you suggest something to proceed? Thank you!
The basic Giraffe command line is
vg giraffe -G graph.gg -m graph.min -H graph.gbwt -d graph.dist -f reads.fq -i > output.gam
for interleaved fastq and
vg giraffe -G graph.gg -m graph.min -H graph.gbwt -d graph.dist -f reads1.fq -f reads2.fq > output.gam
for two fastq files.
You can limit the number of mapping threads with -t N
, switch the output to GAF with -o GAF
, and print some progress information with -p
.
Hello,
I have used pggb to make graph genome for each chromosome separately and have all the related output files. I am now trying to use .gfa output from pggb to convert to .vg for further work:
I get an error at the last command line above and the job gets killed. Can you please help with this? Thank you!