vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

index whole-genome variation graph #3323

Open Yexin-Zhang opened 3 years ago

Yexin-Zhang commented 3 years ago

Hi, I want to construct a whole-genome variation graph with vg. My reference genome contains more than 3,200 scaffolds, and I don't want to drop them when constructing the graph.

I followed the tutorial here: https://github.com/vgteam/vg/wiki/Working-with-a-whole-genome-variation-graph , treating each scaffold as a separate chromosome. Construction, node ID assignment, and XG indexing all went fine, but GCSA indexing always failed with an error at step 1. Is there any way to include all the scaffolds in my whole-genome variation graph?

Also, following the tutorial's method for building a whole-genome graph, I end up with one .vg file per chromosome, one .xg file, and one .gcsa file. With these files, is there any way to add new variants from a VCF file to the graph without redoing everything?

Many thanks, Monica

jeizenga commented 3 years ago

Unfortunately, no, I don't think so. If you have new variants you want to add into the graph, I think the most direct solution would be to use bcftools to merge and normalize the VCFs and then repeat the construction pipeline.
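The merge-and-normalize step suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the file names are hypothetical placeholders, the inputs are assumed to be bgzipped and indexed, and by default the commands are printed rather than executed. Note that bcftools merge expects the input files to carry different samples; if both VCFs describe the same samples, bcftools concat may be the better fit.

```shell
#!/bin/sh
# Sketch only: all file names are hypothetical placeholders.
# By default the commands are printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# merge_and_normalize REF.fa OUT.vcf.gz IN1.vcf.gz IN2.vcf.gz ...
merge_and_normalize() {
    ref="$1"; out="$2"; shift 2
    # Combine the input VCFs into one bgzipped VCF.
    run bcftools merge -O z -o merged.tmp.vcf.gz "$@"
    # Left-align and normalize indels against the reference.
    run bcftools norm -f "$ref" -O z -o "$out" merged.tmp.vcf.gz
    # Index the normalized VCF so downstream tools can read it by region.
    run tabix -p vcf "$out"
}
```

For example, `merge_and_normalize ref.fa combined.vcf.gz old.vcf.gz new.vcf.gz` prints the three commands in order; the resulting VCF would then be the input for rerunning the construction pipeline.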

Yexin-Zhang commented 3 years ago

Thank you for your reply. Could you please also give me some advice on my first question: can I include all 3,200 scaffolds in my whole-genome variation graph? Thanks!

jeizenga commented 3 years ago

Yes, that should be fine. There's no real limit on the number of components the graph can have. If you follow the same pipeline as in that wiki you should be able to produce a usable XG with all of the components combined. We've also started redirecting people to vg autoindex, which handles a lot of the pipelining for you. Its interface is based on giving it interchange formats like FASTA and VCF and then telling it which mapping tool you want to use.
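With thousands of scaffolds, the per-component loop from the wiki is easier to drive from the FASTA headers than from a hand-written list. The sketch below is an assumption-laden illustration, not the wiki's exact pipeline: the function names are made up for this example, the vg construct flags (-C, -R, -r, -v) are taken from the wiki's per-chromosome command as I recall it, and the commands are only printed so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Print one sequence name per FASTA record:
# the text after '>' up to the first whitespace character.
list_contigs() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Print (do not run) one vg construct command per contig.
# Assumption: one .vg file per scaffold, named after the scaffold.
print_construct_commands() {
    ref="$1"; vcf="$2"
    list_contigs "$ref" | while IFS= read -r name; do
        printf 'vg construct -C -R %s -r %s -v %s > %s.vg\n' \
            "$name" "$ref" "$vcf" "$name"
    done
}
```

Calling `print_construct_commands ref.fa variants.vcf.gz` lists one construct command per scaffold; the later steps of the wiki pipeline (ID assignment and indexing) would then run over the resulting .vg files.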

Yexin-Zhang commented 3 years ago

Yes, I tried the pipeline in the wiki, and I could produce one vg file per component plus a single XG. When it came to GCSA indexing, it always gave an error at step one.

I also tried vg autoindex with this command: vg autoindex --workflow map --prefix auto -r $ref -v $var -T $temp -t 32. The error was:

[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[IndexRegistry]: Constructing VG graph from FASTA and VCF input.
[IndexRegistry]: Constructing GBWT from VG graph and phased VCF input.
error: [HaplotypeIndexer::parse_vcf] the variant file does not contain phasings

I was running it with an unphased VCF. Does autoindex only accept phased VCFs? I am using vg version v1.32.0 "Sedlo".

Thank you for your help!

jeizenga commented 3 years ago

Ah, yes, I know that bug. It should be fixed in v1.33.0.
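A small check along these lines can confirm that the installed vg is at least v1.33.0 before rerunning autoindex with an unphased VCF. It is a sketch that assumes the version string looks like the one quoted earlier in this thread (v1.32.0 "Sedlo") and that `sort -V` is available; the function names are made up for this example.

```shell
#!/bin/sh
# Extract the first vX.Y.Z token from text such as: vg version v1.32.0 "Sedlo"
extract_version() {
    printf '%s\n' "$1" | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on version-aware sorting (sort -V): the smaller of the two
# versions sorts first, so $1 >= $2 exactly when $2 comes out first.
version_ge() {
    a="${1#v}"; b="${2#v}"
    [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Usage (assumes vg is on PATH and reports its version on the first line):
#   installed=$(extract_version "$(vg version | head -n 1)")
#   version_ge "$installed" v1.33.0 || echo "vg is older than v1.33.0"
```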

Yexin-Zhang commented 3 years ago

Thank you @jeizenga ! I will try autoindex again with the latest version.