vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

VG autoindex, giraffe, pack and call workflow #3863

Closed Yutang-ETH closed 1 year ago

Yutang-ETH commented 1 year ago

Hello VG team,

Thank you very much for your support and I apologize if this is not the right place to ask for help.

I am trying to genotype some variants using vg autoindex + giraffe + pack + call workflow, and according to #3369, it is suggested we need to add -R "VG w/ Variant Paths" to vg autoindex to keep alt path in the resulting vg graph. This works perfectly with a single chromosome because autoindex generates one vg graph file, for example, chr1.varpaths.vg, and I could use this chr1.varpaths.vg for vg giraffe, pack and call. However, my reads are whole genome sequencing reads, and it is not a good practice to map whole genome reads to just one chromosome to genotype variants.

Therefore, now, I would like to genotype all chromosomes in one shot, and I fed all chromosome vcf files into autoindex with -R "VG w/ Variant Paths", however, I got several vg graph files, each representing one chromosome. With all these vg graph files, I got confused and didn't know how to continue with downstream vg girafffe, pack and call.

Here is an example using the test datasets available on vg github, vg/test/small/:

vg autoindex --workflow giraffe --prefix mytest --ref-fasta xy.fa --vcf xy.vcf.gz --threads 10 -R "VG w/ Variant Paths"

This command returned me two vg files: mytest.0.varpaths.vg mytest.1.varpaths.vg

Could you please give me some suggestions on this workflow? How could continue with several vg files?

By the way, I also tried the workflow described in #3560 with all chromosomes, however, vg pack crashed with the same error (Signal 6 occured) (but this workflow works fine with the test dataset for all chromosomes) and using giraffe.gbz from autoindex for vg call to genotype variants doesn't work as there is no alt path in giraffe.gbz.

I am using vg version v1.43.0 "Barisano".

Thank you very much for your help.

Best wishes, Yutang

glennhickey commented 1 year ago

My first suggestion: Upgrade to the latest vg release.

If I run your command from vg/test

vg autoindex --workflow giraffe --prefix mytest --ref-fasta small/xy.fa --vcf small/xy.vcf.gz --threads 10 -R "VG w/ Variant Paths"

I get

-rw-rw-r-- 1 hickey hickey     8916 Feb 20 09:43 mytest.0.varpaths.vg
-rw-rw-r-- 1 hickey hickey     9652 Feb 20 09:43 mytest.1.varpaths.vg
-rw-rw-r-- 1 hickey hickey     4904 Feb 20 09:43 mytest.dist
-rw-rw-r-- 1 hickey hickey     4048 Feb 20 09:43 mytest.giraffe.gbz
-rw-rw-r-- 1 hickey hickey    41192 Feb 20 09:43 mytest.min

While the .vg graphs are split by chromosome, the indexes are for the full genome, and these are the indexes you will want to use for mapping and calling.
If I check the paths, they seem sensible (x, y reference paths and a pair of haplotypes for each).

vg paths -Mv mytest.giraffe.gbz
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
1#0#y#0 HAPLOTYPE   1   0   y   0   NO_SUBRANGE
1#1#y#0 HAPLOTYPE   1   1   y   0   NO_SUBRANGE
y   GENERIC NO_SAMPLE_NAME  NO_HAPLOTYPE    y   NO_PHASE_BLOCK  NO_SUBRANGE
1#0#x#0 HAPLOTYPE   1   0   x   0   NO_SUBRANGE
1#1#x#0 HAPLOTYPE   1   1   x   0   NO_SUBRANGE
x   GENERIC NO_SAMPLE_NAME  NO_HAPLOTYPE    x   NO_PHASE_BLOCK  NO_SUBRANGE

mytest.giraffe.gbz can now be used as input for vg giraffe, vg pack and vg call. If you want to restrict calling to known haplotypes (good idea), make sure to add -z to vg call.

If you map with vg giraffe on the .gbz but then manually combine the .vg graphs with vg combine and use those results for pack and you get a crash it would be because the reason in that other issue: The node id's in .vg are not compatible with the .gbz due to chopping.

In general, you can now use .gbz files for pretty much any vg command execpt those that modify a graph. But if you modify the graph, you have to regenerate the .gbz anyway for mapping. We are encouraging this transition to .gbz, especially for whole-genome indexes, because it scales much better with the number of haplotype paths than other formats such as xg,vg or gfa.

Hope this helps!

Yutang-ETH commented 1 year ago

Wow, @glennhickey, thank you very very very much for your quick reply and very very very detailed explanation. I think it is clear to me.

I just updated vg to the latest version, 1.46, and I found vg call has the -z option now!

I will try what you suggested and come back to report my results. I am so glad that I asked for help, otherwise I will be wasting time.

Thank you very much again.

Best wishes, Yutang

Yutang-ETH commented 1 year ago

Hi @glennhickey ,

Just over one night, my workflow now is at the vg call stage and so far so good! No errors and warnings, it works like a charm.

Just share what I have done here in case some other users might want to know:

# specify data and prefix
myfa="test.fa"
myvcf="test.vcf.gz"
myprefix="test"

# first autoindex the graph for alignment
vg autoindex --workflow giraffe --prefix ${myprefix} --ref-fasta ${myfa} --vcf ${myvcf} --threads 60 --tmp-dir ../tmp

# giraffe mapping short reads to graph
mkdir mygam
cat sample.lst | parallel -j 5 -k "vg giraffe -Z ${myprefix}.giraffe.gbz -m ${myprefix}.min -d ${myprefix}.dist -f ../GBS/{}.fastq.trim.fq -t 12 -N {} > mygam/{}.gam"

# count read support
mkdir mypack
cat sample.lst | parallel -j 5 -k "vg pack -x ${myprefix}.giraffe.gbz -t 12 -g mygam/{}.gam -o mypack/{}.pack -Q 5 -s 5"

# calculate snarl
vg snarls ${myprefix}.giraffe.gbz > ${myprefix}.snarls

# genotype the variants
mkdir myvcf
cat sample.lst | parallel -j 10 -k "vg call ${myprefix}.giraffe.gbz -z -a -k mypack/{}.pack -s {} -t 6 -r ${myprefix}.snarls > myvcf/{}.vcf"

I used -a in vg call as I would like to compare variants between samples.

Thank you very much again for your kind help.

Best wishes, Yutang