vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

privacy preserving genome graphs #2031

Open ekg opened 5 years ago

ekg commented 5 years ago

This is a thread to discuss how to build genome graphs with haplotypes that preserve the privacy of the genomes that the graphs were built from.

As a test data set, I downloaded a piece of chr20 for the 1000GP:

tabix -h -p vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 20:1000000-2000000 | bgzip >ALL.chr20:0-2M.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
tabix -p vcf ALL.chr20:0-2M.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

And I made a mini-reference file for the same region:

samtools faidx ~/ref/hs37d5.fa 20:0-2000000 | sed s/:0-2000000$// >hs37d5.chr20:0-2M.fa
samtools faidx hs37d5.chr20:0-2M.fa

Now we need to build the graph and haplotype indexes for this subset, following @jltsiren 's documentation https://github.com/vgteam/vg/wiki/Index-Construction

vg construct -r hs37d5.chr20:0-2M.fa -v ALL.chr20:0-2M.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -a -p > graph.vg
vg index -x graph.xg -G graph.gbwt -v ALL.chr20:0-2M.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -p graph.vg
ekg commented 5 years ago

It's possible to build the GBWT from alignments (which represent paths through the graph). Here's a simple example of how to do so from the automated tests:

cd vg/test
vg construct -r small/x.fa -v small/x.vcf.gz > x.vg
vg index -x x.xg x.vg
vg sim -n 100 -l 100 -x x.xg -a >sim.gam
vg index -G x_gam.gbwt -M sim.gam -x x_gam.xg x.vg

At this point we have a GBWT that includes the simulated reads. To build custom GBWTs, we only need to write the desired haplotypes in GAM format, which avoids possibly complicated conversion of the haplotypes to VCF.

We can visualize the haplotypes using two methods. In both cases, we need to convert the "threads" in the GBWT to "paths" in the VG graph. We do this with vg paths:

(cat x.vg ; vg paths -g x_gam.gbwt -x x_gam.xg -T -V ) >x+.vg

# method one, graphviz
vg view -dp x+.vg | dot -Tpdf -o x+.pdf

# method two, native linear visualization
vg index -x x+.xg x+.vg
vg viz -x x+.vg -o x+.svg