Open prashantpandey opened 5 years ago
You need to build the GBWT of the haplotypes and then extract these as paths, finally merging them back with the original vg file.
On Wed, Sep 4, 2019, 07:51 Prashant Pandey notifications@github.com wrote:
I created a vg representation on the Y chromosome from the 1000Genomes dataset. It contains ~1200 samples. However, when I list paths using the vg paths -L command it only lists the reference Y. It doesn't list all the input samples.
How to construct the vg representation with annotated paths?
Thanks, Prashant
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2427?email_source=notifications&email_token=AABDQELLZZ7M2KAENHC3WSDQH3S6VA5CNFSM4ITLSQA2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HJDM6YQ, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEJVSWX7BNLF2DSC2HTQH3S6VANCNFSM4ITLSQAQ .
Hi Erik,
Thanks for the quick response. I built gbwt using this command ./bin/vg index -G Y.gbwt -v ./data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz Y.vg
Could you please tell me what do you mean by extracting these as paths and merging them back with the original vg?
Thanks, Prashant
This is currently the only way to read the threads in the GBWT into paths in the graph:
vg augment -B graph.vg <(vg paths -x graph.xg -g graph.gbwt -X ) >graph.haps.vg
There used to be another way, but it is now broken. See https://github.com/vgteam/vg/issues/2428.
Hi Erik,
I tried the command ./bin/vg augment -B Y.vg <(./bin/vg paths -x Y.xg -g Y.gbwt -X) > Y.haps.v
.
When I try to list paths in the augmented vg file it lists out 10X more paths. There are only 1233 samples in chromosome Y. Also, the names of the paths are not sample names.
± |master U:10 ?:4 ✗| → ./bin/vg paths -L -v Y.haps.vg | wc -l
123865
Sample output [only a part of the output]:
.................
_alt_fff069a3d9e89818d5d67a2df29b70ef7adabee7_1
_alt_fff10edb98c5343c910741eca38b6af42980a594_0
_alt_fff10edb98c5343c910741eca38b6af42980a594_1
_alt_fff251567f16915247a79099f2f1938d8c51c855_0
_alt_fff251567f16915247a79099f2f1938d8c51c855_1
_alt_fff32e71fff4ef42c4f5a46f9dd24327f7c5a6dc_0
_alt_fff32e71fff4ef42c4f5a46f9dd24327f7c5a6dc_1
_alt_fff3f34637888e6da90e364e9be75950a8be0c2f_0
_alt_fff3f34637888e6da90e364e9be75950a8be0c2f_1
_alt_fff4cd6938c522ab58683916e2e65c47f81b9721_0
_alt_fff4cd6938c522ab58683916e2e65c47f81b9721_1
.................
However, looking into the gbwt file gives expected answers in terms of number of paths.
± |master U:10 ?:4 ✗| → ./bin/vg gbwt -M Y.gbwt
1233 sample(s), 2466 haplotype(s), 1 contig(s)
Do you know what's wrong here?
Thanks, Prashant
Try this:
./bin/vg paths -L -v Y.haps.vg | cut -f 2 -d _ | sort | uniq
What do you get?
On Wed, Sep 4, 2019 at 7:14 PM Prashant Pandey notifications@github.com wrote:
Hi Erik,
I tried the command ./bin/vg augment -B Y.vg <(./bin/vg paths -x Y.xg -g Y.gbwt -X) > Y.haps.v. When I try to list paths in the augmented vg file it lists out 10X more paths. There are only 1233 samples in chromosome Y. Also, the names of the paths are not sample names.
± |master U:10 ?:4 ✗| → ./bin/vg paths -L -v Y.haps.vg | wc -l
123865
Sample output [only a part of the output]:
.................
_alt_fff069a3d9e89818d5d67a2df29b70ef7adabee7_1
_alt_fff10edb98c5343c910741eca38b6af42980a594_0
_alt_fff10edb98c5343c910741eca38b6af42980a594_1
_alt_fff251567f16915247a79099f2f1938d8c51c855_0
_alt_fff251567f16915247a79099f2f1938d8c51c855_1
_alt_fff32e71fff4ef42c4f5a46f9dd24327f7c5a6dc_0
_alt_fff32e71fff4ef42c4f5a46f9dd24327f7c5a6dc_1
_alt_fff3f34637888e6da90e364e9be75950a8be0c2f_0
_alt_fff3f34637888e6da90e364e9be75950a8be0c2f_1
_alt_fff4cd6938c522ab58683916e2e65c47f81b9721_0
_alt_fff4cd6938c522ab58683916e2e65c47f81b9721_1
.................
Do you know what's wrong here?
Thanks, Prashant
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2427?email_source=notifications&email_token=AABDQEILQE5XCXV2VMGKFU3QH6DBPA5CNFSM4ITLSQA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD53CWDA#issuecomment-527837964, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEO3IHLFG3NCHWAJ4WTQH6DBPANCNFSM4ITLSQAQ .
± |master U:10 ?:4 ✗| → ./bin/vg paths -L -v Y.haps.vg | cut -f 2 -d _ | sort | uniq
alt
Y
Please follow these instructions to build and index the graph: https://github.com/vgteam/vg/wiki/Index-Construction#with-haplotypes
vg construct -r reference.fa -v variants.vcf.gz -a > graph.vg
vg index -x graph.xg -G graph.gbwt -v variants.vcf.gz -g graph.gcsa graph.vg
Hi Erik,
Here's the list of commands I executed to build an annotated graph for chromosome Y based on the wiki link you shared. The output is still the same. It doesn't give the list of 1233 samples.
Please let me know what's wrong.
$ ./bin/vg construct -r ./data/Homo_sapiens.GRCh37.dna.chromosome.Y.fa -v ./data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz -a > Y.vg
$ ./bin/vg index -x Y.xg -G Y.gbwt -v ./data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz Y.vg
$ ./bin/vg augment -B Y.vg <(./bin/vg paths -x Y.xg -g Y.gbwt -X) > Y.haps.vg
$ ./bin/vg paths -L -v Y.haps.vg > Y.paths
$ wc -l Y.paths
123865
Y.paths
contains the following output..
Y
_alt_0006c45e8075b7f343b922a92993ccf852a84a0a_0
_alt_0006c45e8075b7f343b922a92993ccf852a84a0a_1
_alt_00070d2bd741f434c3677fa54d9ed22af2f47346_0
_alt_00070d2bd741f434c3677fa54d9ed22af2f47346_1
_alt_00072e613c5eae5375278eb1d8c610c2f45285e7_0
_alt_00072e613c5eae5375278eb1d8c610c2f45285e7_1
_alt_0007d14388dc30fcedc68f6f88195e8577adca26_0
_alt_0007d14388dc30fcedc68f6f88195e8577adca26_1
_alt_00090bf957055fb274cc70560b1e83d0d232241a_0
_alt_00090bf957055fb274cc70560b1e83d0d232241a_1
_alt_000be1f8054ef5384d02adbb02a4397b51130d4a_0
_alt_000be1f8054ef5384d02adbb02a4397b51130d4a_1
_alt_000c5907319de341651b11726782e5837e52dede_0
_alt_000c5907319de341651b11726782e5837e52dede_1
_alt_000d4f6841a098ed925e75d479d11a48f95f766d_0
_alt_000d4f6841a098ed925e75d479d11a48f95f766d_1
_alt_000dcbb1d8984d7f36d5966d338f25291cece41d_0
_alt_000dcbb1d8984d7f36d5966d338f25291cece41d_1
_alt_000dda25331c9bd09e60fa89e987d10806d051f4_0
_alt_000dda25331c9bd09e60fa89e987d10806d051f4_1
_alt_000e815fce3b73481356cbb181c1a04f6d17d527_0
_alt_000e815fce3b73481356cbb181c1a04f6d17d527_1
_alt_000f2ff8ac187d81f97bbf03f4857dd4eb9e80de_0
_alt_000f2ff8ac187d81f97bbf03f4857dd4eb9e80de_1
_alt_000ffc0c236eca579f1eee6927bbe8833350492f_0
_alt_000ffc0c236eca579f1eee6927bbe8833350492f_1
_alt_0010359b80e9c220ce5066ee7c28a278f8dffe9c_0
_alt_0010359b80e9c220ce5066ee7c28a278f8dffe9c_1
_alt_0010b0db34e48fad59280139285f29e4436ff575_0
_alt_0010b0db34e48fad59280139285f29e4436ff575_1
_alt_0013c6e86667f0633e051f6d7d09d4ff01a84ccd_0
_alt_0013c6e86667f0633e051f6d7d09d4ff01a84ccd_1
.....................................................
Thanks, Prashant
vg construct -a
adds a bunch of alt paths (_alt_*
) to the graph. These paths are used to create the GBWT index, but aren't of much use here beyond that. You can grep them out of your output or just drop them from your graph:.
# drop the alt paths by retaining only the reference path
vg mod -r Y Y.vg > Y_no_alts.vg
# continue as before
vg augment -B Y_no_alts.vg <(vg paths -x Y.xg -g Y.gbwt -X) > Y.haps.vg
Some samples may still have more than one path (should be obvious from the path names). This can happen when there's an issue with the phasing information that breaks a haplotype thread into pieces.
Thanks, Glenn! I tried that.
./bin/vg mod -r Y Y.vg > Y_no_alts.vg
./bin/vg augment -B Y_no_alts.vg <(./bin/vg paths -x Y.xg -g Y.gbwt -X) > Y.haps.vg
./bin/vg paths -L -v Y.haps.vg
Y
I only see the reference in the path list.
What do you get from vg paths -x Y.xg -g Y.gbwt -L
? I get 2129 paths of the form:
_thread_HG01164_Y_0_0
_thread_HG01164_Y_0_1
_thread_HG01054_Y_0_0
_thread_HG01072_Y_0_0
_thread_HG01088_Y_0_0
_thread_HG01094_Y_0_0
_thread_HG01107_Y_0_0
_thread_HG01110_Y_0_0
_thread_HG01161_Y_0_0
_thread_HG01187_Y_0_0
etc
Hi @glennhickey ,
I still only get the reference chromosome Y as the result.
Could you please send me the list of commands you used to get this output? I can try the exact same commands and see if it works.
Here are my exact commands. They show there are 2129 haplotype threads in the GBWT. I used vg's current master branch
wget http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz
wget http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz.tbi
wget ftp://ftp.ensembl.org/pub/release-56/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa.gz
gzip -d Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa.gz
VCF=ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz
FA=Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa
vg construct -r $FA -v $VCF -a > Y.vg
vg index -x Y.xg -G Y.gbwt -v $VCF Y.vg
vg paths -x Y.xg -g Y.gbwt -L | wc -l
2129
The key thing is that the VCF should have phased genotypes in it. If that's not the case then you may get a result like this.
On Thu, Sep 5, 2019, 02:01 Glenn Hickey notifications@github.com wrote:
Here are my exact commands. They show there are 2129 haplotype threads in the GBWT. I used vg's current master branch
wget http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz wget http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz.tbi wget ftp://ftp.ensembl.org/pub/release-56/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa.gz gzip -d Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa.gz
VCF=ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz FA=Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa
vg construct -r $FA -v $VCF -a > Y.vg vg index -x Y.xg -G Y.gbwt -v $VCF Y.vg vg paths -x Y.xg -g Y.gbwt -L | wc -l 2129
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2427?email_source=notifications&email_token=AABDQEJOVLEFQTHTFRWQTF3QH7SW3A5CNFSM4ITLSQA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD54INZI#issuecomment-527992549, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQELZ6DEIVXHQSRCBG7TQH7SW3ANCNFSM4ITLSQAQ .
Hi @ekg and @glennhickey,
So, what is the total space requirement for VG if I want to build a graph with annotations?
Is it the space needed by .vg
+ .xg
+ .gbwt
?
Thanks, Prashant
I'm not sure, but it's typically around 50G for a whole genome with a large number of small variants. It really depends. Did you resolve your problem?
Actually, no. I don't know why I can't reproduce what @glennhickey got.
./bin/vg construct -r ./data/Homo_sapiens.GRCh37.dna.chromosome.Y.fa -v ./data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz -a > Y.vg
./bin/vg index -x Y.xg -G Y.gbwt -v ./data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz Y.vg
./bin/vg paths -x Y.xg -g Y.gbwt -L | wc -l
1
Also, I want to compute the total space required by VG tool to represent a variation graph with sample annotations (i.e., all sample paths). So, I need some way to either create a .vg
file with annotations or add up the space needed by .vg
+ .xg
+ .gbwt
files?
Please let me know which one is the correct (or better) approach.
Thanks, Prashant
If my commands aren't working for you, I'd guess there's something wrong with the vg version you're using. Which is it? Are you able to try with the lastest master branch? For reference, here are my file sizes.
-rw-r--r-- 1 hickey hickey 5837990 Apr 6 2015 ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz
-rw-r--r-- 1 hickey hickey 8075 Apr 7 2015 ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz.tbi
-rw-r--r-- 1 hickey hickey 60363183 Sep 4 12:58 Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa
-rw-r--r-- 1 hickey hickey 20 Sep 4 12:59 Homo_sapiens.GRCh37.56.dna.chromosome.Y.fa.fai
-rw-r--r-- 1 hickey hickey 710 Sep 4 12:58 README
-rw-r--r-- 1 hickey hickey 51669864 Sep 4 13:03 Y.gbwt
-rw-r--r-- 1 hickey hickey 139665434 Sep 4 12:59 Y.vg
-rw-r--r-- 1 hickey hickey 215059703 Sep 4 13:03 Y.xg
I am at this commit on the master branch:
commit b3ed0145a0b03589f65857fe309697017356795f (HEAD -> master, origin/master, origin/HEAD)
Merge: 116eb4cd7 9d6137e46
Author: Adam Novak <anovak@soe.ucsc.edu>
Date: Wed Aug 28 10:00:30 2019 -0700
Merge pull request #2415 from vgteam/update-lru
Uplade lru_cache to one that should avoid deprecation warnings
-rw-r--r-- 1 ppandey ppandey 5718913 Sep 3 18:49 ../variantdb/data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
-rw-r--r-- 1 ppandey ppandey 6562 Sep 3 18:49 ../variantdb/data/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
-rw-r--r-- 1 ppandey ppandey 60363187 Nov 27 2015 ../variantdb/data/Homo_sapiens.GRCh37.dna.chromosome.Y.fa
-rw-r--r-- 1 ppandey ppandey 20 May 2 16:35 ../variantdb/data/Homo_sapiens.GRCh37.dna.chromosome.Y.fa.fai
-rw-rw-r-- 1 ppandey ppandey 13900761 Sep 5 08:49 Y.gbwt
-rw-rw-r-- 1 ppandey ppandey 30338927 Sep 5 08:42 Y.vg
-rw-rw-r-- 1 ppandey ppandey 35591246 Sep 5 08:49 Y.xg
It's suspicious that your input files are of (slightly) different sizes than mine. And it's clear that your .vg graph is completely different (30MB vs 140MB), so the problem at least seems to start with the vg construct
command. What outputs do you get for the following?
vg stats -NE Y.vg
2007047
2069941
vg paths -v Y.vg -L | wc -l
123865
./bin/vg stats -NE Y.vg
2007047
2069941
./bin/vg paths -v Y.vg -L | wc -l
123865
Hmm. @adamnovak have our .vg graphs gotten bigger recently? As for the issue, I can't really think of anything other than trying a fresh build of the latest master branch.
@glennhickey I know we turned off gzip compression of output files in a lot of cases, because @jltsiren said it was slowing down index loads. It looks like right now construct is set to output uncompressed graphs (see: false
).
Hi @ekg , @glennhickey , and @adamnovak ,
Is it fair to compute the total space required to store the variation graph with path annotations as the sum of the sizes of .vg
+ .xg
+ .gbwt
files?
Thanks, Prashant
Just .xg and .gbwt are sufficient to reproduce the entire graph and a linked set of haplotypes.
If you don't have haplotype threads, then the .xg is sufficient.
Both of these are in-memory self indexed data structures that are around the same size on disk.
If you're just interested in the amount of storage of disk then a compressed GFA or .vg file would be equivalent.
You might be interested in https://github.com/vgteam/libbdsg, which pulls together a number of succinct dynamic variation graph models. One such implementation is https://github.com/vgteam/odgi.
On Sun, Sep 8, 2019 at 9:14 PM Prashant Pandey notifications@github.com wrote:
Hi @ekg https://github.com/ekg , @glennhickey https://github.com/glennhickey , and @adamnovak https://github.com/adamnovak ,
Is it fair to compute the total space required to store the variation graph with path annotations as the sum of the sizes of .vg + .xg + .gbwt files?
Thanks, Prashant
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2427?email_source=notifications&email_token=AABDQEO66MQUF6STPHZDSUDQIVMK7A5CNFSM4ITLSQA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6FYLTI#issuecomment-529237453, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEIRZVFTQK5FHZR6HM3QIVMK7ANCNFSM4ITLSQAQ .
Thanks, @ekg
I created a vg representation on the Y chromosome from the 1000Genomes dataset. It contains ~1200 samples. However, when I list paths using the
vg paths -L
command it only lists the referenceY
. It doesn't list all the input samples.How to construct the vg representation with annotated paths?
Thanks, Prashant
Adding @ame9yu