Closed canholyavkin closed 5 years ago
Sure.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.8.0.9000
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
data("vcfR_test")
# The example data is multi sample.
vcfR_test@gt
#> FORMAT NA00001 NA00002 NA00003
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3" "0/0:41:3"
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2" "2/2:35:4"
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"
#> [5,] "GT:GQ:DP" "0/1:35:4" "0/2:17:2" "1/1:40:3"
# And we can add samples
vcfR_test@gt <- cbind(vcfR_test@gt, vcfR_test@gt[,-1])
vcfR_test@gt
#> FORMAT NA00001 NA00002 NA00003
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3" "0/0:41:3"
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2" "2/2:35:4"
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"
#> [5,] "GT:GQ:DP" "0/1:35:4" "0/2:17:2" "1/1:40:3"
#> NA00001 NA00002 NA00003
#> [1,] "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "0|0:49:3:58,50" "0|1:3:5:65,3" "0/0:41:3"
#> [3,] "1|2:21:6:23,27" "2|1:2:0:18,2" "2/2:35:4"
#> [4,] "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"
#> [5,] "0/1:35:4" "0/2:17:2" "1/1:40:3"
Created on 2019-05-29 by the reprex package (v0.2.1)
However, I would not suggest it unless you're sure you know what you're doing. VCF files contain only variable positions. If you have two VCF files for two individuals the variants are probably different. This means my example is overly simple. Its going to be your responsibility to make sure you have the same variants in the same rows of each data source. So I think the suggested path would be for you to call variants over multiple samples. That way the variant caller has to figure this out and not you.
Good luck! Brian
Dear Brian,
Thank you for the clarification. As far as I understand, vcfR doesn't provide merge the genotypes based on genomic locations. It only binds the rows according to row order, right?
You are correct. With vcfR you can bind columns (or rows). With a bit of programming you should be able to come up with an inventory of all the genomic locations in all of your VCF files. But if that sounds like work then you're probably better off finding a different path. I suggest you go back to your variant calling step because any post-hoc method will only be able to add missing data while going back to the variant calling step you should get information about how the reference allele was called in the samples that were not variable at a position that other samples were variable at.
Good luck! Brian
Is it possible to create a multi-sample vcf file using vcfR package?