knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

Create Multisample VCF File #136

Closed canholyavkin closed 5 years ago

canholyavkin commented 5 years ago

Is it possible to create a multi-sample vcf file using vcfR package?

knausb commented 5 years ago

Sure.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0.9000 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
# The example data is multi sample.
vcfR_test@gt
#>      FORMAT        NA00001          NA00002          NA00003       
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
#> [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"
# And we can add samples
vcfR_test@gt <- cbind(vcfR_test@gt, vcfR_test@gt[,-1])
vcfR_test@gt
#>      FORMAT        NA00001          NA00002          NA00003       
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
#> [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"    
#>      NA00001          NA00002          NA00003       
#> [1,] "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#> [3,] "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
#> [4,] "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
#> [5,] "0/1:35:4"       "0/2:17:2"       "1/1:40:3"

Created on 2019-05-29 by the reprex package (v0.2.1)

However, I would not suggest it unless you're sure you know what you're doing. VCF files contain only variable positions. If you have two VCF files for two individuals the variants are probably different. This means my example is overly simple. Its going to be your responsibility to make sure you have the same variants in the same rows of each data source. So I think the suggested path would be for you to call variants over multiple samples. That way the variant caller has to figure this out and not you.

Good luck! Brian

canholyavkin commented 5 years ago

Dear Brian,

Thank you for the clarification. As far as I understand, vcfR doesn't provide merge the genotypes based on genomic locations. It only binds the rows according to row order, right?

knausb commented 5 years ago

You are correct. With vcfR you can bind columns (or rows). With a bit of programming you should be able to come up with an inventory of all the genomic locations in all of your VCF files. But if that sounds like work then you're probably better off finding a different path. I suggest you go back to your variant calling step because any post-hoc method will only be able to add missing data while going back to the variant calling step you should get information about how the reference allele was called in the samples that were not variable at a position that other samples were variable at.

Good luck! Brian