knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

read.vcfR randomly dropping genotypes #166

Closed akijarl closed 3 years ago

akijarl commented 3 years ago

Hello,

While running a pipeline of replicate simulation data, I noticed that random calculated allele frequencies were showing up as NA in different runs.

I traced this back to select genotype entries getting dropped (showing up as blank entries "" instead of "0|0" or "0|1", etc.) when reading vcf files into R with read.vcfR().

I've confirmed that the original vcf files are intact, and I created a smaller test data set ("test.vcf") and spot-checked which genotypes were missing and it seems to be random:

#First trial
> vcf1 <- read.vcfR("test.vcf")
Scanning file to determine attributes.
...
Processed variant: 9423
All variants processed

> which(vcf1@gt[,-1]=="")
[1] 228810 229751 229780 234489

#Second trial
> vcf1 <- read.vcfR("test.vcf")
Scanning file to determine attributes.
...
Processed variant: 9423
All variants processed

> which(vcf1@gt[,-1]=="")
[1] 226878 228705 230215 232006 232873 232922

#Third trial
> vcf1 <- read.vcfR("test.vcf")
Scanning file to determine attributes.
...
Processed variant: 9423
All variants processed

> which(vcf1@gt[,-1]=="")
[1] 228555 229931 232006 232856

#Fourth trial
> vcf1 <- read.vcfR("test.vcf")
Scanning file to determine attributes.
...
Processed variant: 9423
All variants processed

> which(vcf1@gt[,-1]=="")
[1] 230321 232006

#Fifth trial
> vcf1 <- read.vcfR("test.vcf")
Scanning file to determine attributes.
...
Processed variant: 9423
All variants processed

> which(vcf1@gt[,-1]=="")
[1] 227752 229518 232182 232910 233457

Across the five trials above of just reading in the same file again and again I get five different genotypes showing up blank. Any insights on what might be causing this?

R session info below

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] vcfR_1.10.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      lattice_0.20-41   ape_5.3          
 [4] memuse_4.1-0      viridisLite_0.3.0 permute_0.9-5    
 [7] MASS_7.3-51.6     grid_3.6.2        nlme_3.1-147     
[10] magrittr_1.5      vegan_2.5-6       Matrix_1.2-18    
[13] splines_3.6.2     pinfsc50_1.1.0    tools_3.6.2      
[16] parallel_3.6.2    compiler_3.6.2    cluster_2.1.0    
[19] mgcv_1.8-31
akijarl commented 3 years ago

It seems like the a shift in the formatting of the original vcf files (the REF columns of select genotypes were not appropriate characters) was the source of the inconsistent number of blank genotypes. Not sure why the pattern is inconsistent, but it doesn't seem like the original problem lies with read.vcfR()

knausb commented 3 years ago

If the REF column is missing then it sounds like this is not a valid VCF file. Thank you for taking the time the time to resolve and close this issue!