knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

Can't use a logical vector to subset a vcf when using write.vcf() #134

Closed jsstanley closed 5 years ago

jsstanley commented 5 years ago

Problems with subsetting using mask argument in write.vcf()

When using the write.vcf() function with a 'mask' argument to save a filtered version of a VCF file, I am getting the following error message:

Warning message:
In if (mask == FALSE) { :
  the condition has length > 1 and only the first element will be used

I am trying to create a filtered VCF file with only variants that have PASS in the 'FILTER' column of the 'fix' part of the VCF file. The documentation for the write.vcf() function states I can use the 'mask' argument with a "logical vector indicating rows to use". Unfortunately, it seems that the function is only expecting me to provide a single Boolean value. I suspect this issue may result from the use of if (mask == TRUE) rather than ifelse (mask == TRUE), although I may be mistaken.

I could use the tidy format to filter out rows using vcfR2tidy() but I need a VCF file as an output for downstream analyses and unfortunately I can't locate an equivalent function to convert a tidy-style object back to a VCF object.

Here's a minimal reproducible example to illustrate the issue:

Load the vcfR package

if (require("vcfR") == FALSE) {
  install.packages("vcfR")
  library("vcfR")
} else {
  library("vcfR")
}

Create example data

vcfTest <- vcf
# subset the example vcf with random sample cols I came up with
vcfTest@gt <- vcfTest@gt[, c(1, 3:8)]
# create a vector with the length of the col 'FILTER'
replaceLength <- length(vcfTest@fix[, "FILTER"])
# create a vector with the variables 'PASS' and 'FAIL'.
randomVals <- c("PASS", "FAIL")
# replace all the values in the FILTER col randomly with either 'PASS' or 'FAIL'
vcfTest@fix[, "FILTER"] <- sample(x = randomVals, size = replaceLength, replace = TRUE)
# write the new sample vcf file to the current working directory
write.vcf(vcfTest, file = "test.vcf")

Filter the VCF file

filtered <- read.vcfR("test.vcf")
PASSfilter <- filtered@fix[, "FILTER"] %in% "PASS"
write.vcf(filtered, mask = PASSfilter, file = "filtered.vcf")

Error message that is returned in the console following execution of above code

Warning messages:
1: In if (mask == FALSE) { :
  the condition has length > 1 and only the first element will be used
2: In if (mask == TRUE) { :
  the condition has length > 1 and only the first element will be used

sessionInfo() output

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.0.1   purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
 [7] tibble_2.1.1    ggplot2_3.1.1   tidyverse_1.2.1 vcfR_1.8.0     

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5  memuse_4.0-0      splines_3.6.0     haven_2.1.0       lattice_0.20-38  
 [6] colorspace_1.4-1  generics_0.0.2    viridisLite_0.3.0 yaml_2.2.0        mgcv_1.8-28      
[11] rlang_0.3.4       pillar_1.3.1      glue_1.3.1        withr_2.1.2       modelr_0.1.4     
[16] readxl_1.3.1      plyr_1.8.4        munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
[21] rvest_0.3.3       permute_0.9-5     parallel_3.6.0    broom_0.5.2       Rcpp_1.0.1       
[26] pinfsc50_1.1.0    backports_1.1.4   scales_1.0.0      vegan_2.5-4       jsonlite_1.6     
[31] hms_0.4.2         stringi_1.4.3     grid_3.6.0        cli_1.1.0         tools_3.6.0      
[36] magrittr_1.5      lazyeval_0.2.2    cluster_2.0.8     crayon_1.3.4      ape_5.3          
[41] pkgconfig_2.0.2   MASS_7.3-51.4     Matrix_1.2-17     xml2_1.2.0        lubridate_1.7.4  
[46] assertthat_0.2.1  httr_1.4.0        rstudioapi_0.10   R6_2.4.0          nlme_3.1-139     
[51] compiler_3.6.0   
knausb commented 5 years ago

Hi Jay,

Thanks for taking the time to create an example! In order to make it a reproducible example we need a data set we can share. And I think I can simplify things a bit.

library("vcfR")
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0.9000 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
# Load example data.
data(vcfR_test)
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

# Convert the FILTER column to PASS/FAIL.
vcfR_test@fix[2:3, "FILTER"] <- "FAIL"
# Validate.
getFILTER(vcfR_test)
#> [1] "PASS" "FAIL" "FAIL" "PASS" "PASS"

PASSfilter <- vcfR_test@fix[, "FILTER"] %in% "PASS"
write.vcf(vcfR_test, mask = PASSfilter, file = "filtered.vcf.gz")
#> Warning in if (mask == FALSE) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (mask == TRUE) {: the condition has length > 1 and only the
#> first element will be used

# A fix
vcfR_test <- vcfR_test[PASSfilter, ]
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 3 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****
write.vcf(vcfR_test, file = "filtered.vcf.gz")

Created on 2019-05-14 by the reprex package (v0.2.1)

Our documentation does say a mask can be a "logical vector" but in the code we test if it is set to TRUE or FALSE as if it is a vector of length one. I can see now that this may be confusing.

Use of a mask sounded like a good idea to me early on in the development of this project. In practice, I never use it. And I suspect if others had used it this would have been reported by now. My vote is to deprecate the argument "mask" so we can get rid of it. Any other opinions out there?

jsstanley commented 5 years ago

Hi Brian,

Thanks for responding so quickly! The alternative code you provided seems to work well for me.

I don't have much experience working with the package but if the 'mask' argument is redundant and there are better ways of filtering, then I think it's probably worth deprecating.

Thanks again for your help!