knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

read.vcfR aborts RStudio/catches segfault in R #137

Closed SMartinOSU closed 5 years ago

SMartinOSU commented 5 years ago

Hello, I'm trying to use vcfR to read in a large number of vcf files generated by SLiM 3 as part of an ABC model. However, I'm randomly getting some .vcf files that crash vcfR, and when I look at them I don't see any reason for these files to be causing a problem. It's also a pretty rare occurrence (~4-5 times in 50,000 files). I tracked down one of the problematic vcf files, and I was able to reproduce it crashing in RStudio with: test<-vcfR:read.vcfR("./slim_full_30054.vcf")

This is the start of the problem vcf, sessionInfo for the windows crash, and I attached the whole vcf file as a text file:

##fileformat=VCFv4.2
##fileDate=20190529
##source=SLiM
##INFO=<ID=MID,Number=.,Type=Integer,Description="Mutation ID in SLiM">
##INFO=<ID=S,Number=.,Type=Float,Description="Selection Coefficient">
##INFO=<ID=DOM,Number=.,Type=Float,Description="Dominance">
##INFO=<ID=PO,Number=.,Type=Integer,Description="Population of Origin">
##INFO=<ID=GO,Number=.,Type=Integer,Description="Generation of Origin">
##INFO=<ID=MT,Number=.,Type=Integer,Description="Mutation Type">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele Count">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  i0  i1  i2  i3  i4  i5 

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] raster_2.8-19  sp_1.3-1       vcfR_1.8.0     adegenet_2.1.1 ade4_1.7-13   

loaded via a namespace (and not attached):
 [1] gtools_3.8.1      tidyselect_0.2.5  reshape2_1.4.3    purrr_0.3.2       sf_0.7-3          splines_3.5.2     lattice_0.20-38   expm_0.999-3     
 [9] colorspace_1.4-1  viridisLite_0.3.0 htmltools_0.3.6   yaml_2.2.0        mgcv_1.8-26       rlang_0.3.4       e1071_1.7-0.1     later_0.8.0      
[17] pillar_1.3.1      glue_1.3.1        DBI_1.0.0         plyr_1.8.4        stringr_1.4.0     munsell_0.5.0     gtable_0.3.0      codetools_0.2-15 
[25] coda_0.19-2       permute_0.9-4     httpuv_1.5.1      parallel_3.5.2    class_7.3-14      spdep_1.0-2       pinfsc50_1.1.0    Rcpp_1.0.1       
[33] xtable_1.8-3      promises_1.0.1    scales_1.0.0      classInt_0.3-1    gdata_2.18.0      vegan_2.5-4       mime_0.6          deldir_0.1-16    
[41] ggplot2_3.1.1     digest_0.6.18     stringi_1.4.3     gmodels_2.18.1    dplyr_0.8.0.1     shiny_1.3.0       grid_3.5.2        tools_3.5.2      
[49] LearnBayes_2.15.1 magrittr_1.5      lazyeval_0.2.2    tibble_2.1.1      cluster_2.0.7-1   crayon_1.3.4      ape_5.2           seqinr_3.4-5     
[57] pkgconfig_2.0.2   MASS_7.3-51.1     Matrix_1.2-15     spData_0.3.0      assertthat_0.2.1  rstudioapi_0.9.0  R6_2.4.0          boot_1.3-20      
[65] units_0.6-2       igraph_1.2.4      nlme_3.1-137      compiler_3.5.2

slim_full_30054.txt I'm not sure if this is an issue with vcfR or with how SLiM writes out its files, and would greatly appreciate any insight. Thank you for your time, and vcfR has been a really useful package for a lot of my work! -Scott

knausb commented 5 years ago

Hi @SMartinOSU,

Thank you for the detailed report! I'm interpreting this as a "rare" issue as opposed to a "random" issue. If it were random you might error on a specific file but when you go back to that file you actually can open it. With a rare issue you might error while opening one of your 50,000 files but when you trace down that file you consistently and reproducibly get your error.

Here's what I see on my machine.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
# I'm using reprex::reprex() here.
# It works from a tmp directory so I have to tell it where the file is.
test <- read.vcfR("~/Desktop/sandbox/SMartinOSU/slim_full_30054.vcf")
#> Scanning file to determine attributes.
#> Error in read.vcfR("~/Desktop/sandbox/SMartinOSU/slim_full_30054.vcf"): Your file appears to have 89 header elements and 9724 columns in the body.
#>  This should never happen!
# test
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.5 LTS
#> 
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] vcfR_1.8.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1        knitr_1.21        cluster_2.0.7-1  
#>  [4] magrittr_1.5      MASS_7.3-51.1     viridisLite_0.3.0
#>  [7] ape_5.2           lattice_0.20-38   stringr_1.4.0    
#> [10] highr_0.7         tools_3.5.2       parallel_3.5.2   
#> [13] grid_3.5.2        nlme_3.1-137      mgcv_1.8-26      
#> [16] xfun_0.4          vegan_2.5-3       htmltools_0.3.6  
#> [19] yaml_2.2.0        digest_0.6.18     permute_0.9-4    
#> [22] Matrix_1.2-15     evaluate_0.12     rmarkdown_1.11   
#> [25] pinfsc50_1.1.0    stringi_1.4.3     compiler_3.5.2

Created on 2019-05-31 by the reprex package (v0.2.1)

This is not a segfault as reported. But it is an error. Is this different from what @SMartinOSU sees on their machine?

knausb commented 5 years ago

I've rerun the example with a different version of vcfR and it appears the problem has been solved.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0.9000 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
# I'm using reprex::reprex() here.
# It works from a tmp directory so I have to tell it where the file is.
test <- read.vcfR("~/Desktop/sandbox/SMartinOSU/slim_full_30054.vcf")
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 618
#>   column count: 89
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 618
#>   Character matrix gt cols: 89
#>   skip: 0
#>   nrows: 618
#>   row_num: 0
#> 
Processed variant: 618
#> All variants processed
test
#> ***** Object of Class vcfR *****
#> 80 samples
#> 1 CHROMs
#> 618 variants
#> Object size: 0.5 Mb
#> 0 percent missing data
#> *****        *****         *****
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.5 LTS
#> 
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] vcfR_1.8.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1        knitr_1.21        cluster_2.0.7-1  
#>  [4] magrittr_1.5      MASS_7.3-51.1     viridisLite_0.3.0
#>  [7] ape_5.2           lattice_0.20-38   stringr_1.4.0    
#> [10] highr_0.7         tools_3.5.2       parallel_3.5.2   
#> [13] grid_3.5.2        nlme_3.1-137      mgcv_1.8-26      
#> [16] xfun_0.4          vegan_2.5-3       htmltools_0.3.6  
#> [19] yaml_2.2.0        digest_0.6.18     permute_0.9-4    
#> [22] Matrix_1.2-15     evaluate_0.12     rmarkdown_1.11   
#> [25] pinfsc50_1.1.0    stringi_1.4.3     compiler_3.5.2   
#> [28] memuse_4.0-0

Created on 2019-05-31 by the reprex package (v0.2.1)

vcfR_1.8.0 is the current version on CRAN and what is reported as used by the post. vcfR_1.8.0.9000 is the version on GitHub and is more current. Could you try to install this version and see if it resolves your issue? You can install the GitHub version as follows (your system will need a compiler).

devtools::install_github(repo="knausb/vcfR")

Thanks! Brian

SMartinOSU commented 5 years ago

I installed version 1.8.0.9 from github, and that seems to have solved the problem. I queued up my bash scripts to run through all the other vcf files, and should know by morning if they all were read successfully. Once I confirm those files go through, I'll plan on closing this issue. Thanks for the quick reply and help,

-Scott

Edit: Looks like everything is working great now, thanks for all the help!