knausb / vcfR

Tools to work with variant call format files
247 stars 55 forks source link

Error in .vcf_stats_gz #111

Closed igordot closed 5 years ago

igordot commented 6 years ago

I've been using vcfR for a while and have had no issues. I upgraded from 1.7.0 to 1.8.0 and now I am running into errors using the same code as before. For example:

> muts_vcfr = read.vcfR(in_vcf, verbose = TRUE)
Scanning file to determine attributes.
Error in .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose)) : 
  std::bad_alloc

I am using non-gzipped VCFs. Installing 1.7.0 solves the problem. Is every VCF expected to be gzipped now?

knausb commented 6 years ago

Hi Igordot, I'm using zlib to handle compressed files. It also deals with uncompressed files. So read.vcfR() should handle both .vcf and .vcf.gz. However, I almost always use compressed files so I might be unaware of an issue using uncompressed files. If you compress it does it read in? If you could provide a minimal reproducible example I could see if I can reproduce it and possibly fix it. I've placed some suggestions for reporting an issue here. Thanks!

igordot commented 6 years ago

Sorry about that. Here is an example:

> data(vcfR_test)
> write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
> vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE )
> R.utils::gunzip("vcfR_test.vcf.gz")
> vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )

 *** caught segfault ***
address 0x7611000, cause 'memory not mapped'

Traceback:
 1: .Call(`_vcfR_vcf_stats_gz`, x, nrows, skip, verbose)
 2: .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose))
 3: read.vcfR(file = "vcfR_test.vcf", verbose = FALSE)
knausb commented 6 years ago

Hi Igordot, thanks for the example! I see the following.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data(vcfR_test)
write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE )
rm(vcf)
R.utils::gunzip("vcfR_test.vcf.gz")
vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
vcf
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

Created on 2018-07-16 by the reprex package (v0.2.0).

So I was not successful at reproducing the behaviour you reported. Is it possible that your installation has been corrupted somehow? I recently upgraded to R 3.5.1 and had to install most of my packages again. Perhaps you could post the results of the following command?

sessionInfo()
igordot commented 6 years ago

I am using R 3.3.0. I reinstalled vcfR and Rcpp today when trying to troubleshoot this issue, so those should be functional. Do you know if there are any other packages that could be involved?

igordot commented 6 years ago

I forgot to add sessionInfo(). Here it is:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vcfR_1.8.0

loaded via a namespace (and not attached):
 [1] MASS_7.3-50       magrittr_1.5      Matrix_1.2-14     tools_3.3.0      
 [5] parallel_3.3.0    mgcv_1.8-24       Rcpp_0.12.17      nlme_3.1-131.1   
 [9] ape_5.1           grid_3.3.0        permute_0.9-4     vegan_2.5-2      
[13] pinfsc50_1.1.0    viridisLite_0.3.0 cluster_2.0.7-1   lattice_0.20-35  

I tried a different version of R and it works there for some reason. This is the working setup:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /local/apps/R/3.4.1-cairo/lib64/R/lib/libRblas.so
LAPACK: /local/apps/R/3.4.1-cairo/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vcfR_1.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17      lattice_0.20-35   ape_5.0           memuse_4.0-0     
 [5] viridisLite_0.3.0 permute_0.9-4     R.methodsS3_1.7.1 MASS_7.3-47      
 [9] grid_3.4.1        nlme_3.1-131      magrittr_1.5      vegan_2.5-2      
[13] R.oo_1.21.0       R.utils_2.6.0     Matrix_1.2-14     pinfsc50_1.1.0   
[17] tools_3.4.1       parallel_3.4.1    compiler_3.4.1    cluster_2.0.7-1  
[21] mgcv_1.8-17      

For some reason, slightly different dependencies get loaded.

knausb commented 6 years ago

On my Mac I see the following.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0.9000 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data(vcfR_test)
write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE )
rm(vcf)
R.utils::gunzip("vcfR_test.vcf.gz")
vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
vcf
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

Created on 2018-07-17 by the reprex package (v0.2.0).

Session info ``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R version 3.5.0 (2018-04-23) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> tz America/Los_Angeles #> date 2018-07-17 #> Packages ----------------------------------------------------------------- #> package * version date source #> ape 5.1 2018-04-04 CRAN (R 3.5.0) #> backports 1.1.2 2017-12-13 CRAN (R 3.5.0) #> base * 3.5.0 2018-04-24 local #> cluster 2.0.7-1 2018-04-13 CRAN (R 3.5.0) #> compiler 3.5.0 2018-04-24 local #> datasets * 3.5.0 2018-04-24 local #> devtools 1.13.6 2018-06-27 CRAN (R 3.5.0) #> digest 0.6.15 2018-01-28 CRAN (R 3.5.0) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0) #> graphics * 3.5.0 2018-04-24 local #> grDevices * 3.5.0 2018-04-24 local #> grid 3.5.0 2018-04-24 local #> htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0) #> knitr 1.20 2018-02-20 CRAN (R 3.5.0) #> lattice 0.20-35 2017-03-25 CRAN (R 3.5.0) #> magrittr 1.5 2014-11-22 CRAN (R 3.5.0) #> MASS 7.3-50 2018-04-30 CRAN (R 3.5.0) #> Matrix 1.2-14 2018-04-13 CRAN (R 3.5.0) #> memoise 1.1.0 2017-04-21 CRAN (R 3.5.0) #> memuse 4.0-0 2017-11-10 CRAN (R 3.5.0) #> methods * 3.5.0 2018-04-24 local #> mgcv 1.8-24 2018-06-18 CRAN (R 3.5.0) #> nlme 3.1-137 2018-04-07 CRAN (R 3.5.0) #> parallel 3.5.0 2018-04-24 local #> permute 0.9-4 2016-09-09 CRAN (R 3.5.0) #> pinfsc50 1.1.0 2016-12-02 CRAN (R 3.5.0) #> R.methodsS3 1.7.1 2016-02-16 CRAN (R 3.5.0) #> R.oo 1.22.0 2018-04-22 CRAN (R 3.5.0) #> R.utils 2.6.0 2017-11-05 CRAN (R 3.5.0) #> Rcpp 0.12.17 2018-05-18 CRAN (R 3.5.0) #> rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0) #> rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0) #> stats * 3.5.0 2018-04-24 local #> stringi 1.2.3 2018-06-12 CRAN (R 3.5.0) #> stringr 1.3.1 2018-05-10 CRAN (R 3.5.0) #> tools 3.5.0 2018-04-24 local #> utils * 3.5.0 2018-04-24 local #> vcfR * 1.8.0.9000 2018-07-06 local #> vegan 2.5-2 2018-05-17 CRAN (R 3.5.0) #> viridisLite 0.3.0 2018-02-01 CRAN (R 3.5.0) #> withr 2.1.2 2018-03-15 CRAN (R 3.5.0) #> yaml 2.1.19 2018-05-01 CRAN (R 3.5.0) ```

The thread was updated to include the platform the unexpected behavior was observed on: CentOS. I believe our HPC is running CentOS so I'll give it a try there next.

knausb commented 6 years ago

I would not suspect a dependency caused this because it appears to be on the C++ side of things. You could try the following to try to narrow things down.

> library(vcfR)
> data(vcfR_test)
> write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
> R.utils::gunzip("vcfR_test.vcf.gz")
> .vcf_stats_gz("vcfR_test.vcf", verbose = FALSE)
       meta header_line    variants     columns   last_line 
         18          19           5          12          12 

The function .vcf_stats_gz() appears to be what's throwing the error. Its a hidden function (not intended for the user to call) that is called by read.vcfR(). I would think this example should still cause the behavior you've reported. Thanks!

igordot commented 6 years ago

Here is the output with 1.7.0:

> .vcf_stats_gz("vcfR_test.vcf", verbose = FALSE)
    meta   header variants  columns 
      18       19        5       12 

And with 1.8.0:

> .vcf_stats_gz("vcfR_test.vcf", verbose = FALSE)
       meta header_line    variants     columns   last_line 
         18          19           5          12       17248 

I also checked further down. I ran .read_meta_gz and it worked. I got the error in .read_body_gz, which I would think should be able to read the same files, but somehow doesn't.

knausb commented 6 years ago

On our HPC I see the following.

> library(vcfR)

   *****       ***   vcfR   ***       *****
   This is vcfR 1.8.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****

> data(vcfR_test)
> write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
> 
> system("gunzip vcfR_test.vcf.gz")
> vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
> vcf
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vcfR_1.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17      lattice_0.20-33   ape_4.1           memuse_4.0-0     
 [5] viridisLite_0.3.0 permute_0.9-4     MASS_7.3-45       grid_3.3.1       
 [9] nlme_3.1-128      magrittr_1.5      vegan_2.5-2       Matrix_1.2-6     
[13] pinfsc50_1.1.0    tools_3.3.1       parallel_3.3.1    cluster_2.0.4    
[17] mgcv_1.8-12       tcltk_3.3.1      

I was going to say this is a different version of CentOS than Igordot has reported, but it is CentOS. But I double checked and you are reporting both 6.8 and 7.0. Are you running this on different computers? In a VM or container? Could that have something to do with this?

Thanks for your help on this @igordot, but unless I can get a reproducible example there isn't much I can do here. Thanks!

igordot commented 6 years ago

I am running this on HPC. We have different nodes with different versions of R. R 3.3.0 is on CentOS 6.8 and R 3.4.1 is on CentOS 7.0.

knausb commented 6 years ago

Hmm, is the issue really the version of vcfR? If you're running it on a different machine there may be other differences as well. Is it possible that you have an older version of lib on one machine? R 3.3.0 is currently old. But vcfR has been around since R 3.0.1, so I don't think that would be an issue. I don't think we've been able to eliminate the possibility that this is something specific to your system. I've now tested your example on three platforms and have failed to reproduce the behavior you've reported. And if I can't reproduce it there's not much I can do to troubleshoot it. I'm also concerned that this may be more than one issue. Your traceback seemed to implicate vcf_stats_gz() but now you think its in read_body_gz().

igordot commented 6 years ago

I originally thought that it wasn't a vcfR issue since it seems more low-level. However, I am having the problem in 1.8.0, but not 1.7.0. Thus, something changed in vcfR.

I agree that it would be very difficult for you to test if you can't reproduce the problem. I tried to look for changes between the two versions, but there is a fair amount of them. Can you think of another simple test I could run?

knausb commented 6 years ago

Hi Igordot, you report here https://github.com/knausb/vcfR/issues/111#issuecomment-405619144 that R 3.3.0 with vcfR 1.8.0 reproduces your error but you also report that R 3.4.1 with vcfR 1.8.0 does not. I've reported a failure to reproduce your error on several platforms including R 3.5.1 (current version) and vcfR 1.8.0 (current version). I think this is why you suspected a dependency? And perhaps its time to revisit that. But it would probably be a C++ library as opposed to an R package. Because I've failed to reproduce your reported behaviour I am unconvinced that this is an issue with vcfR. Perhaps we should address this as an issue that is specific to your system?

I don't think I have enough details to help you. You've stated that you're working on a HPC with at least two computers. Ideally I'd recommend all of your systems use current versions of software. However, more than a few folks have reported issues with upgrading to R 3.5.*. And if you lack admin rights you may have no say in what versions are used. That said, you could install whatever version of R you like in your user directory. When you load the library you could use something like the following.

library("vcfR", lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.5/")

This specifies where the library is located. This should help if you're using a queueing system such as SGE, PBS, or Slurm.

Do those suggestions help?

igordot commented 6 years ago

Thank you for the suggestions. It's definitely a good idea to upgrade, but I have to stick with R 3.3.0 due to a complicated set of reasons.

I agree it sounds like a C++ library issue or some other system setting. However, I am still puzzled by the different behavior in vcfR 1.7.0 and 1.8.0. Wouldn't both versions be relying on the same dependencies?

knausb commented 6 years ago

Yeah, I can see how you'd be puzzled. I'm not saying it makes a lot of sense. Note that you do report using CentOS 6.8 and 7. These difference versions may have different versions of some of the libraries. And its also possible that someone 'customized' something, perhaps for reasons similar to how you're tied to R 3.3.0. I would consider that poor form on a multiuser system. But these things do happen.

igordot commented 6 years ago

Sorry. I should clarify.

I tested vcfR with different versions of R and CentOS to show that vcfR 1.8.0 works for me sometimes (on R 3.4.1 and CentOS 7).

With R 3.3.0 and CentOS 6.8, I used both vcfR 1.7.0 and 1.8.0. I only changed the version of vcfR. All other libraries and system settings stayed the same. I get an error with 1.8.0, but not 1.7.0. I switched between them several times (in case some dependencies were changed) and the behavior does not change.

knausb commented 6 years ago

Thanks for the clarification! That is troubling that 1.8.0 failed on the same machine where 1.7.0 worked. But if we can't reproduce that behavior on a second machine then it indicates that this is specific to that machine. And if I can't recreate the reported behavior on a machine local to me then there's no way for me to troubleshoot it. Can you attain your project goals using 1.7.0 or just use the machine where 1.8.0 works?

igordot commented 6 years ago

I am fine with 1.7.0. I was just hoping to resolve this problem in case it comes up in other contexts.

knausb commented 6 years ago

At the present it does not appear that we've been able to demonstrate that this issue affects any other computer besides the one you're using. If anyone else experiences this behaviour they will be able to find this thread and perhaps help us reproduce it. But for now, I think we should close this issue.

igordot commented 6 years ago

I am okay with closing it. Hopefully someone else is able to replicate it someday. Or at least confirm it.

You can may want to add non-gzipped VCF import to the tests just in case.

shenghuanjie commented 5 years ago

I have the same error using version 1.8.0 on R 3.5.1. The problem only occurs when I read in large files. (>3G)

when I run it using snakemake in my workflow. I got the error:

Scanning file to determine attributes.
Error in .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose)) :

  std::bad_alloc
Calls: read.vcfR -> .vcf_stats_gz
Execution halted

However, when I run it in R, it returns:

Scanning file to determine attributes.

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose))
 2: read.vcfR(file = "/global/home/groups/pl1data/pl1_sudmant/shenghuanjie/temp/phg000520
.v2.GTEx_MidPoint_Imputation.genotype-calls-vcf.c1/GTEx_Analysis_20150112_OMNI_2.5M_5M_45
0Indiv_chr1to22_genot_imput_info04_maf01_HWEp1E6_ConstrVarIDs.vcf.gz")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Here is my sessionInfo() in R:

R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /global/home/users/shenghuanjie/miniconda3/lib/R/lib/libRblas.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] vcfR_1.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18      lattice_0.20-35   ape_5.1           viridisLite_0.3.0
 [5] permute_0.9-5     MASS_7.3-50       grid_3.5.1        nlme_3.1-137
 [9] magrittr_1.5      vegan_2.5-4       Matrix_1.2-14     pinfsc50_1.1.0
[13] tools_3.5.1       parallel_3.5.1    compiler_3.5.1    cluster_2.0.7-1
[17] mgcv_1.8-24
knausb commented 5 years ago

Hi Huanjie, can you please provide a minimal reproducible example? I've put some suggestions on how to do this here. Thank you!

shenghuanjie commented 5 years ago

Thank you . Problem solved by devtools::install_github(repo="knausb/vcfR") following this issue. Now I'm getting new errors:

Error in extract.gt(x) : ID column contains non-unique names
Calls: vcfR2genlight -> extract.gt
Execution halted

It's probably something wrong in my file. Apparently, the original error only occurs when I used the entire dataset but not a small sample with the first 1000 lines. Unfortunately, I'm not allowed to share any data because they are sensitive data subjected to Protection Level 1 (PL1). I will try to identify the issue.

knausb commented 5 years ago

Agreed! Please do not share your data if it is sensitive. In section 1.6.1 part 3 of the VCF specification it states that the ID column should contain unique identifiers. From the error message you've reported I'd guess this is your issue. You can subset your vcfR object using the square brackets.

data("vcfR_test")
vcfR_test[1:2,]

This should help you identify the offending line(s). Good luck!

ruqianl commented 5 years ago

I'm having the same error. The strange thing is that all my vcf files were produced using the same command. There is only one file that triggers the error. Still quite confused.

Any suggestions?

RL

knausb commented 5 years ago

Hi @Rachael-rq ,

This issue does not appear to have identified an actionable item. So adding to it does not appear as if it will help you. If you'd like help I suggest you open a new issue and include a minimal reproducible example so that there may be something for me to address. I've put some suggestions on how to do this here. Thank you!

argymeg commented 5 years ago

Hi,

Thought I should leave here for the record that I was also having this problem with just one of ~200 files, still can't figure out what made it different from every other one, but in any case everything works fine after installing the development version.