Closed igordot closed 5 years ago
Hi Igordot, I'm using zlib to handle compressed files. It also deals with uncompressed files. So read.vcfR()
should handle both .vcf and .vcf.gz. However, I almost always use compressed files so I might be unaware of an issue using uncompressed files. If you compress it does it read in? If you could provide a minimal reproducible example I could see if I can reproduce it and possibly fix it. I've placed some suggestions for reporting an issue here. Thanks!
Sorry about that. Here is an example:
> data(vcfR_test)
> write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
> vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE )
> R.utils::gunzip("vcfR_test.vcf.gz")
> vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
*** caught segfault ***
address 0x7611000, cause 'memory not mapped'
Traceback:
1: .Call(`_vcfR_vcf_stats_gz`, x, nrows, skip, verbose)
2: .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose))
3: read.vcfR(file = "vcfR_test.vcf", verbose = FALSE)
Hi Igordot, thanks for the example! I see the following.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.8.0
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
data(vcfR_test)
write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE )
rm(vcf)
R.utils::gunzip("vcfR_test.vcf.gz")
vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
vcf
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> ***** ***** *****
Created on 2018-07-16 by the reprex package (v0.2.0).
So I was not successful at reproducing the behaviour you reported. Is it possible that your installation has been corrupted somehow? I recently upgraded to R 3.5.1 and had to install most of my packages again. Perhaps you could post the results of the following command?
sessionInfo()
I am using R 3.3.0. I reinstalled vcfR
and Rcpp
today when trying to troubleshoot this issue, so those should be functional. Do you know if there are any other packages that could be involved?
I forgot to add sessionInfo()
. Here it is:
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vcfR_1.8.0
loaded via a namespace (and not attached):
[1] MASS_7.3-50 magrittr_1.5 Matrix_1.2-14 tools_3.3.0
[5] parallel_3.3.0 mgcv_1.8-24 Rcpp_0.12.17 nlme_3.1-131.1
[9] ape_5.1 grid_3.3.0 permute_0.9-4 vegan_2.5-2
[13] pinfsc50_1.1.0 viridisLite_0.3.0 cluster_2.0.7-1 lattice_0.20-35
I tried a different version of R and it works there for some reason. This is the working setup:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /local/apps/R/3.4.1-cairo/lib64/R/lib/libRblas.so
LAPACK: /local/apps/R/3.4.1-cairo/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vcfR_1.8.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 lattice_0.20-35 ape_5.0 memuse_4.0-0
[5] viridisLite_0.3.0 permute_0.9-4 R.methodsS3_1.7.1 MASS_7.3-47
[9] grid_3.4.1 nlme_3.1-131 magrittr_1.5 vegan_2.5-2
[13] R.oo_1.21.0 R.utils_2.6.0 Matrix_1.2-14 pinfsc50_1.1.0
[17] tools_3.4.1 parallel_3.4.1 compiler_3.4.1 cluster_2.0.7-1
[21] mgcv_1.8-17
For some reason, slightly different dependencies get loaded.
On my Mac I see the following.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.8.0.9000
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
data(vcfR_test)
write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE )
rm(vcf)
R.utils::gunzip("vcfR_test.vcf.gz")
vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
vcf
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> ***** ***** *****
Created on 2018-07-17 by the reprex package (v0.2.0).
The thread was updated to include the platform the unexpected behavior was observed on: CentOS. I believe our HPC is running CentOS so I'll give it a try there next.
I would not suspect a dependency caused this because it appears to be on the C++ side of things. You could try the following to try to narrow things down.
> library(vcfR)
> data(vcfR_test)
> write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
> R.utils::gunzip("vcfR_test.vcf.gz")
> .vcf_stats_gz("vcfR_test.vcf", verbose = FALSE)
meta header_line variants columns last_line
18 19 5 12 12
The function .vcf_stats_gz()
appears to be what's throwing the error. Its a hidden function (not intended for the user to call) that is called by read.vcfR()
. I would think this example should still cause the behavior you've reported. Thanks!
Here is the output with 1.7.0:
> .vcf_stats_gz("vcfR_test.vcf", verbose = FALSE)
meta header variants columns
18 19 5 12
And with 1.8.0:
> .vcf_stats_gz("vcfR_test.vcf", verbose = FALSE)
meta header_line variants columns last_line
18 19 5 12 17248
I also checked further down. I ran .read_meta_gz
and it worked. I got the error in .read_body_gz
, which I would think should be able to read the same files, but somehow doesn't.
On our HPC I see the following.
> library(vcfR)
***** *** vcfR *** *****
This is vcfR 1.8.0
browseVignettes('vcfR') # Documentation
citation('vcfR') # Citation
***** ***** ***** *****
> data(vcfR_test)
> write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" )
>
> system("gunzip vcfR_test.vcf.gz")
> vcf <- read.vcfR( file = "vcfR_test.vcf", verbose = FALSE )
> vcf
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
***** ***** *****
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vcfR_1.8.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 lattice_0.20-33 ape_4.1 memuse_4.0-0
[5] viridisLite_0.3.0 permute_0.9-4 MASS_7.3-45 grid_3.3.1
[9] nlme_3.1-128 magrittr_1.5 vegan_2.5-2 Matrix_1.2-6
[13] pinfsc50_1.1.0 tools_3.3.1 parallel_3.3.1 cluster_2.0.4
[17] mgcv_1.8-12 tcltk_3.3.1
I was going to say this is a different version of CentOS than Igordot has reported, but it is CentOS. But I double checked and you are reporting both 6.8 and 7.0. Are you running this on different computers? In a VM or container? Could that have something to do with this?
Thanks for your help on this @igordot, but unless I can get a reproducible example there isn't much I can do here. Thanks!
I am running this on HPC. We have different nodes with different versions of R. R 3.3.0 is on CentOS 6.8 and R 3.4.1 is on CentOS 7.0.
Hmm, is the issue really the version of vcfR? If you're running it on a different machine there may be other differences as well. Is it possible that you have an older version of lib on one machine? R 3.3.0 is currently old. But vcfR has been around since R 3.0.1, so I don't think that would be an issue. I don't think we've been able to eliminate the possibility that this is something specific to your system. I've now tested your example on three platforms and have failed to reproduce the behavior you've reported. And if I can't reproduce it there's not much I can do to troubleshoot it. I'm also concerned that this may be more than one issue. Your traceback seemed to implicate vcf_stats_gz()
but now you think its in read_body_gz()
.
I originally thought that it wasn't a vcfR issue since it seems more low-level. However, I am having the problem in 1.8.0, but not 1.7.0. Thus, something changed in vcfR.
I agree that it would be very difficult for you to test if you can't reproduce the problem. I tried to look for changes between the two versions, but there is a fair amount of them. Can you think of another simple test I could run?
Hi Igordot, you report here https://github.com/knausb/vcfR/issues/111#issuecomment-405619144 that R 3.3.0 with vcfR 1.8.0 reproduces your error but you also report that R 3.4.1 with vcfR 1.8.0 does not. I've reported a failure to reproduce your error on several platforms including R 3.5.1 (current version) and vcfR 1.8.0 (current version). I think this is why you suspected a dependency? And perhaps its time to revisit that. But it would probably be a C++ library as opposed to an R package. Because I've failed to reproduce your reported behaviour I am unconvinced that this is an issue with vcfR. Perhaps we should address this as an issue that is specific to your system?
I don't think I have enough details to help you. You've stated that you're working on a HPC with at least two computers. Ideally I'd recommend all of your systems use current versions of software. However, more than a few folks have reported issues with upgrading to R 3.5.*. And if you lack admin rights you may have no say in what versions are used. That said, you could install whatever version of R you like in your user directory. When you load the library you could use something like the following.
library("vcfR", lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.5/")
This specifies where the library is located. This should help if you're using a queueing system such as SGE, PBS, or Slurm.
Do those suggestions help?
Thank you for the suggestions. It's definitely a good idea to upgrade, but I have to stick with R 3.3.0 due to a complicated set of reasons.
I agree it sounds like a C++ library issue or some other system setting. However, I am still puzzled by the different behavior in vcfR 1.7.0 and 1.8.0. Wouldn't both versions be relying on the same dependencies?
Yeah, I can see how you'd be puzzled. I'm not saying it makes a lot of sense. Note that you do report using CentOS 6.8 and 7. These difference versions may have different versions of some of the libraries. And its also possible that someone 'customized' something, perhaps for reasons similar to how you're tied to R 3.3.0. I would consider that poor form on a multiuser system. But these things do happen.
Sorry. I should clarify.
I tested vcfR with different versions of R and CentOS to show that vcfR 1.8.0 works for me sometimes (on R 3.4.1 and CentOS 7).
With R 3.3.0 and CentOS 6.8, I used both vcfR 1.7.0 and 1.8.0. I only changed the version of vcfR. All other libraries and system settings stayed the same. I get an error with 1.8.0, but not 1.7.0. I switched between them several times (in case some dependencies were changed) and the behavior does not change.
Thanks for the clarification! That is troubling that 1.8.0 failed on the same machine where 1.7.0 worked. But if we can't reproduce that behavior on a second machine then it indicates that this is specific to that machine. And if I can't recreate the reported behavior on a machine local to me then there's no way for me to troubleshoot it. Can you attain your project goals using 1.7.0 or just use the machine where 1.8.0 works?
I am fine with 1.7.0. I was just hoping to resolve this problem in case it comes up in other contexts.
At the present it does not appear that we've been able to demonstrate that this issue affects any other computer besides the one you're using. If anyone else experiences this behaviour they will be able to find this thread and perhaps help us reproduce it. But for now, I think we should close this issue.
I am okay with closing it. Hopefully someone else is able to replicate it someday. Or at least confirm it.
You can may want to add non-gzipped VCF import to the tests just in case.
I have the same error using version 1.8.0 on R 3.5.1. The problem only occurs when I read in large files. (>3G)
when I run it using snakemake in my workflow. I got the error:
Scanning file to determine attributes.
Error in .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose)) :
std::bad_alloc
Calls: read.vcfR -> .vcf_stats_gz
Execution halted
However, when I run it in R, it returns:
Scanning file to determine attributes.
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: .vcf_stats_gz(file, nrows = nrows, skip = skip, verbose = as.integer(verbose))
2: read.vcfR(file = "/global/home/groups/pl1data/pl1_sudmant/shenghuanjie/temp/phg000520
.v2.GTEx_MidPoint_Imputation.genotype-calls-vcf.c1/GTEx_Analysis_20150112_OMNI_2.5M_5M_45
0Indiv_chr1to22_genot_imput_info04_maf01_HWEp1E6_ConstrVarIDs.vcf.gz")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Here is my sessionInfo() in R:
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)
Matrix products: default
BLAS/LAPACK: /global/home/users/shenghuanjie/miniconda3/lib/R/lib/libRblas.so
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vcfR_1.8.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 lattice_0.20-35 ape_5.1 viridisLite_0.3.0
[5] permute_0.9-5 MASS_7.3-50 grid_3.5.1 nlme_3.1-137
[9] magrittr_1.5 vegan_2.5-4 Matrix_1.2-14 pinfsc50_1.1.0
[13] tools_3.5.1 parallel_3.5.1 compiler_3.5.1 cluster_2.0.7-1
[17] mgcv_1.8-24
Hi Huanjie, can you please provide a minimal reproducible example? I've put some suggestions on how to do this here. Thank you!
Thank you .
Problem solved by devtools::install_github(repo="knausb/vcfR")
following this issue.
Now I'm getting new errors:
Error in extract.gt(x) : ID column contains non-unique names
Calls: vcfR2genlight -> extract.gt
Execution halted
It's probably something wrong in my file. Apparently, the original error only occurs when I used the entire dataset but not a small sample with the first 1000 lines. Unfortunately, I'm not allowed to share any data because they are sensitive data subjected to Protection Level 1 (PL1). I will try to identify the issue.
Agreed! Please do not share your data if it is sensitive. In section 1.6.1 part 3 of the VCF specification it states that the ID column should contain unique identifiers. From the error message you've reported I'd guess this is your issue. You can subset your vcfR object using the square brackets.
data("vcfR_test")
vcfR_test[1:2,]
This should help you identify the offending line(s). Good luck!
I'm having the same error. The strange thing is that all my vcf files were produced using the same command. There is only one file that triggers the error. Still quite confused.
Any suggestions?
RL
Hi @Rachael-rq ,
This issue does not appear to have identified an actionable item. So adding to it does not appear as if it will help you. If you'd like help I suggest you open a new issue and include a minimal reproducible example so that there may be something for me to address. I've put some suggestions on how to do this here. Thank you!
Hi,
Thought I should leave here for the record that I was also having this problem with just one of ~200 files, still can't figure out what made it different from every other one, but in any case everything works fine after installing the development version.
I've been using vcfR for a while and have had no issues. I upgraded from 1.7.0 to 1.8.0 and now I am running into errors using the same code as before. For example:
I am using non-gzipped VCFs. Installing 1.7.0 solves the problem. Is every VCF expected to be gzipped now?