lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser
Other
29 stars 17 forks source link

import() leaves open file connections #14

Closed LTLA closed 5 years ago

LTLA commented 5 years ago

A simple example with the latest BioC-devel version:

library(rtracklayer)
track <- import(system.file("tests", "v1.gff", package = "rtracklayer")) # from ?import
showConnections()
##   description                                                                              
## 3 "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/rtracklayer/tests/v1.gff"
## 4 "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/rtracklayer/tests/v1.gff"
##   class  mode text   isopen   can read can write
## 3 "file" "r"  "text" "opened" "yes"    "no"     
## 4 "file" "r"  "text" "opened" "yes"    "no"     

This is causing local CHECK failures for diffHic, as R complains about open file connections.

Session information ``` R Under development (unstable) (2018-12-07 r75787) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: OS X El Capitan 10.11.6 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] rtracklayer_1.43.1 GenomicRanges_1.35.1 GenomeInfoDb_1.19.1 [4] IRanges_2.17.4 S4Vectors_0.21.9 BiocGenerics_0.29.1 loaded via a namespace (and not attached): [1] matrixStats_0.54.0 lattice_0.20-38 [3] XML_3.98-1.16 Rsamtools_1.35.0 [5] Biostrings_2.51.2 GenomicAlignments_1.19.1 [7] bitops_1.0-6 grid_3.6.0 [9] zlibbioc_1.29.0 XVector_0.23.0 [11] Matrix_1.2-15 BiocParallel_1.17.6 [13] tools_3.6.0 Biobase_2.43.0 [15] RCurl_1.95-4.11 DelayedArray_0.9.5 [17] compiler_3.6.0 SummarizedExperiment_1.13.0 [19] GenomeInfoDbData_1.2.0 ```
lawremi commented 5 years ago

In order to keep the internal code simple, the connections are managed via finalizers, so when the garbage collector runs, the connections are cleaned up. This is pretty baked into the design of the package now, so it would be tough to change. I guess this is new R CMD check behavior, and it is going to cause many packages to fail. Every import method could just invoke the GC on exit, but I wish there was another way.

LTLA commented 5 years ago

Hm. Sounds problematic. If it helps, this is what I see from R CMD checking diffHic:

* checking examples ... ERROR
Running examples in ‘diffHic-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: readMTX2IntSet
> ### Title: Create an InteractionSet from a BED file and Matrix Market files
> ### Aliases: readMTX2IntSet
> ### Keywords: read
> 
> ### ** Examples
> 
> library(Matrix)

Attaching package: ‘Matrix’

The following object is masked from ‘package:S4Vectors’:

    expand

## And so on...

> base::assign(".dptime", (proc.time() - get(".ptime", pos = "CheckExEnv")), pos = "CheckExEnv")
> base::cat("readMTX2IntSet", base::get(".format_ptime", pos = 'CheckExEnv')(get(".dptime", pos = "CheckExEnv")), "\n", file=base::get(".ExTimings", pos = 'CheckExEnv'), append=TRUE, sep="\t")
> cleanEx()

detaching ‘package:Matrix’

Error: connections left open:
        /var/folders/md/mvr0n48x51z1gk2qmthrb76sm8p11p/T//Rtmp17mhWi/file11756f5bc7a/regions.bed (file)
Execution halted
Shians commented 5 years ago

I occasionally get bitten by this, and from my understanding it's from a new check discussed here and added here and here. Doesn't seem like they want to change the check behaviour, which is a bit unfortunate, since I think the rscala makes a good argument for his user interface.

It just seems bizarre to expect people to stick cleanup operations in their example to satisfy the checking, when it does not represent any example of recommended usage. 🤷

LTLA commented 5 years ago

So it looks like this change to R-devel is intentional. Great, just great.

As it currently stands, all packages downstream of rtracklayer would have to \dontshow{gc()} at the end of their examples. And all their downstream packages would have to do the same, and so on...

That sounds horrible.

lawremi commented 5 years ago

Agreed. The rtracklayer approach was a hack, and it's been caught. Connections are a finite resource, so they should not be left to asynchronous garbage collection driven by memory constraints. I'm not looking forward to changing rtracklayer, but I'm not sure there is another way.

lawremi commented 5 years ago

Ok, rtracklayer 1.43.2 should avoid leaving connections open. This took some pretty major surgery, so I'm sure there will be issues.