grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/
32 stars 13 forks source link

Multiple `ensembl-marts-109` entries in biomaRt's cache break biomaRt:::.listEnsembl() #74

Closed hpages closed 1 year ago

hpages commented 1 year ago

@grimbough Hi Mike,

biomaRt:::.listEnsembl() fails at the moment on Bioconductor builder nebbiolo2, with the following error:

library(biomaRt)
biomaRt:::.listEnsembl()
# Error in if (Sys.time() - as.POSIXct(cache_entry$create_time) < 7) { : 
#   the condition has length > 1

This breaks listEnsembl() and useEnsembl() which both use biomaRt:::.listEnsembl() internally. See for example the current CHECK error for GenomicFeatures on the BioC 3.16 daily builds.

This error seems to be caused by the fact that biomaRt:::.listEnsembl() obtains cache_entry with:

cache_entry <- bfcquery(x = bfc, query = paste0("ensembl-marts-", version_num))

which can sometimes return more than one cache entry, like on nebbiolo2 where it currently returns the 2 following entries:

library(BiocFileCache)
library(biomaRt)

cache <- biomaRt:::.biomartCacheLocation()
bfc <- BiocFileCache::BiocFileCache(cache, ask = FALSE)
version_num <- biomaRt:::.currentEnsemblVersion()["version"]
cache_entry <- bfcquery(x = bfc, query = paste0("ensembl-marts-", version_num))

dim(cache_entry)
# [1]  2 10

as(cache_entry, "DataFrame")
# DataFrame with 2 rows and 10 columns
#           rid             rname         create_time         access_time
#   <character>       <character>         <character>         <character>
# 1     BFC1445 ensembl-marts-109 2023-02-23 01:59:14 2023-02-23 01:59:14
# 2     BFC1446 ensembl-marts-109 2023-02-23 01:59:15 2023-02-23 01:59:15
#                    rpath       rtype                  fpath last_modified_time
#              <character> <character>            <character>          <numeric>
# 1 ~/.cache/biomaRt/311..    relative /tmp/RtmpN9VmMb/file..                 NA
# 2 ~/.cache/biomaRt/320..    relative /tmp/Rtmp7oyu2q/file..                 NA
#          etag   expires
#   <character> <numeric>
# 1          NA        NA
# 2          NA        NA

I don't know if having more than 1 entry for ensembl-marts-109 is a "normal state" for the cache. If not then maybe the issue needs to be reported to the BiocFileCache maintainers (@lshep). But if it is, then it seems that biomaRt:::.listEnsembl() would need to be modified to handle this situation.

@jwokaty @lshep Let's not flush biomaRt's cache on nebbiolo2 (located at ~biocbuild/.cache/biomaRt) until this is sorted out.

Thanks all!

H.

sessionInfo():

R Under development (unstable) (2023-02-14 r83833)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
LAPACK: /home/biocbuild/bbs-3.17-bioc/R/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.55.0

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3         utf8_1.2.3             generics_0.1.3        
 [4] bitops_1.0-7           xml2_1.3.3             RSQLite_2.3.0         
 [7] stringi_1.7.12         hms_1.1.2              digest_0.6.31         
[10] magrittr_2.0.3         fastmap_1.1.1          blob_1.2.3            
[13] progress_1.2.2         AnnotationDbi_1.61.2   GenomeInfoDb_1.35.15  
[16] DBI_1.1.3              httr_1.4.5             purrr_1.0.1           
[19] fansi_1.0.4            XML_3.99-0.13          Biostrings_2.67.0     
[22] cli_3.6.0              rlang_1.0.6            crayon_1.5.2          
[25] dbplyr_2.3.1           XVector_0.39.0         Biobase_2.59.0        
[28] ellipsis_0.3.2         bit64_4.0.5            withr_2.5.0           
[31] cachem_1.0.7           tools_4.3.0            memoise_2.0.1         
[34] dplyr_1.1.0            GenomeInfoDbData_1.2.9 filelock_1.0.2        
[37] BiocGenerics_0.45.0    curl_5.0.0             vctrs_0.5.2           
[40] R6_2.5.1               png_0.1-8              stats4_4.3.0          
[43] lifecycle_1.0.3        BiocFileCache_2.7.2    zlibbioc_1.45.0       
[46] KEGGREST_1.39.0        stringr_1.5.0          S4Vectors_0.37.4      
[49] IRanges_2.33.0         bit_4.0.5              pkgconfig_2.0.3       
[52] pillar_1.8.1           glue_1.6.2             tibble_3.2.0          
[55] tidyselect_1.2.0       compiler_4.3.0         prettyunits_1.1.1     
[58] RCurl_1.98-1.10       
grimbough commented 1 year ago

Hi @hpages thanks for the detailed diagnostics. The mysterious: "Error: processing vignette 'accessing_ensembl.Rmd' failed with diagnostics: object 'ensembl' not found" from the biomaRt build report wasn't much help to me!

Have two entrie for the ensembl-marts-109 is not expected. The code in .listEnsembl() is supposed to identify if the cache exists and either use it if it's "fresh" or remove and reproduce the cached version if it's "stale".

I notice that the two entries were generated only 1 second apart. Is it possible that two separate processes are calling .listEnsembl() with the second happening before the cache entry has been generated by the first? Or maybe the transient Ensembl flakiness has introduced some invalid state?

I'll add some error handling to check for the multiple entry case and simply delete everything, I don't think there's any real drawback to doing that, but it'd be nice to know how we ended up with two entries in the first place.

hpages commented 1 year ago

Is it possible that two separate processes are calling .listEnsembl() with the second happening before the cache entry has been generated by the first?

Of course. During the daily builds, each build machine runs dozens of R CMD build or R CMD check commands in parallel. This could also happen on the user machine, although less likely there. I remember seeing some discussion about BioFileCache and concurrent write access to the cache but I don't know the details. This for example could be relevant: https://github.com/Bioconductor/BiocFileCache/issues/42 What we observe here could just be a manifestation of that.

I'll add some error handling to check for the multiple entry case and simply delete everything

Sounds good, thanks!

grimbough commented 1 year ago

It looks like my patch has at least resolved this specific issue (I'm going to assume that every package on the Linux builder reporting warnings isn't my fault!)

hpages commented 1 year ago

Thanks Mike!

I'm going to assume that every package on the Linux builder reporting warnings isn't my fault!

This was an error, not a warning.