miraisolutions / xlconnect

XLConnect: Excel Connector for R
129 stars 32 forks source link

Feature: Memoize to sidecar cache file #228

Open chrysn opened 2 days ago

chrysn commented 2 days ago

What's the problem this feature will solve?

XLConnect is slow and memory hungry in some scenarios; that seems to be a known and nontrivial concern.

When the Excel files that are loaded change infrequently compared to R code changes, and data is only read and not written to Excel files, a viable solution might be the creation of sidecar memoization files.

Describe the solution you'd like

I suggest that the load function gets a new boolean parameter "memoize" (probably off-by-default). If that parameter is set, XLConnect comes up with a file name (maybe "oldfilename.xlsx.cache"). If that cache file does not exist, or is older than the file being loaded, regular XLConnect loading is performed, and the resulting data frame saveRDS()'d in the cache file, along with the list of any arguments that influence loading outcome. Otherwise, the cache file is loaded. The loading step contains a check for whether the arguments stored there match the ones in the function; if not, it falls back to regular loading and cache recreation.

Alternative Solutions

There are some variations on how the file name is best created, and what information is stored in the cache file:

A viable alternative is creating such a memoization function ad hoc or in some personal library; however, that adds hard-to-explain code to scripts primarily created by users who are more familiar with domain concepts than programming concepts.

Additional context

I'm unsure whether XLConnect supports loading data from Excel files that is spread out over multiple files (say, cross-file references or anything like that). If so, the cache file may need to keep a list of files that were loaded during the initial creation run, and check all their dates before using the cached file.

spoltier commented 10 hours ago

Thanks for your request, @chrysn!

It seems generic memoization CRAN packages could potentially address your need without much specific code.

For example, this seems to fulfill your main requirements:

library(memoise)
library(XLConnect)

# Create a filesystem cache in an appropriate location
cache_dir <- file.path(
  Sys.getenv("XDG_CACHE_HOME", "~/.cache"),
  "xlconnect"
)

xl_cache <- cache_filesystem(cache_dir)

# Memoize the XLConnect loading function
memoized_loadWorkbook <- memoise(loadWorkbook, 
                                cache = xl_cache,
                                # Hash function will include file modification time
                                hash = function(x) {
                                  rlang::hash(list(x, file.mtime(x)))
                                })

See r-lib/memoise or https://www.r-bloggers.com/2020/04/caching-in-r/ for other packages