"Memoization" on elements in a vector input.

orgadish commented 1 year ago

I'm not sure if memoise is the appropriate place for this but when I suggested it in purrr, Hadley suggested memoization would be a better approach for the issue. However, memoization currently acts on the entire input to a function, without accounting for repeats in the input.

This issue came about when I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see r-lib/fs#424), and since most of my use case of these functions is after using readr::read_csv(files, .id="file_path"), most of the vector is duplicated. As such, I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This approach is not just helpful for fs::path_ functions.

The most straightforward approach is:

with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
}

I've also submitted a PR into vctrs to speed this up (see r-lib/vctrs#1857 and r-lib/vctrs#1858).

While traditional "Memoization" is typically performed blindly on the inputs, most programming languages aren't inherently vectorized like R. Therefore, I think it would make sense for memoise to add this extra feature to its memoization, such that it cached any input that matches the unique input. Or at least a new function, say memoise_unique since calculating unique every time takes some extra time.

wch commented 1 year ago

I've also encountered issues with slow fs performance in the past, although I wonder why it's so slow on Windows in the example at https://github.com/r-lib/fs/issues/424, given that it's not actually doing any filesystem operations. In some cases I've had to move completely away from using fs, and instead use base R file operations for higher performance.

I think your proposed function is more specific and narrow than makes sense for the memoise package -- for example, instead of allowing any kind of input object, it requires that the input x is a vector.

orgadish commented 1 year ago

I guess what I was thinking is it's just a form of memoization on the elements of the input themselves. So memoize_unique would always memoize on unique(x) rather than x. It would then also handle de- and re-duplication back to the input passed in.

Perhaps both the original x and the unique x can be stored to save the unique(x) call every time, but I'm not sure that would help the primary use case.

orgadish commented 1 year ago

@wch To follow up — the fs functions are just wrappers around the base R functions and so unsurprisingly, the 30x difference in performance is maintained using the base basename, for example. See my comment in r-lib/fs#424 for the specific benchmarking.

orgadish commented 1 year ago

To anyone that comes across this and is looking for a solution, I've written a separate package, deduped to implement this:

# if(!requireNamespace("deduped")) install.packages("deduped")

library(deduped)

N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |> 
  rep(N_TOTAL/10) |> 
  sample()

bench::mark(
  direct = repeated_paths |> fs::path_dir(),
  indirect = repeated_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0

all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")

bench::mark(
  direct = all_unique_paths |> fs::path_dir(),
  indirect = all_unique_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0   
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02

^{Created on 2023-10-26 with reprex v2.0.2}

r-lib / memoise

"Memoization" on elements in a vector input. #149