Closed: orgadish closed this issue 1 year ago
I've also encountered issues with slow fs performance in the past, although I wonder why it's so slow on Windows in the example at https://github.com/r-lib/fs/issues/424, given that it's not actually doing any filesystem operations. In some cases I've had to move completely away from fs and instead use base R file operations for higher performance.
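For reference, the common fs path helpers have near drop-in base R counterparts. A sketch (note that fs normalizes separators and returns fs_path objects, so outputs aren't always byte-identical):

paths <- c("base/dir1/inner/file.csv", "base/dir2/inner/file.csv")

fs::path_file(paths)  # base equivalent: basename(paths)
fs::path_dir(paths)   # base equivalent: dirname(paths)
fs::path_ext(paths)   # base equivalent: tools::file_ext(paths)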
I think your proposed function is more specific and narrow than makes sense for the memoise package -- for example, instead of allowing any kind of input object, it requires that the input x is a vector.
I guess what I was thinking is that it's just a form of memoization on the elements of the input themselves. So memoize_unique would always memoize on unique(x) rather than x, and it would also handle de- and re-duplication back to the input passed in. Perhaps both the original x and the unique x could be stored to save the unique(x) call every time, but I'm not sure that would help the primary use case.
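In other words, something like this sketch (memoize_unique here is hypothetical, built on memoise::memoise; the real interface could differ):

# Hypothetical sketch: the cache is keyed on unique(x), and the cached
# result is re-expanded to line up with the original x.
memoize_unique <- function(f) {
  fm <- memoise::memoise(f)
  function(x) {
    ux <- unique(x)
    fm(ux)[match(x, ux)]
  }
}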
@wch To follow up: the fs functions are just wrappers around the base R functions, so unsurprisingly the 30x difference in performance is maintained using the base basename, for example. See my comment in r-lib/fs#424 for the specific benchmarking.
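For anyone who wants to reproduce that comparison locally, a benchmark along these lines should show it (a sketch; numbers are omitted here, see r-lib/fs#424 for the measured results):

paths <- rep(file.path("base", paste0("dir", 1:10), "inner", "file.csv"), 1000)

bench::mark(
  fs   = fs::path_file(paths),
  base = basename(paths),
  check = FALSE  # fs returns fs_path objects, so the classes differ
)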
To anyone that comes across this and is looking for a solution: I've written a separate package, deduped, to implement this:
# if(!requireNamespace("deduped")) install.packages("deduped")
library(deduped)
N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |>
rep(N_TOTAL/10) |>
sample()
bench::mark(
direct = repeated_paths |> fs::path_dir(),
indirect = repeated_paths |> deduped(fs::path_dir)(),
iterations = 10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 direct 51.8ms 52.5ms 18.9 749.02KB 2.10
#> 2 indirect 206.3µs 213.5µs 4574. 6.13MB 0
all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")
bench::mark(
direct = all_unique_paths |> fs::path_dir(),
indirect = all_unique_paths |> deduped(fs::path_dir)(),
iterations = 10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 direct 53.6ms 54.6ms 18.3 901.88KB 0
#> 2 indirect 53.6ms 54.9ms 18.2 1.03MB 2.02
Created on 2023-10-26 with reprex v2.0.2
I'm not sure if memoise is the appropriate place for this, but when I suggested it in purrr, Hadley suggested memoization would be a better approach for the issue. However, memoization currently acts on the entire input to a function, without accounting for repeats in the input.

This issue came about when I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see r-lib/fs#424), and since most of my use of these functions comes after using readr::read_csv(files, id = "file_path"), most of the vector is duplicated. As such, I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This approach is not just helpful for fs::path_* functions.

The most straightforward approach is:
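A sketch of that pattern, assuming the usual unique()/match() idiom (x and f stand in for the input vector and the slow function):

x_unique <- unique(x)
# f runs once per distinct value; match() re-expands the results
# back to the original positions of x.
result <- f(x_unique)[match(x, x_unique)]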
I've also submitted a PR into vctrs to speed this up (see r-lib/vctrs#1857 and r-lib/vctrs#1858).

While traditional memoization is typically performed blindly on the inputs, most programming languages aren't inherently vectorized like R. Therefore, I think it would make sense for memoise to add this extra feature to its memoization, such that it caches any input that matches the unique input. Or at least to add a new function, say memoise_unique, since calculating unique every time takes some extra time.
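In the meantime, the two pieces compose: wrapping a memoised function with deduped() gets close to the proposed behaviour. A sketch (note the cache is still keyed on the whole unique(x) vector, not on individual elements):

library(deduped)
library(memoise)

# deduped() handles the de-/re-duplication; memoise() then caches calls
# keyed on the deduplicated vector, so repeated calls with the same set
# of unique values hit the cache.
path_dir_cached <- deduped(memoise(fs::path_dir))
path_dir_cached(repeated_paths)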