richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr
Other
116 stars 10 forks source link

To pad or not to pad in encode64() #43

Open wlandau opened 7 years ago

wlandau commented 7 years ago

The base64url package forgoes padding in base64_urlencode() because "=" is sometimes unsafe in urls/files. This seems like an extremely small concern, but I am wondering if there is a weird edge case where we will need safer storr key filenames.

For parallelRemake and drake, I cannot pad because "=" messes with Makefile rules (drake issue 19).

richfitz commented 7 years ago

This is fixed on branch develop (though see the notes in the two commits above). Would you mind testing this and see if this works for your use case? In particular, please check that things work as expected on existing storrs; I've tried hard to make sure that nothing here breaks backward compatibility

wlandau commented 7 years ago

I checked a couple important use cases, and they work perfectly. For each of remakeGenerator::example_remakeGenerator("basic") and drake::example_drake("basic"), I ran the project with the current CRAN storr (1.0.1) and verified that all targets remained up to date after updating to the latest develop commit (7761b2e48b803767f926e61a3bf521ffd8b0e4c6). Then I started both examples from scratch using the develop storr and saw that they ran smoothly. Finally, all the devtools checks and unit tests for remakeGenerator, parallelRemake, and drake passed with no warnings, errors, or notes under the develop storr (Mac OS, R 3.3.2).

For my other use case, I was originally using storr::encode64() to solve parallelRemake issue 14 and drake issue 19, but I have since decided to use base64_urlencode() from base64url instead. (I noticed that base64url uses C code and focuses on speed. Do you think storr might see some performance gains if you swap out encode64() and decode64()?)

I also compared the some of the output keys for both versions of storr. Padded filenames from the CRAN version were not padded in the develop version, so I will close this issue.

Thanks!

richfitz commented 7 years ago

Thanks for the extensive testing - much appreciated!

I noticed that base64url uses C code and focuses on speed. Do you think storr might see some performance gains if you swap out encode64() and decode64()?

I had thought that the speed difference here would not be too bad, but it will be worth thinking about.

I can't immediately base64_urlencode because it can't pad strings which would break existing storrs. But in the future I can move to use that package (though in general I don't want to increase the dependencies any more than absolutely necessary because I it complicates deployment; depending on base64url would drag in only two more packages and they're small so that's not a huge deal.

It looks like for trivial writes, the encoding is a bit of a timesink

st <- storr::storr_rds(tempfile(), mangle_key = TRUE)
x <- paste(letters, collapse = "")
Rprof(interval = 1/200)
for (i in 1:5000) {
  st$set(x, 1, use_cache = FALSE)
}
Rprof(NULL)
summaryRprof()

which gives (relevant lines only)

$by.total
                        total.time total.pct self.time self.pct
"st$set"                     1.295     99.23     0.025     1.92
"<Anonymous>"                0.935     71.65     0.010     0.77
"writeLines"                 0.645     49.43     0.025     1.92
"self$name_key"              0.455     34.87     0.010     0.77
"encode64"                   0.385     29.50     0.160    12.26
"self$set_value"             0.380     29.12     0.025     1.92

And then comparing all the bits:

st0 <- storr::storr_rds(tempfile(), mangle_key = FALSE)
microbenchmark::microbenchmark(
  base64url::base64_urlencode(x),
  storr::encode64(x, pad = FALSE),
  st$set(x, 1, use_cache = FALSE),
  st0$set(x, 1, use_cache = FALSE))

which gives

Unit: microseconds
                             expr     min       lq      mean   median       uq
   base64url::base64_urlencode(x)   4.506   7.7395  10.43955  10.2810  11.2580
  storr::encode64(x, pad = FALSE)  33.024  39.3780  49.30678  49.6800  55.3660
  st$set(x, 1, use_cache = FALSE) 202.651 224.3310 289.50977 231.9120 242.2800
 st0$set(x, 1, use_cache = FALSE) 142.946 154.3020 166.26774 159.4535 169.6705
      max neval
   29.228   100
   87.909   100
 4422.203   100
  358.444   100

So there's a nontrivial time being taken there that could be removed. I'll continue further thoughts on this in #44

wlandau commented 7 years ago

I am happy to test where I can, and I appreciate the profiling and interest.

wlandau commented 5 years ago

I posted some more recent benchmarks at https://github.com/richfitz/storr/issues/44#issuecomment-435636219. In my experience using storr through drake, the bottleneck due to key mangling seems quite severe for storrs with lots of tiny objects.

Would you be open to a PR with base64url + manual key padding?