richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

illegal file names, silent failure #119

Open r2evans opened 4 years ago

r2evans commented 4 years ago

Windows doesn't generally like colons in filenames. With storr_rds, though, keys with embedded colons are accepted by $set(), but the write silently fails, so a later $get() errors.

Windows:

p <- tempfile()
st <- storr::storr_rds(p)
st$set("mt", mtcars[1:2,])
st$set("foo:bar", iris[1:3,])
list.files(file.path(p, "keys", "objects"))
# [1] "mt"
st$get("foo:bar")
# Error: key 'foo:bar' ('objects') not found

Linux:

p <- tempfile()
st <- storr::storr_rds(p)
st$set("mt", mtcars[1:2,])
st$set("foo:bar", iris[1:3,])
list.files(file.path(p, "keys", "objects"))
# [1] "foo:bar" "mt"     
st$get("foo:bar")
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa

I would expect one of two behaviors:

  1. The key is encoded so that Windows will accept it as a filename on its filesystem. This introduces the potential for key collisions, so it is not without risk.
    • Possible workaround: use the hash of the key as the filename. While this does not remove the possibility of collision, it does allow arbitrarily long key strings (filename length is limited to something like 255 characters on many filesystems). (One of my use-cases involves keying on multiple UUIDs, 36 characters each; while I do not foresee combining more than six for a single key, I don't think that is impossible.)
  2. KeyError or KeyWarning.

Number 1 suggests you would also have to store the actual key somehow, which would likely incur another file read/write in the loop, not something you're eager to do. Perhaps number 2 is best: after file creation is attempted, if !file.exists(...) then raise a warning or error.
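A minimal sketch of what number 2 could look like as a user-side guard rather than a change inside storr (set_checked is a hypothetical helper; the keys/objects layout is taken from the listing above):

# hypothetical guard (not storr API): after $set(), confirm the key file
# actually appeared on disk, and warn if it did not
set_checked <- function(st, path, key, value, namespace = "objects") {
  st$set(key, value, namespace = namespace)
  key_file <- file.path(path, "keys", namespace, key)
  if (!file.exists(key_file)) {
    warning(sprintf("key '%s' was not written to disk", key), call. = FALSE)
  }
  invisible(value)
}

set_checked(st, p, "foo:bar", iris[1:3, ])
# expected to warn on Windows (given the listing above); silent on Linux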

At a minimum, I suggest that this externally-imposed limitation should be documented, instead of the current docs which say:

'key': The key name.  Can be any string.

(https://kb.acronis.com/content/39790 suggests that the colon is the only disallowed character, though I don't know that with certainty. Edit: apparently the Japanese yen symbol "¥" is also verboten; the same page suggests '\', '/', '.', '?', and '*' are also bad.)

wlandau-lilly commented 4 years ago

storr's key mangling prevents this, e.g. storr_rds(mangle_key = TRUE). The downside is that it uses Base64 encoding implemented in plain R, which is slow, and it does not prevent #95. The base64url package has fast compiled code for Base64 and Base32 encoding.

drake uses Base32 encoding from base64url for files and namespaced functions, but it does not encode target names or non-namespaced function names because those are usually legal file names on Windows.
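For reference, a quick run of the original example with mangling turned on (the encoded filename shown below is illustrative; the exact alphabet and padding are storr internals):

p <- tempfile()
st <- storr::storr_rds(p, mangle_key = TRUE)
st$set("foo:bar", iris[1:3, ])
st$get("foo:bar")   # works on Windows too, since the file name is an encoding of the key
list.files(file.path(p, "keys", "objects"))
# something like "Zm9vOmJhcg" -- an encoded form of "foo:bar", not the raw string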

r2evans commented 4 years ago

Ok, I had seen mangle_key but didn't investigate it, thank you. Yes, the case-sensitivity would definitely be a problem. Another problem with Base64 encoding is that it pushes keys toward typical path (not just file) name limits: on Windows, I think a filename can be up to 255 characters or so, but the full path cannot exceed 260, and Base64 tends to increase string length by about 33%.

Here's a question, though: if you're comfortable assuming exclusive write access to the storage directory, and we're talking about mangling, why do we need anything other than an incrementing integer? (Perhaps the assumption of exclusive access is not proper, I'm not familiar with the intent of storr or your drake tools that use it.)

wlandau commented 4 years ago

> Ok, I had seen mangle_key but didn't investigate it, thank you. Yes, the case-sensitivity would definitely be a problem. Another problem with Base64 encoding is that it pushes keys toward typical path (not just file) name limits: on Windows, I think a filename can be up to 255 characters or so, but the full path cannot exceed 260, and Base64 tends to increase string length by about 33%.

Yeah, I doubt we are ever going to have it both ways on Windows, re #94.

> Here's a question, though: if you're comfortable assuming exclusive write access to the storage directory, and we're talking about mangling, why do we need anything other than an incrementing integer? (Perhaps the assumption of exclusive access is not proper, I'm not familiar with the intent of storr or your drake tools that use it.)

storr_rds() is designed for parallel writes, and drake relies on it. Non-RDS backends like storr_dbi() should be more permissive.
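To spell out why a bare counter and parallel writes do not mix (illustrative only, not storr code): two workers that scan the store at the same moment will both claim the same next id.

# naive counter scheme: pick the next integer not yet used as a filename
next_id <- function(path) {
  existing <- suppressWarnings(as.integer(list.files(path)))
  max(c(0L, existing), na.rm = TRUE) + 1L
}
# worker A: next_id(p)  # -> say, 42
# worker B: next_id(p)  # -> also 42, if it scans before A writes
# both then write file "42", and one silently clobbers the other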

r2evans commented 4 years ago

Okay, that makes sense. Some more thoughts, since I'm brainstorming.

If I understand the problem correctly, the challenge (for a filesystem-based store) is to map arbitrary key strings onto file names that are both legal on the filesystem and unique, while still supporting parallel writes.

Two more thoughts:

  1. Maildir-like filenames (the rest of the maildir structure is not required, just the filename convention that assures uniqueness).
  2. UUIDs (via the uuid package)

My thought is to create directories (instead of files) named using one of the above, and within each directory house a key file (text or whatever, doesn't matter) and an object.rds file (and whatever else might be useful). There is an initial buy-in cost of scanning all directories and reading each one's key file, which will be a little more expensive than just scanning for filenames (with or without mangling).
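A sketch of the write path under that layout (set_dir, key, and object.rds are just the convention described above; the UUID is one of the two naming options):

# create a uniquely named directory per write and drop the raw key and the
# serialized object inside it, so the key never has to be a legal filename
set_dir <- function(root, key, value) {
  d <- file.path(root, uuid::UUIDgenerate(use.time = TRUE))
  dir.create(d, recursive = TRUE)
  writeLines(key, file.path(d, "key"))
  saveRDS(value, file.path(d, "object.rds"))
  invisible(d)
}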

uuid:: has some computational overhead, to be fair. Even a simple test shows it's not fast.

sss <- "some insanely long filename really long with colons::: and stuff that should not be used as a real filename but we cannot risk collisions anyway; hah"
microbenchmark::microbenchmark(
  a=replicate(100, base64url::base64_urlencode(sss)),
  b=replicate(100, uuid::UUIDgenerate(use.time = TRUE))
)
### results on windows, R-3.5.3
# Unit: microseconds
#  expr  min     lq     mean  median      uq    max neval
#     a  617  627.6  658.206  634.00  643.85  992.3   100
#     b 7066 8374.0 8704.670 8723.55 9007.65 9938.8   100

### results on linux, R-3.5.3
# Unit: microseconds
#  expr     min      lq      mean   median      uq      max neval
#     a 922.185 934.993  976.7915  944.738  993.08 1322.549   100
#     b 982.342 997.552 1037.7294 1006.620 1024.46 1717.477   100

(use.time = TRUE is required on Windows.) But this cost is paid at generation time only, so read performance is not affected (any more than with maildir).

If I understand the normal costs of storr_rds: there's a one-time creation of the directory structure; a per-write cost of creating the key file (mangled or otherwise) and the .rds file; and the cost of maintaining a key <--> hash relationship in memory. In both schemes the key file and the rds are co-located, so the biggest difference in "cost" is that the in-memory mapping becomes key <--> uuid or key <--> dirname instead.
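For completeness, the buy-in scan could look something like this (again a sketch of the idea, not storr internals); after it, a get() is one lookup plus one readRDS():

# one-time scan: read every directory's key file and keep key <--> dirname
# in memory; later reads are a named-vector lookup plus a readRDS()
build_index <- function(root) {
  dirs <- list.dirs(root, recursive = FALSE)
  keys <- vapply(dirs, function(d) readLines(file.path(d, "key"), n = 1), character(1))
  setNames(dirs, keys)
}
get_dir <- function(index, key) readRDS(file.path(index[[key]], "object.rds"))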

Perhaps I'm over-thinking a windows workaround. Cheers.

r2evans commented 4 years ago

(Regardless, I still contend that the package should do either or both of the following: (1) document the fact that some keys cannot be used as filenames; and (2) actually error (or at least warn) if the key file could not be created for any reason.)

wlandau commented 4 years ago

https://github.com/richfitz/storr/issues/119#issuecomment-586400490 is clever, but the slowness concerns me (drake needs to iterate over thousands of targets quickly).

r2evans commented 4 years ago

I understand. While not free, I wouldn't think 1 millisecond per 100 UUIDs (on not-Windows) would be that expensive. (I have not tried to quantify the cost of the indirect key files.) What is drake's targets-per-second expectation? Even on Windows (my laptop), I can generate about 11,000 UUIDs/sec; on Linux, about 90,000 UUIDs/sec. (I'm sure I'm under-estimating the throughput expectation of drake.)

wlandau commented 4 years ago

Although not always realistic, the goal has always been to approach the speed of GNU Make. In general, I am reluctant to relax performance requirements once drake has achieved a certain minimum speed.

r2evans commented 4 years ago

Let's keep this conversation going!

I think my suggestion is not so much about the method (base32, uuid, whatever) and more about using a unique (not key-derived) directory instead of just a file.

Inside this directory, there would be at least two files: key and data.rds. If we look at this through the lens of #5 (specifically the comment about "traits"), it shouldn't be that hard to imagine adding arbitrary files such as ttl for key/data expiration, or a hash file (à la #93).

Though I don't know if I'll win a discussion about encoding versus UUIDs or such, I had a related thought: tempfile(). It is not (currently) an atomic unique-and-create operation (unlike /bin/mktemp), but it's generally pretty good, and either a couple of safeguards (sketched after the benchmarks below) or a quick Rcpp implementation could provide that. Either way, it's pretty fast.

sss <- "some insanely long filename really long with colons::: and stuff that should not be used as a real filename but we cannot risk collisions anyway; hah"
microbenchmark::microbenchmark(
  tf  = replicate(1000, tempfile()),
  b64 = replicate(1000, base64url::base64_urlencode(sss)),
  b32 = replicate(1000, base64url::base32_encode(sss)),
  u   = replicate(1000, uuid::UUIDgenerate(F))
)
# Unit: milliseconds
#  expr       min        lq     mean    median        uq       max neval
#    tf  5.532874  5.668098  5.95888  5.766276  5.889214  12.36697   100
#   b64  8.964874  9.078771 10.14977  9.429460  9.556842  17.34754   100
#   b32 25.294406 25.724983 29.20379 25.928418 28.271693 172.83802   100
#     u 21.542759 21.875666 23.11226 22.155427 22.833333  31.53689   100
microbenchmark::microbenchmark(
  tf  = tempfile(),
  b64 = base64url::base64_urlencode(sss),
  b32 = base64url::base32_encode(sss),
  u   = uuid::UUIDgenerate(F),
  times = 1000
)
# Unit: microseconds
#  expr    min      lq      mean  median      uq     max neval
#    tf  4.056  6.1855  7.358272  7.1215  8.2285  57.033  1000
#   b64  9.268 12.5995 14.137023 13.7390 14.9970 230.238  1000
#   b32 25.681 29.9730 32.338592 31.6935 33.3975 202.957  1000
#     u 20.439 24.1350 25.808665 25.5345 27.0035  84.724  1000

sss <- "something reasonable"
microbenchmark::microbenchmark(
  tf  = tempfile(),
  b64 = base64url::base64_urlencode(sss),
  b32 = base64url::base32_encode(sss),
  u   = uuid::UUIDgenerate(F),
  times = 1000
)
# Unit: microseconds
#  expr    min     lq      mean  median      uq    max neval
#    tf  4.144  5.339  5.978619  5.9835  6.4305 40.971  1000
#   b64  7.398  9.398 10.014437 10.0110 10.5450 22.701  1000
#   b32 17.444 19.331 20.643182 20.6410 21.5545 50.415  1000
#     u 19.262 21.004 21.972433 21.8425 22.5410 72.105  1000
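As for the "couple of safeguards" mentioned above, a plain-R sketch could lean on dir.create() returning FALSE (with a warning) when the path already exists; an Rcpp version would just do the same check-and-create at the C level:

# keep asking tempfile() for candidates until dir.create() actually wins the race
new_unique_dir <- function(root) {
  repeat {
    d <- tempfile(tmpdir = root)
    if (suppressWarnings(dir.create(d))) {
      return(d)
    }
  }
}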