Open r2evans opened 4 years ago
storr
's key mangling prevents this, e.g. storr_rds(mangle_key = TRUE)
. The downside is that it uses Base64 encoding in R, which is slow, and it does not prevent #95. The base64url
package has fast compiled code for Base64 and Base32 encoding.
drake
uses Base32 encoding from base64url
for files and namespaced functions, but it does not encode target names or non-namespaced function names because those are usually legal file names on Windows.
Ok, I had seen mangle_key
but didn't investigate it, thank you. Yes, the case-sensitivity would definitely be a problem. Another problem with base64 encoding is that it is nearing typical path (not file) name limits. On windows, I think a filename can be upwards of 255 or so, but its full path cannot exceed 260. I think Base64 tends to increase the string length by about 33%.
Here's a question, though: if you're comfortable assuming exclusive write access to the storage directory, and we're talking about mangling, why do we need anything other than an incrementing integer? (Perhaps the assumption of exclusive access is not proper, I'm not familiar with the intent of storr
or your drake
tools that use it.)
Ok, I had seen mangle_key but didn't investigate it, thank you. Yes, the case-sensitivity would definitely be a problem. Another problem with base64 encoding is that it is nearing typical path (not file) name limits. On windows, I think a filename can be upwards of 255 or so, but its full path cannot exceed 260. I think Base64 tends to increase the string length by about 33%.
Yeah, I doubt we are ever going to have it both ways on Windows, re #94.
Here's a question, though: if you're comfortable assuming exclusive write access to the storage directory, and we're talking about mangling, why do we need anything other than an incrementing integer? (Perhaps the assumption of exclusive access is not proper, I'm not familiar with the intent of storr or your drake tools that use it.)
storr_rds()
is designed for parallel writes, and drake
relies on it. Non-RDS backends like storr_dbi()
should be more permissive.
Okay, that make sense. Some more thoughts, since I'm brainstorming.
If I understand the problem correctly, the challenge (for a filesystem-based store) is
Two more thoughts:
uuid
My thought is to create directories (instead of files) using the above, and within each directory house a key
file (text or whatever, doesn't matter) and object.rds
file (and whatever else might be useful). There is an initial buy-in cost of scanning all directories and reading each's key
file, which will be a little more expensive than just scanning for filenames (with or without mangling).
uuid::
has some computational overhead, to be fair. Even a simple test shows it's not fast.
sss <- "some insanely long filename really long with colons::: and stuff that should not be used as a real filename but we cannot risk collisions anyway; hah"
microbenchmark::microbenchmark(
a=replicate(100, base64url::base64_urlencode(sss)),
b=replicate(100, uuid::UUIDgenerate(use.time = TRUE))
)
### results on windows, R-3.5.3
# Unit: microseconds
# expr min lq mean median uq max neval
# a 617 627.6 658.206 634.00 643.85 992.3 100
# b 7066 8374.0 8704.670 8723.55 9007.65 9938.8 100
### results on linux, R-3.5.3
# Unit: microseconds
# expr min lq mean median uq max neval
# a 922.185 934.993 976.7915 944.738 993.08 1322.549 100
# b 982.342 997.552 1037.7294 1006.620 1024.46 1717.477 100
(The use.time=TRUE
is required on windows.)
But this is in generation only, which means that read-performance is not affected (any more than maildir).
If I understand the normal costs of storr_rds
, there's a one-time creation of the directory structure; a per-write cost of creating the keyfile (mangled or otherwise) and the .rds
file; and the cost of maintaining a key <--> hash
relationship in memory. For both of these, the keyfile and rds are co-located, so the biggest difference in "cost" is that the in-memory mapping is actually just key <--> uuid
or key <--> dirname
.
Perhaps I'm over-thinking a windows workaround. Cheers.
(Regardless, I still contend that the package should either/both: (1) document better the problem with some keys as filenames; and (2) actually error (or at least warn) if the file could not be created for any reason.)
https://github.com/richfitz/storr/issues/119#issuecomment-586400490 is clever, but the slowness concerns me (drake
needs to quickly iterate over thousands of targets quickly).
I understand. While not free, I wouldn't think 1 millisecond for 100 UUIDs (on not-windows) would be that expensive. (I have not tried to quantify the cost of indirect keyfiles.) What is drake
's targets-per-second expectation? Even on windows (my laptop), I can generate 11000 UUIDs/sec. On linux, 90000 UUIDs/sec. (I'm sure I'm under-estimating the throughput expectation of drake
.)
Although not always realistic, the goal has always been to approach the speed of GNU Make. In general, I feel reluctant to relax after drake
achieves a certain speed minimum.
Let's keep this conversation going!
I think my suggestion is not so much about the method (base32, uuid, whatever) and more-so about using a unique (not key-derived) directory instead of just a file.
Inside this directory, there would be at least two files: key
and data.rds
. If we look at this through the lens of #5 (specifically the comment about "traits"), it shouldn't be that hard to imagine adding arbitrary files such as ttl
for key/data expiration, or a hash
file (ala #93 ).
Though I don't know if I'll win a discussion about encoding versus UUIDs or such, I had a related thought: tempfile()
. It is not (currently) an atomic unique-and-created operation (unlike /bin/mktemp
), but it's generally pretty good, and either a couple of safe-guards or a quick Rcpp implementation can provide that. Either way, it's pretty fast.
sss <- "some insanely long filename really long with colons::: and stuff that should not be used as a real filename but we cannot risk collisions anyway; hah"
microbenchmark::microbenchmark(
tf = replicate(1000, tempfile()),
b64 = replicate(1000, base64url::base64_urlencode(sss)),
b32 = replicate(1000, base64url::base32_encode(sss)),
u = replicate(1000, uuid::UUIDgenerate(F))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# tf 5.532874 5.668098 5.95888 5.766276 5.889214 12.36697 100
# b64 8.964874 9.078771 10.14977 9.429460 9.556842 17.34754 100
# b32 25.294406 25.724983 29.20379 25.928418 28.271693 172.83802 100
# u 21.542759 21.875666 23.11226 22.155427 22.833333 31.53689 100
microbenchmark::microbenchmark(
tf = tempfile(),
b64 = base64url::base64_urlencode(sss),
b32 = base64url::base32_encode(sss),
u = uuid::UUIDgenerate(F),
times = 1000
)
# Unit: microseconds
# expr min lq mean median uq max neval
# tf 4.056 6.1855 7.358272 7.1215 8.2285 57.033 1000
# b64 9.268 12.5995 14.137023 13.7390 14.9970 230.238 1000
# b32 25.681 29.9730 32.338592 31.6935 33.3975 202.957 1000
# u 20.439 24.1350 25.808665 25.5345 27.0035 84.724 1000
sss <- "something reasonable"
microbenchmark::microbenchmark(
tf = tempfile(),
b64 = base64url::base64_urlencode(sss),
b32 = base64url::base32_encode(sss),
u = uuid::UUIDgenerate(F),
times = 1000
)
# Unit: microseconds
# expr min lq mean median uq max neval
# tf 4.144 5.339 5.978619 5.9835 6.4305 40.971 1000
# b64 7.398 9.398 10.014437 10.0110 10.5450 22.701 1000
# b32 17.444 19.331 20.643182 20.6410 21.5545 50.415 1000
# u 19.262 21.004 21.972433 21.8425 22.5410 72.105 1000
Windows doesn't generally like colons in the filenames. Using
storr_rds
, though, keys with embedded colons are accepted but silently fail.Windows:
Linux:
I would expect one of two behaviors:
KeyError
orKeyWarning
.Number 1 suggests you would have to store the actual key somehow as well, which would likely incur another file read/write in the loop, not something you're eager to do. Perhaps number 2 is best: after file creation is attempted, if
!file.exists
then send a warning or error.At a minimum, I suggest that this externally-imposed limitation should be documented, instead of the current docs which say:
(https://kb.acronis.com/content/39790 suggests that the colon is the only not-allowed character, though I don't know with certainty. Edit: apparently the Japanese yen symbol "¥" is also verboten; that same page suggests '\', '/', '.', '?', and '*' are also bad.)