Closed · yutannihilation closed this issue 3 years ago
We have a commitment to support the last 4 versions of R. I think we should look into dropping digest.
Or alternatively help the digest package support 3.2? It seems it was a minor issue that caused them to bump the required R version: https://github.com/eddelbuettel/digest/issues/159
Thanks, but, though it doesn't seem very difficult to implement a new startsWith(), I don't feel this is something worth asking the maintainer to release a new version for, because, as @thomasp85 says, R 3.2 is outside the tidyverse's "last 4 versions" rule.
Thomas said "dropping digest" instead of "dropping support for R 3.2". Maybe we're all a bit confused. I just looked up the past R versions and I agree we should be fine with dropping 3.2:
Thomas said "dropping digest"
Oh, true... Definitely, I'm a bit confused, sorry.
Sorry. The confusion was my doing. We are not bound by our support commitment. Still, I think it would be beneficial to investigate whether dropping digest is viable from a performance point of view.
So digest() basically calls serialize() on the object and then creates a hash value. Can we just use the serialized string?
x <- list(1, TRUE, "hello")
paste0(serialize(x, NULL, TRUE), collapse = "")
#> [1] "410a330a3236323134360a3139373838380a350a5554462d380a31390a330a31340a310a310a31300a310a310a31360a310a3236323135330a350a68656c6c6f0a"
Created on 2020-10-25 by the reprex package (v0.3.0)
One problem may be that the string can be of arbitrary length. One way around this would be to write a digestlight package that takes the string and runs it through a C++ hash function. That would be maybe 100 lines of code total, including everything needed to have a complete package.
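As a rough illustration of the digestlight idea (purely a sketch, not a proposal for the actual package), the serialized bytes could be run through a simple hash to get a fixed-length key. The function name hash_serialized and the djb2-style hash are made up for this example; a real package would implement the byte loop in C/C++ for speed.

```r
# Toy sketch of a "digestlight"-style hash: serialize the object, then fold
# the raw bytes through a djb2-style hash. Illustrative only -- the R-level
# byte loop is slow, so a real package would do this part in C/C++.
hash_serialized <- function(x) {
  bytes <- as.integer(serialize(x, NULL))
  h <- 5381
  for (b in bytes) {
    h <- (h * 33 + b) %% 2^32  # keep the accumulator within 32 bits
  }
  # Format as a fixed-length 8-character hex string (two 16-bit halves)
  sprintf("%04x%04x", as.integer(h %/% 65536), as.integer(h %% 65536))
}

hash_serialized(list(1, TRUE, "hello"))
```

The point is only that the result has a bounded length regardless of the input, which is what the serialized string itself lacks.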
Some benchmarking to see how fast the various parts are: serialize() doesn't seem to be the bottleneck, but just calling paste0() on the serialized object is quite expensive.
x <- list(1, TRUE, "hello", 1:10)
microbenchmark::microbenchmark(
digest::digest(x),
paste0(serialize(x, NULL, TRUE), collapse = ""),
serialize(x, NULL, TRUE)
)
#> Unit: microseconds
#> expr min lq mean median
#> digest::digest(x) 36.277 41.181 60.45034 43.9495
#> paste0(serialize(x, NULL, TRUE), collapse = "") 37.960 41.153 46.93491 43.1825
#> serialize(x, NULL, TRUE) 7.588 10.590 13.70226 11.9500
#> uq max neval
#> 72.332 524.315 100
#> 50.000 97.516 100
#> 15.307 37.839 100
x <- list(1, TRUE, "hello", 1:10000)
microbenchmark::microbenchmark(
digest::digest(x),
paste0(serialize(x, NULL, TRUE), collapse = ""),
serialize(x, NULL, TRUE)
)
#> Unit: microseconds
#> expr min lq mean
#> digest::digest(x) 238.258 263.8265 280.24610
#> paste0(serialize(x, NULL, TRUE), collapse = "") 37.359 40.7460 45.37433
#> serialize(x, NULL, TRUE) 7.605 8.5340 10.56713
#> median uq max neval
#> 268.4350 283.1045 408.335 100
#> 43.0250 45.0730 90.078 100
#> 10.1015 11.7575 25.245 100
Created on 2020-10-25 by the reprex package (v0.3.0)
I didn't notice hashing is this costly... Interesting. What about just using rawToChar()? Calculating a BASE64 string on the non-ASCII binary seems the fastest, but considering the cost of ::, there's no significant difference.
library(base64enc)
library(digest)
x <- list(1, TRUE, "hello", 1:10000)
microbenchmark::microbenchmark(
digest_w_colon = digest::digest(x),
digest_wo_colon = digest(x),
paste = paste0(serialize(x, NULL, TRUE), collapse = ""),
rawToChar = rawToChar(serialize(x, NULL, TRUE)),
base64_w_colon = base64enc::base64encode(serialize(x, NULL, FALSE)),
base64_wo_colon = base64encode(serialize(x, NULL, FALSE)),
serialize_ascii = serialize(x, NULL, TRUE),
serialize_non_ascii = serialize(x, NULL, FALSE)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> digest_w_colon 192.649 206.1580 212.11704 210.9590 219.8110 247.720 100
#> digest_wo_colon 185.873 199.6800 206.76998 203.5015 211.4220 500.502 100
#> paste 49.019 52.3120 54.04512 53.3200 54.8050 69.588 100
#> rawToChar 11.499 12.5710 13.71171 13.4775 14.4275 19.482 100
#> base64_w_colon 11.293 13.5240 15.12000 14.7140 15.8705 28.796 100
#> base64_wo_colon 5.179 6.2515 9.19496 7.3140 8.3750 183.277 100
#> serialize_ascii 10.583 11.4225 12.76870 12.4195 13.7135 26.942 100
#> serialize_non_ascii 3.029 3.5310 4.43354 4.1255 4.6590 8.929 100
#> cld
#> e
#> e
#> d
#> bc
#> c
#> ab
#> bc
#> a
Created on 2020-10-27 by the reprex package (v0.3.0)
If I replace digest::digest() with rawToChar(serialize(x, NULL, TRUE)), I get this error:
Error in do.call("unit.c", lapply(ggrobs, function(g) sum(g$widths))) :
variable names are limited to 10000 bytes
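That limit can be reproduced directly: R refuses to create a symbol longer than 10000 bytes, which is presumably what happens somewhere downstream when the raw serialized string is used as a name.

```r
# R caps symbol (variable) names at 10000 bytes, so a serialized string that
# exceeds this cannot be used anywhere a symbol is created from it.
long_name <- strrep("a", 10001)
tryCatch(as.name(long_name), error = function(e) conditionMessage(e))
```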
What's worse, this isn't very fast. Here is the result on realistic data for which I got the error above. ascii = TRUE seems very expensive.
library(base64enc)
library(digest)
x <- list(
"z",
as.character(1:6),
data.frame(
colour = scales::viridis_pal()(300),
value = seq(1, 6, length.out = 300)
),
"colourbar"
)
microbenchmark::microbenchmark(
digest_w_colon = digest::digest(x),
digest_wo_colon = digest(x),
digest_xxhash32 = digest::digest(x, "xxhash32"),
digest_xxhash64 = digest::digest(x, "xxhash64"),
serialize_ascii = serialize(x, NULL, TRUE),
serialize_bin = serialize(x, NULL, FALSE),
rawToChar = rawToChar(serialize(x, NULL, TRUE))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> digest_w_colon 92.271 99.5690 104.68697 101.8930 107.1105 135.390 100
#> digest_wo_colon 82.708 88.3015 99.75531 89.7285 92.5880 615.140 100
#> digest_xxhash32 69.879 78.3815 84.60217 80.7615 85.5825 146.478 100
#> digest_xxhash64 68.775 77.6865 84.72259 80.3860 85.7910 145.938 100
#> serialize_ascii 650.601 682.8550 742.17401 692.8080 729.6420 1255.246 100
#> serialize_bin 36.854 40.3000 43.25900 42.7725 44.7235 72.391 100
#> rawToChar 686.606 718.9860 763.60650 728.8495 755.1025 1278.913 100
#> cld
#> b
#> b
#> b
#> b
#> c
#> a
#> c
Created on 2020-11-02 by the reprex package (v0.3.0)
This is weird. When I looked at the digest() source code, I thought it starts by calling serialize(). Does it call the binary version? Does this work: rawToChar(serialize(x, NULL, FALSE))?
To answer my own question: No, it doesn't work, because C strings cannot contain binary zeros.
Yeah. There might be a nicer way to convert a binary into ASCII (e.g. BASE64 encoding), but I think the "variable names are limited to 10000 bytes" error means we need some hashing anyway to bound the length of the result.
Implementing a hashing algorithm doesn't seem trivial, and the digest package is already very fast; it takes only about twice as long as bare serialization (serialize(x, NULL, FALSE)). Though it would probably get a bit faster by using XXH3, the latest variant provided by xxHash (digest uses xxHash, but only provides XXH32 and XXH64), I don't think it's worth implementing another package just for this, as there's little room for speedup. Serialization might be improved by not relying on serialize(), but I have no idea at the moment.
Can we agree to drop support for R 3.2? We might eventually implement another hashing package, but I don't think it will happen in the near future. So, I think it's better to make it clear that ggplot2 cannot be installed on R 3.2, at least at the moment.
I think dropping support for 3.2 is fine for now. flange might eventually get a hashing function but I don't know the timeline for it
Thanks. Is "flange" an R package that is not yet made public? Sounds interesting.
Serialization might be improved by not relying on serialize(), but I have no idea at the moment.
In case someone is interested in this, xxhashlite serializes without serialize(), and the benchmark on its README looks much faster than digest.
As I commented on https://github.com/tidyverse/ggplot2/pull/4242#issuecomment-714070558, the digest package now requires R >= 3.3.
digest::digest() is used for hashing guides so that we can merge overlapping guides by guides_merge(). https://github.com/tidyverse/ggplot2/blob/7e51849b010b34cf07f1e4995d7e64e6e1cd5b55/R/guides-.r#L231-L244
Since the hashing we need is very simple, it's probably possible to implement another hashing function (maybe paste() with some nice separator is enough...?), but I'm not sure if it's worth it. I feel it's time to consider dropping support for R 3.2. If this sounds good, I'll prepare a pull request.
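To make the "paste() with some nice separator" idea concrete, here is a minimal sketch (guide_key is a made-up name for illustration only, not what ggplot2 does): collapse the fields that identify a guide with separator characters unlikely to appear in labels.

```r
# Illustrative sketch of a paste()-based guide key. Uses the ASCII record
# separator (0x1e) and unit separator (0x1f) as delimiters. Collisions are
# still possible if a label ever contains these bytes, which is why a real
# hash is the safer choice.
guide_key <- function(title, labels) {
  paste(title, paste(labels, collapse = "\x1f"), sep = "\x1e")
}

guide_key("z", as.character(1:6))
```

Unlike a hash, the key length grows with the input, so this would still run into the 10000-byte symbol limit for very large guides.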