tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Drop support for R 3.2 #4247

Closed yutannihilation closed 3 years ago

yutannihilation commented 4 years ago

As I commented on https://github.com/tidyverse/ggplot2/pull/4242#issuecomment-714070558, the digest package now requires R >= 3.3. digest::digest() is used for hashing guides so that guides_merge() can merge overlapping guides.

https://github.com/tidyverse/ggplot2/blob/7e51849b010b34cf07f1e4995d7e64e6e1cd5b55/R/guides-.r#L231-L244
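
For context, the pattern in that code is roughly: compute a hash of the parts of each guide that determine its identity, then merge the guides that share a hash. A minimal sketch of the idea, where guide_list and the hashed fields are placeholders rather than ggplot2's actual internals:

# Hedged sketch: hash each guide on identity-determining fields, then
# group guides by hash. `guide_list` and the fields are placeholders.
hash_of <- function(g) digest::digest(list(g$title, g$key))
hashes  <- vapply(guide_list, hash_of, character(1))
# The real code merges guides within each group; this keeps one per hash.
merged  <- unname(lapply(split(guide_list, hashes), `[[`, 1))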

Since the hashing we need is very simple, it's probably possible to implement another hashing function (maybe paste() with some nice separator is enough...?), but I'm not sure it's worth it. I feel it's time to consider dropping support for R 3.2. If this sounds good, I'll prepare a pull request.
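
For illustration, the paste() idea above might look like the sketch below; the fields and separator are assumptions, not a vetted design (it breaks if a field ever contains the separator):

# Hypothetical paste()-based key: flatten identity-determining fields with
# an unlikely separator (ASCII "unit separator") instead of hashing them.
guide_key <- function(g) {
  paste(class(g)[1], g$title, paste(unlist(g$key), collapse = "\x1f"),
        sep = "\x1f")
}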

thomasp85 commented 4 years ago

We have a commitment to support the last 4 versions of R. I think we should look into dropping digest.

clauswilke commented 4 years ago

Or alternatively, help the digest package support 3.2? It seems that a minor issue caused them to bump the required version: https://github.com/eddelbuettel/digest/issues/159

yutannihilation commented 4 years ago

Thanks. Though it doesn't seem very difficult to implement a replacement for startsWith(), I don't feel it's worth asking the maintainer to release a new version for us because, as @thomasp85 says, R 3.2 falls outside the tidyverse's "last 4 versions" rule.

clauswilke commented 4 years ago

Thomas said "dropping digest" instead of "dropping support for R 3.2". Maybe we're all a bit confused. I just looked up the past R versions and I agree we should be fine with dropping 3.2:

yutannihilation commented 4 years ago

Thomas said "dropping digest"

Oh, true... I'm definitely a bit confused, sorry.

thomasp85 commented 4 years ago

Sorry, the confusion was my doing. We are not bound by our support commitment here, since R 3.2 is already outside the last four versions. Still, I think it would be beneficial to investigate whether dropping digest is viable from a performance point of view.

clauswilke commented 4 years ago

So digest() basically calls serialize() on the object and then creates a hash value. Can we just use the serialized string?

x <- list(1, TRUE, "hello")
paste0(serialize(x, NULL, TRUE), collapse = "")
#> [1] "410a330a3236323134360a3139373838380a350a5554462d380a31390a330a31340a310a310a31300a310a310a31360a310a3236323135330a350a68656c6c6f0a"

Created on 2020-10-25 by the reprex package (v0.3.0)

One problem may be that the string can be of arbitrary length. One way around this would be to write a digestlight package that takes the string and runs it through a C++ hash function. That would be maybe 100 lines of code total, including everything needed to have a complete package.
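
For concreteness, here is a hedged sketch of what such a function could look like, prototyped inline with Rcpp (FNV-1a chosen only because it fits in a few lines; the function name is made up, and no such package exists yet):

# Hypothetical "digestlight" core: hash the serialized bytes with a tiny
# C++ FNV-1a loop and return a fixed-length hex string.
Rcpp::cppFunction('
  std::string fnv1a_hex(RawVector bytes) {
    unsigned int h = 2166136261u;            // FNV-1a offset basis (32-bit)
    for (R_xlen_t i = 0; i < bytes.size(); ++i)
      h = (h ^ bytes[i]) * 16777619u;        // xor byte, multiply by FNV prime
    static const char* hex = "0123456789abcdef";
    std::string out(8, (char) 48);           // 48 is the character "0"
    for (int j = 7; j >= 0; --j) { out[j] = hex[h & 0xF]; h >>= 4; }
    return out;
  }
')

fnv1a_hex(serialize(list(1, TRUE, "hello"), NULL))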

clauswilke commented 4 years ago

Some benchmarking to see how fast the various parts are. serialize() doesn't seem to be the bottleneck, but just paste0() on the serialized object is quite expensive.

x <- list(1, TRUE, "hello", 1:10)

microbenchmark::microbenchmark(
  digest::digest(x),
  paste0(serialize(x, NULL, TRUE), collapse = ""),
  serialize(x, NULL, TRUE)
)
#> Unit: microseconds
#>                                             expr    min     lq     mean  median
#>                                digest::digest(x) 36.277 41.181 60.45034 43.9495
#>  paste0(serialize(x, NULL, TRUE), collapse = "") 37.960 41.153 46.93491 43.1825
#>                         serialize(x, NULL, TRUE)  7.588 10.590 13.70226 11.9500
#>      uq     max neval
#>  72.332 524.315   100
#>  50.000  97.516   100
#>  15.307  37.839   100

x <- list(1, TRUE, "hello", 1:10000)

microbenchmark::microbenchmark(
  digest::digest(x),
  paste0(serialize(x, NULL, TRUE), collapse = ""),
  serialize(x, NULL, TRUE)
)
#> Unit: microseconds
#>                                             expr     min       lq      mean
#>                                digest::digest(x) 238.258 263.8265 280.24610
#>  paste0(serialize(x, NULL, TRUE), collapse = "")  37.359  40.7460  45.37433
#>                         serialize(x, NULL, TRUE)   7.605   8.5340  10.56713
#>    median       uq     max neval
#>  268.4350 283.1045 408.335   100
#>   43.0250  45.0730  90.078   100
#>   10.1015  11.7575  25.245   100

Created on 2020-10-25 by the reprex package (v0.3.0)

yutannihilation commented 4 years ago

I didn't realize hashing was this costly... Interesting. What about just using rawToChar()? Calculating a Base64 string from the non-ASCII binary seems fastest, but considering the cost of ::, there's no significant difference.

library(base64enc)
library(digest)

x <- list(1, TRUE, "hello", 1:10000)

microbenchmark::microbenchmark(
  digest_w_colon = digest::digest(x),
  digest_wo_colon = digest(x),
  paste = paste0(serialize(x, NULL, TRUE), collapse = ""),
  rawToChar = rawToChar(serialize(x, NULL, TRUE)),
  base64_w_colon = base64enc::base64encode(serialize(x, NULL, FALSE)),
  base64_wo_colon = base64encode(serialize(x, NULL, FALSE)),
  serialize_ascii = serialize(x, NULL, TRUE),
  serialize_non_ascii = serialize(x, NULL, FALSE)
)
#> Unit: microseconds
#>                 expr     min       lq      mean   median       uq     max neval
#>       digest_w_colon 192.649 206.1580 212.11704 210.9590 219.8110 247.720   100
#>      digest_wo_colon 185.873 199.6800 206.76998 203.5015 211.4220 500.502   100
#>                paste  49.019  52.3120  54.04512  53.3200  54.8050  69.588   100
#>            rawToChar  11.499  12.5710  13.71171  13.4775  14.4275  19.482   100
#>       base64_w_colon  11.293  13.5240  15.12000  14.7140  15.8705  28.796   100
#>      base64_wo_colon   5.179   6.2515   9.19496   7.3140   8.3750 183.277   100
#>      serialize_ascii  10.583  11.4225  12.76870  12.4195  13.7135  26.942   100
#>  serialize_non_ascii   3.029   3.5310   4.43354   4.1255   4.6590   8.929   100
#>    cld
#>      e
#>      e
#>     d 
#>   bc  
#>    c  
#>  ab   
#>   bc  
#>  a

Created on 2020-10-27 by the reprex package (v0.3.0)

yutannihilation commented 4 years ago

If I replace digest::digest() with rawToChar(serialize(x, NULL, TRUE)), I get this error:

Error in do.call("unit.c", lapply(ggrobs, function(g) sum(g$widths))) : 
  variable names are limited to 10000 bytes

What's worse, this isn't very fast. Here is the result on realistic data that triggers the error above. ascii = TRUE seems very expensive.

library(base64enc)
library(digest)

x <- list(
  "z",
  as.character(1:6),
  data.frame(
    colour = scales::viridis_pal()(300),
    value  = seq(1, 6, length.out = 300)
  ),
  "colourbar"
)

microbenchmark::microbenchmark(
  digest_w_colon  = digest::digest(x),
  digest_wo_colon = digest(x),
  digest_xxhash32 = digest::digest(x, "xxhash32"),
  digest_xxhash64 = digest::digest(x, "xxhash64"),
  serialize_ascii = serialize(x, NULL, TRUE),
  serialize_bin   = serialize(x, NULL, FALSE),
  rawToChar       = rawToChar(serialize(x, NULL, TRUE))
)
#> Unit: microseconds
#>             expr     min       lq      mean   median       uq      max neval
#>   digest_w_colon  92.271  99.5690 104.68697 101.8930 107.1105  135.390   100
#>  digest_wo_colon  82.708  88.3015  99.75531  89.7285  92.5880  615.140   100
#>  digest_xxhash32  69.879  78.3815  84.60217  80.7615  85.5825  146.478   100
#>  digest_xxhash64  68.775  77.6865  84.72259  80.3860  85.7910  145.938   100
#>  serialize_ascii 650.601 682.8550 742.17401 692.8080 729.6420 1255.246   100
#>    serialize_bin  36.854  40.3000  43.25900  42.7725  44.7235   72.391   100
#>        rawToChar 686.606 718.9860 763.60650 728.8495 755.1025 1278.913   100
#>  cld
#>   b 
#>   b 
#>   b 
#>   b 
#>    c
#>  a  
#>    c

Created on 2020-11-02 by the reprex package (v0.3.0)

clauswilke commented 4 years ago

This is weird. When I looked at the digest() source code, I thought it starts by calling serialize(). Does it call the binary version? Does this work: rawToChar(serialize(x, NULL, FALSE))?

clauswilke commented 4 years ago

To answer my own question: No, it doesn't work, because C strings cannot contain binary zeros.
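
This is easy to confirm: the binary serialization stream contains embedded NUL bytes, which rawToChar() rejects.

bin <- serialize(list(1), NULL, ascii = FALSE)
any(bin == as.raw(0))  # TRUE: zeros appear in the binary stream
try(rawToChar(bin))    # Error: embedded nul in string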

yutannihilation commented 4 years ago

Yeah. There might be a nicer way to convert the binary into ASCII (e.g. Base64 encoding), but the "variable names are limited to 10000 bytes" error means we need some hashing anyway to bound the length of the result.
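
To make the length point concrete: even a toy hash bounds the output. A pure-R djb2-style sketch (exact because intermediate values stay below 2^53 in doubles; for illustration only, as it is neither fast nor collision-resistant, which is the nontrivial part):

hash_toy <- function(x) {
  h <- 5381
  for (b in as.integer(serialize(x, NULL, ascii = FALSE))) {
    h <- (h * 33 + b) %% 4294967296  # reduce mod 2^32 at each step
  }
  # format as a fixed 8-hex-digit key, split to stay in integer range
  sprintf("%04x%04x", as.integer(h %/% 65536), as.integer(h %% 65536))
}

hash_toy(list(1, TRUE, "hello"))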

Implementing a hashing algorithm doesn't seem trivial, and the digest package is already very fast; it takes only about twice as long as bare serialization (serialize(x, NULL, FALSE)). Though it would probably get a bit faster by using XXH3, the latest variant of xxHash (digest uses xxHash, but only provides XXH32 and XXH64), I don't think it's worth implementing another package just for this, as there's little room for speed-up. Serialization might be improved by not relying on serialize(), but I have no idea at the moment.

yutannihilation commented 3 years ago

Can we agree to drop support for R 3.2? We might eventually implement another hashing package, but I don't think that will happen in the near future. So I think it's better to make it clear that ggplot2 cannot be installed on R 3.2, at least for the moment.

thomasp85 commented 3 years ago

I think dropping support for 3.2 is fine for now. flange might eventually get a hashing function, but I don't know the timeline for it.

yutannihilation commented 3 years ago

Thanks. Is "flange" an R package that is not yet made public? Sounds interesting.

yutannihilation commented 3 years ago

Serialization might be improved by not relying on serialize(), but I have no idea at the moment.

In case someone is interested in this: xxhashlite serializes without serialize(), and the benchmarks in its README look much faster than digest.

https://github.com/coolbutuseless/xxhashlite