r-quantities / errors

Uncertainty Propagation for R Vectors
https://r-quantities.github.io/errors

Large fluctuation in runtime when summarizing across many groups #39

Closed · vlahm closed this issue 4 years ago

vlahm commented 4 years ago

Observed on Ubuntu 18.04 and Windows 10 with R 3.6.3 and errors 0.3.4.

library(errors)

x = 1:500
errors(x) = 0.1        # attach a uniform uncertainty to every element
grp = as.character(x)  # one group per element

# time the same grouped summary ten times
for(i in 1:10){
    starttime = Sys.time()
    tapply(x, grp, mean, simplify=FALSE)
    print(Sys.time() - starttime)
}
#> Time difference of 12.73055 secs
#> Time difference of 0.1144826 secs
#> Time difference of 0.1151977 secs
#> Time difference of 0.1170812 secs
#> Time difference of 4.562075 secs
#> Time difference of 0.1262796 secs
#> Time difference of 0.1100993 secs
#> Time difference of 0.1291139 secs
#> Time difference of 8.1553147 **mins**
#> Time difference of 0.1252639 secs

The same issue arises with dplyr's group_by()/summarize() construct, and processing efficiency decreases dramatically, making e.g. averaging duplicate values (and their errors) over a 10e7-row data frame practically impossible. Note that the length of x has been reduced from 500 to 50 for the following example, yet even the fastest runtimes are slower than in the example above.

library(tibble)
library(dplyr)
library(errors)
options(dplyr.summarise.inform = FALSE)

x = 1:50
errors(x) = 0.1
grp = as.character(x)

# time the equivalent grouped summary with dplyr
for(i in 1:10){
    starttime = Sys.time()
    tibble(x=x, grp=grp) %>%
        group_by(grp) %>%
        summarize(x=mean(x))
    print(Sys.time() - starttime)
}
#> Time difference of 0.4369879 secs
#> Time difference of 0.1912885 secs
#> Time difference of 0.7183466 secs
#> Time difference of 0.974086 secs
#> Time difference of 1.209873 secs
#> Time difference of 1.32088 secs
#> Time difference of 1.421013 secs
#> Time difference of 3.124536 secs
#> Time difference of 0.5536084 secs
#> Time difference of 0.2924244 secs

Created on 2020-08-21 by the reprex package (v0.3.0)

Enchufa2 commented 4 years ago

Thanks. A decrease in efficiency within dplyr is expected. But the behaviour shown in your first example should definitely not happen. I have no idea why, but I suspect it has something to do with environment creation. I'll investigate.

Enchufa2 commented 4 years ago

It turns out finalizers are too expensive. So I made some internal performance improvements and now I see much better timings. Could you install the current version from GitHub and confirm this, please?
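For context, a minimal sketch of why per-object finalizers can dominate runtime (this is an illustration, not the package's actual internals; `make_tagged` is a hypothetical helper): every `reg.finalizer()` registration adds work that the garbage collector must perform, so a collection triggered mid-computation can stall for a long time.

```r
# Hypothetical illustration of finalizer overhead (not errors' real code):
# each registered finalizer must be run by the garbage collector, so
# creating many short-lived tagged environments makes gc pauses grow.
make_tagged <- function() {
  e <- new.env()
  reg.finalizer(e, function(x) NULL)  # even a no-op finalizer has a cost
  e
}

system.time(for (i in 1:50000) make_tagged())  # registration + gc overhead
system.time(for (i in 1:50000) new.env())      # baseline, no finalizers
```

Comparing the two timings shows the per-object cost the finalizers add on top of plain environment creation.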

vlahm commented 4 years ago

That did the trick. Thank you!

Enchufa2 commented 4 years ago

Great! I'll roll out an update to CRAN.

Enchufa2 commented 4 years ago

On CRAN now.