rmaia / pavo

tools for the analysis of color data in R
http://pavo.colrverse.com
GNU General Public License v2.0
69 stars 17 forks source link

summary.rspec rewrite #250

Closed thomased closed 1 year ago

thomased commented 1 year ago

summary.rspec rejig (fixes #165)

thomased commented 1 year ago

Current state

Slower when computing full suite of variables (as expected, owing to duplication of calcs)

library(microbenchmark)

data(sicalis)

microbenchmark(
  summary(sicalis),
  summary2.rspec(sicalis)
)
Unit: milliseconds
                    expr       min        lq      mean   median        uq       max neval cld
        summary(sicalis)  8.952917  9.063292  9.882924  9.18223  9.353521  17.08933   100  a 
 summary2.rspec(sicalis) 30.926543 31.235022 34.230939 31.66886 37.546584 117.12217   100   b
> 

Faster when computing subset (as expected, since that's all it's calculating)

microbenchmark(
  summary(sicalis, subset = 'S2'),
  summary2.rspec(sicalis, subset = 'S2')
)
Unit: microseconds
                                   expr      min        lq      mean   median        uq       max neval cld
        summary(sicalis, subset = "S2") 8895.251 9075.2505 9830.7253 9189.292 9503.1675 16579.334   100  a 
 summary2.rspec(sicalis, subset = "S2")  591.917  628.5425  659.8965  652.376  678.9175   811.125   100   b

Some further refactoring can probably bring down that difference for the full set

thomased commented 1 year ago

Alright, I think I've gone as far as I can & have the patience for at the moment. So it now only calculates the specified variable(s) (and any dependent variables), which I think is good news just from a design perspective. E.g. it shouldn't matter if you don't have the wavelength range for 'green chroma' if you just want B3.

It also means some changes to its speed. It's now slower when calculating all variables — about ~half the speed:

> microbenchmark(
+   summary(sicalis),
+   summary_old.rspec(sicalis)
+ )
Unit: milliseconds
                       expr       min        lq     mean    median        uq      max neval cld
           summary(sicalis) 20.536917 20.844168 24.35726 21.463855 23.064938 149.1463   100  a 
 summary_old.rspec(sicalis)  8.915709  9.040937 10.95369  9.219043  9.444459 120.3357   100   b

But ~10x faster when calculating only a subset:

> microbenchmark(
+   summary(sicalis, subset = 'S2'),
+   summary_old.rspec(sicalis, subset = 'S2')
+ )
Unit: microseconds
                                      expr      min        lq     mean   median        uq       max neval cld
           summary(sicalis, subset = "S2")  768.876  807.7095  848.439  838.730  866.2295  1122.376   100  a 
 summary_old.rspec(sicalis, subset = "S2") 8912.167 9070.2920 9940.778 9255.084 9542.8550 18332.750   100   b

I didn't think it'd be quite so slow for the full-variable-set use case, but I might be missing something obvious. I'm sure there's more tweaking & tightening which can be done - feel free to shout if anything obvious leaps out at you.

thomased commented 1 year ago

Absolutely — feel free to push/tweak anything. Thanks!

Bisaloo commented 1 year ago

I had another, in-depth, look and I'm now not sure more comments are needed. I've pushed a minor update but I think it already looks very good as it is :+1:.

One problematic edge case I've found while trying to add deprecation tests:

summary(sicalis, wlmin = 400)

Error in if (lambdamax < lim[2]) { : missing value where TRUE/FALSE needed In addition: Warning message: In summary.rspec(sicalis, wlmin = 400) : The 'wlmin' and 'wlmax' arguments are deprecated as of v2.9.0, and will be removed in future releases. Use the lim argument instead.

thomased commented 1 year ago

Thanks! Should be fixed with that next commit.

I'll merge this in shortly then, and start prepping a release soon too.