r-quantities / units

Measurement units for R
https://r-quantities.github.io/units
175 stars 28 forks source link

Possibly export `ud_convert` function? #266

Closed ateucher closed 3 years ago

ateucher commented 3 years ago

I'm wondering if it's possible to export the currently unexported functionud_convert?

Use case: I have a package for working with water quality data - about 15 million rows. It contains columns of data of mixed units (unit stored in a column), many of which need to be converted to another unit (stored in a different column). Because they're mixed, I have to vapply or loop over the rows, thus calling set_units(set_units(x, unitA), unitB) many times. Calling ud_convert() directly is much faster:

library(units)
#> udunits system database from /usr/local/share/udunits
library(microbenchmark)

microbenchmark(
  units:::ud_convert(1, "mg/L", "ug/L"), 
  set_units(set_units(1, "mg/L"), "ug/L")
)
#> Unit: microseconds
#>                                     expr      min        lq       mean   median
#>    units:::ud_convert(1, "mg/L", "ug/L")   26.051   41.6465   63.16936   59.328
#>  set_units(set_units(1, "mg/L"), "ug/L") 2078.241 3441.7690 4155.80035 4327.640
#>         uq      max neval
#>    76.1755   134.01   100
#>  4775.0780 10895.10   100

A real-world example with a 2000-row sample of data:

library(units)
csv_file <- "https://gist.githubusercontent.com/ateucher/ec2caf91108639b8126a1b0349bee5a3/raw/28f7d02c41e0d1436da46f7319aa6a57065cdb1c/test.csv"

x <- read.csv(csv_file, stringsAsFactors = FALSE)

# Various combinations of units:
table(x$from_unit, x$to_unit)
#>           
#>            CFU/100mL g/m2    m m3/d mg/L  t/d ueq/L ug/g ug/m3
#>   %                0    0    0    0    2    0     0    2     0
#>   E3m3/d           0    0    0   48    0    0     0    0     0
#>   kg/d             0    0    0    0    0    7     0    0     0
#>   L/min           31    0    0    0    0    0     0    0     0
#>   mg/dm2/d         0    0    0    0   10    0     0    0     0
#>   mg/kg            0    0    0    0    0    0     0  126     0
#>   mg/L             0    0    0    0    0    0    28    2     0
#>   mg/m3            0    0    0    0    0    0     0    0     1
#>   mm               0    0    8    0    0    0     0    0     0
#>   ng/L             0    0    0    0    1    0     0    0     0
#>   ug/cm2           0    5    0    0    2    0     0    0     0
#>   ug/L             0    0    0    0 1727    0     0    0     0

# wrapper function to test two ways of converting
convert <- function(value, from, to, fun) {
  stopifnot(length(value) == length(from) && length(from) == length(to))

  vapply(seq_along(value), 
         function(i) {
           tryCatch(fun(value[i], from[i], to[i]), 
                    error = function(e) NA_real_)
         }, FUN.VALUE = double(1))
}

convert_units <- function(value, from, to) {
  ret <- units::set_units(units::set_units(value, from, mode = "standard"), to, mode = "standard")
  as.numeric(ret)
}

# using set_units
system.time(
  test1 <- convert(x$val, x$from_unit, x$to_unit, convert_units)
)
#>    user  system elapsed 
#>   7.230   0.663   8.725

# using ud_convert
system.time(
  test2 <- convert(x$val, x$from_unit, x$to_unit, units:::ud_convert)
)
#>    user  system elapsed 
#>   0.192   0.008   0.211

all.equal(test1, test2)
#> [1] TRUE

Created on 2021-02-16 by the reprex package (v1.0.0)

I was also experimenting with mixed_units(), but while I could set the mixed units, I didn't see an obvious way to convert them. But it is entirely possible I'm missing something.

ateucher commented 3 years ago

Hmmm, this is possibly a solution:

library(dplyr)

convert_units2 <- function(value, from, to) {
  stopifnot(length(unique(from)) ==1)
  stopifnot(length(unique(to)) ==1)

  ret <- tryCatch(
    units::set_units(units::set_units(value, from, mode = "standard"), to, mode = "standard"), 
    error = function(e) NA_real_
  )
  as.numeric(ret)
}

system.time(
  test3 <- x %>% 
    group_by(from_unit, to_unit) %>% 
    mutate(new_val = convert_units2(val, from_unit[1], to_unit[1]))
)
#>    user  system elapsed 
#>   0.058   0.002   0.061

all.equal(test2, test3$new_val)
#> [1] TRUE

Created on 2021-02-16 by the reprex package (v1.0.0)

Enchufa2 commented 3 years ago

set_units provides a method for mixed_units, so you don't need to loop over units:

from <- c("m/s", "km/h", "mg/L", "g")
to <- c("km/h", "m/s", "g/L", "kg")
x <- mixed_units(1:4, from)
set_units(x, to, mode="standard")
#> Mixed units: g/L (1), kg (1), km/h (1), m/s (1) 
#> 3.6 [km/h], 0.5555556 [m/s], 0.003 [g/L], 0.004 [kg]
Enchufa2 commented 3 years ago

For reference, this would be the way to do it with mixed_units:

x <- read.csv(csv_file, stringsAsFactors = FALSE)
x <- x[mapply(ud_are_convertible, x$to_unit, x$from_unit), ]
x$val <- with(x, mixed_units(val, from_unit))
x$new_val <- with(x, set_units(val, to_unit, mode="standard"))

I just dropped non-convertible units. Yours is a good approach too, faster in this case because many units are repeated. Thus, we could probably take this idea of grouping into the internals of set_units.mixed_units to improve its performance.

Anyway, in answer to the original request, I don't think that exporting ud_convert is a good idea, because it entirely defeats the purpose of the package.

ateucher commented 3 years ago

This is brilliant, thanks @Enchufa2. I had missed the mixed_units method for set_units - and I do recognize that this request was outside the intent of the package. I think I'll go with the grouped approach.

ud_are_convertible is also not exported... would you consider exporting that, or a variant that fits with the package philosophy?

Thanks for the quick response

Enchufa2 commented 3 years ago

I took a look at the thread referenced above and I do think your grouped approach is the best for your use case.

ud_are_convertible is exported in the devel version.

ateucher commented 3 years ago

I think so too. Thanks so much for the thorough responses.

ateucher commented 3 years ago

@Enchufa2 sorry to pester you again on this. Do you have an idea of when the next version (with ud_are_convertible exported) will be heading to CRAN?

edzer commented 3 years ago

(ready when you are, @Enchufa2 )

Enchufa2 commented 3 years ago

I would like to give another pass to the list of issues and address some of them before going for another release.