r-quantities / units

Measurement units for R
https://r-quantities.github.io/units

Performance issue #251

Closed bevingtona closed 3 years ago

bevingtona commented 3 years ago

Really love this package, but I did not realize the bottleneck it creates in some of my sf workflows. Converting an area is >4000 times slower than in base R. Any suggestions to improve the performance? See the reprex below, converting 5000 m^2 to km^2. Thanks in advance!

library(units)

microbenchmark::microbenchmark(5000/(1000*1000),
                               set_units(as_units(5000,"m2"),"km2"))
#> Unit: nanoseconds
#>                                    expr     min      lq    mean  median      uq
#>                      5000/(1000 * 1000)     200     300     604     650     800
#>  set_units(as_units(5000, "m2"), "km2") 2494900 2540000 2729740 2753700 2788350
#>      max neval cld
#>     5600   100  a 
#>  5385700   100   b
library(units)
#> udunits system database from E:/Dropbox/R/R-3.6.3/library/units/share/udunits

m <- as_units(5000,"m2")

microbenchmark::microbenchmark(5000 /(1000*1000),
                               set_units(m,"km2"))
#> Unit: nanoseconds
#>                 expr     min      lq    mean  median      uq     max neval cld
#>   5000/(1000 * 1000)     100     200     587     300     800    7100   100  a 
#>  set_units(m, "km2") 1480300 1625600 1678135 1646850 1684550 3071600   100   b

Created on 2020-09-15 by the reprex package (v0.3.0)

Enchufa2 commented 3 years ago

Do you have a toy example of a real workflow with sf?

Enchufa2 commented 3 years ago

Part of the problem is the try() call used for character parsing:

library(units)
#> udunits system database from /usr/share/udunits

microbenchmark::microbenchmark(
  5000 /(1000*1000),
  set_units(1, "m"),
  set_units(1, m)
)
#> Unit: nanoseconds
#>                expr    min       lq      mean   median       uq    max neval
#>  5000/(1000 * 1000)    156    252.5    556.56    564.5    617.5   3697   100
#>   set_units(1, "m") 338754 369774.0 400744.32 386060.5 422012.0 836869   100
#>     set_units(1, m) 150271 164168.5 190829.26 178997.5 193420.5 783045   100
#>  cld
#>  a  
#>    c
#>   b

microbenchmark::microbenchmark(
  5000 /(1000*1000),
  set_units(as_units("m"), "km"),
  set_units(set_units(1, m), km)
)
#> Unit: nanoseconds
#>                            expr    min        lq       mean    median        uq
#>              5000/(1000 * 1000)    139     576.0    1021.26     698.5     813.5
#>  set_units(as_units("m"), "km") 975142 1036543.0 1170724.34 1134428.0 1184959.0
#>  set_units(set_units(1, m), km) 642961  742960.5  800879.32  767404.0  806938.5
#>      max neval cld
#>    15689   100 a  
#>  2537073   100   c
#>  1832808   100  b
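
A possible workaround, assuming the character-parsing cost dominates: the benchmark above already shows that the symbol form set_units(1, m) skips part of that cost, and you can go further by parsing the target unit once with as_units() and reusing it through mode = "standard", so the try-based parsing is paid once per session rather than on every call. A minimal sketch:

library(units)

m2  <- as_units("m2")                        # parse the strings once
km2 <- as_units("km2")

x <- set_units(5000, m2, mode = "standard")  # reuse the parsed objects
set_units(x, km2, mode = "standard")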
edzer commented 3 years ago

Please come up with a use case where the time difference is meaningful: what will you do with those nanoseconds? Typically you operate on vectors of units, like

library(units)
# udunits system database from /usr/share/xml/udunits
microbenchmark::microbenchmark((1:1000000)/(1000*1000),
                               set_units(as_units(1:1000000,"m2"),"km2"))
# Unit: milliseconds
#                                       expr      min       lq     mean    median
#                    (1:1e+06)/(1000 * 1000) 1.680127 2.810617  4.73138  5.241931
#  set_units(as_units(1:1e+06, "m2"), "km2") 7.370606 8.817785 11.76610 12.162606
#         uq      max neval
#   5.584564  9.74689   100
#  12.914351 31.42322   100

There is still a difference, but when will this be the bottleneck in your analysis? And then there's always drop_units.
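
For completeness, a minimal sketch of the drop_units route: strip the units before the heavy arithmetic and reattach them afterwards (dividing by 1e6 is the manual m^2-to-km^2 conversion):

library(units)

a <- set_units(1:1000000, m^2)
x <- drop_units(a)      # plain numeric vector, base-R speed
x <- x / 1e6            # manual m^2 -> km^2 conversion
set_units(x, km^2)      # reattach the target unit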

bevingtona commented 3 years ago

Thanks to both of you :)

The use case for me is that I have 1.5 million polygons and I would like to calculate the area of each, plus the area of each polygon with a 30 metre and a -30 metre buffer, and then convert each of the three areas from m^2 to km^2. I am just looking for ways to speed up the script.

Thanks for your insights

Here is an example using sf in milliseconds ;)

library(units)
#> udunits system database from E:/Dropbox/R/R-3.6.3/library/units/share/udunits
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

file_name <- system.file("shape/nc.shp", package="sf")

(nc <- st_read(file_name) %>% select(geometry) %>% 
    mutate(area_m2_with_units = st_area(.),
           area_m2_no_units = st_area(.) %>% drop_units()))
#> Reading layer `nc' from data source `E:\Dropbox\R\R-3.6.3\library\sf\shape\nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> geographic CRS: NAD27
#> Simple feature collection with 100 features and 2 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> geographic CRS: NAD27
#> First 10 features:
#>                          geometry area_m2_with_units area_m2_no_units
#> 1  MULTIPOLYGON (((-81.47276 3...   1137388604 [m^2]       1137388604
#> 2  MULTIPOLYGON (((-81.23989 3...    611077263 [m^2]        611077263
#> 3  MULTIPOLYGON (((-80.45634 3...   1423489919 [m^2]       1423489919
#> 4  MULTIPOLYGON (((-76.00897 3...    694546292 [m^2]        694546292
#> 5  MULTIPOLYGON (((-77.21767 3...   1520740530 [m^2]       1520740530
#> 6  MULTIPOLYGON (((-76.74506 3...    967727952 [m^2]        967727952
#> 7  MULTIPOLYGON (((-76.00897 3...    615942210 [m^2]        615942210
#> 8  MULTIPOLYGON (((-76.56251 3...    903650119 [m^2]        903650119
#> 9  MULTIPOLYGON (((-78.30876 3...   1179347051 [m^2]       1179347051
#> 10 MULTIPOLYGON (((-80.02567 3...   1232769242 [m^2]       1232769242

microbenchmark::microbenchmark(
  nc %>% mutate(area_km2_with_units = set_units(area_m2_with_units, "km2")),
  nc %>% mutate(area_km2_no_units = area_m2_no_units/(1000*1000)))
#> Unit: milliseconds
#>                                                                            expr
#>  nc %>% mutate(area_km2_with_units = set_units(area_m2_with_units,      "km2"))
#>               nc %>% mutate(area_km2_no_units = area_m2_no_units/(1000 * 1000))
#>     min      lq     mean median      uq    max neval cld
#>  3.0232 3.10985 3.351746 3.2059 3.34610 6.2752   100   b
#>  1.4575 1.51945 1.633360 1.5740 1.64825 4.0995   100  a

Created on 2020-09-15 by the reprex package (v0.3.0)

edzer commented 3 years ago

I would be surprised if the unit conversion is not orders of magnitude cheaper than the buffer and area calculations. Let me know if it really is significant. And if you have many unit values, put them in vectors before you convert them.
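
A sketch of that advice against the nc example above: compute the areas as one units vector and convert the whole vector with a single set_units call, so the conversion overhead is paid once per column, not once per feature. The same pattern would cover the two buffered-area columns of the 1.5-million-polygon use case.

library(sf)
library(units)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

areas_m2  <- st_area(nc)                # one units vector for all features
areas_km2 <- set_units(areas_m2, km^2)  # a single vectorised conversion

head(areas_km2)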

bevingtona commented 3 years ago

Thanks! I'll close the issue.