hansenlab / minfi

Devel repository for minfi
58 stars 67 forks source link

Switch from colMeans() to colMeans2() in preprocessNoob incurs numerical differences #153

Open PeteHaitch opened 6 years ago

PeteHaitch commented 6 years ago

Switching colMeans() to colMeans2() is tempting; it will be more efficient because we don't allocate the intermediate matrix.

https://github.com/hansenlab/minfi/blob/a790afc1a5c878d235f9e91fd604288c5411c443/R/preprocessNoob.R#L123-L124

https://github.com/hansenlab/minfi/blob/c6d40d432504703d1d5a0397bd5f9e2e690a57ae/R/preprocessNoob.R#L89-L94

However, the results from using colMeans() vs. colMeans2() are not numerically identical!

library(matrixStats)
set.seed(666)
x <- matrix(runif(1000, 100, 1000000), ncol = 10)
colMeans(x) - colMeans2(x)
#>  [1]  1.746230e-10  1.746230e-10  0.000000e+00  5.820766e-11  1.164153e-10
#>  [6]  5.820766e-11 -5.820766e-11 -2.910383e-10 -3.492460e-10 -1.746230e-10

Created on 2018-04-18 by the reprex package (v0.2.0).

Session info ``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R version 3.5.0 RC (2018-04-15 r74605) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> tz America/New_York #> date 2018-04-18 #> Packages ----------------------------------------------------------------- #> package * version date source #> backports 1.1.2 2017-12-13 CRAN (R 3.5.0) #> base * 3.5.0 2018-04-16 local #> compiler 3.5.0 2018-04-16 local #> datasets * 3.5.0 2018-04-16 local #> devtools 1.13.5 2018-02-18 CRAN (R 3.5.0) #> digest 0.6.15 2018-01-28 CRAN (R 3.5.0) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0) #> graphics * 3.5.0 2018-04-16 local #> grDevices * 3.5.0 2018-04-16 local #> htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0) #> knitr 1.20 2018-02-20 CRAN (R 3.5.0) #> magrittr 1.5 2014-11-22 CRAN (R 3.5.0) #> matrixStats * 0.53.1 2018-02-11 CRAN (R 3.5.0) #> memoise 1.1.0 2017-04-21 CRAN (R 3.5.0) #> methods * 3.5.0 2018-04-16 local #> Rcpp 0.12.16 2018-03-13 CRAN (R 3.5.0) #> rmarkdown 1.9 2018-03-01 CRAN (R 3.5.0) #> rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0) #> stats * 3.5.0 2018-04-16 local #> stringi 1.1.7 2018-03-12 CRAN (R 3.5.0) #> stringr 1.3.0 2018-02-19 CRAN (R 3.5.0) #> tools 3.5.0 2018-04-16 local #> utils * 3.5.0 2018-04-16 local #> withr 2.1.2 2018-03-15 CRAN (R 3.5.0) #> yaml 2.1.18 2018-03-08 CRAN (R 3.5.0) ```

For further details, see https://github.com/HenrikBengtsson/matrixStats/issues/96

Consequently, switching to colMeans2() will break the digest-based unit tests.

kasperdanielhansen commented 6 years ago

We should do this, and also use the indexing (the rows and cols arguments) of colMeans2() the places it can be used. Same for colSums() and the row versions. When I switched over to matrixStats a long time ago, matrixStats did not have those functions.