markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

Is `stats::dist()`'s arithmetic diverging from that of `optmatch:::mahalanobisHelper()`? #165

Closed benthestatistician closed 5 years ago

benthestatistician commented 5 years ago

... either that, or there's something wonky about my local setup, perhaps associated w/ recent upgrade to R 3.6.0. Running tests.match_on.R I see:

 expect_true(all(abs(match_on(Z ~ X1 + X2 + B, method = "euclidean") - euclid) <
                       .00001)) # there is some rounding error, but it is small
## Error: all(...) isn't true.

Investigating a little gives:

 summary(as.numeric(euclid))
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.45    1.68    2.74    2.71    3.57    5.74 
 summary( as.numeric(match_on(Z ~ X1 + X2 + B, method = "euclidean") - euclid)  )
##    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.00537 -0.00042  0.00010  0.00038  0.00129  0.00632 

(A second test failure at test.match_on.R:136 is more or less the same thing.)

No indication of this failure in CRAN's checks. Locally I have a pair of commits ( f05a004 , c5efb27 ) relaxing the tolerance so that the tests pass, but I'll hold off a bit on publishing them to the master branch here on GH, in case I determine that the problem was specific to me.

benthestatistician commented 5 years ago

Seeing the same test failure on a slightly older version of R + different platform, viz.

!> sessioninfo::session_info()
 ─ Session info ───────────────────────────────────────────────────────────────
  setting  value
  version  R version 3.4.3 (2017-11-30)
  os       Red Hat Enterprise Linux Workstation 7.6 (Maipo)
  system   x86_64, linux-gnu
  ui       X11
  language (EN)
  collate  en_US.UTF-8
  ctype    en_US.UTF-8
  tz       America/New_York
  date     2019-05-28

along w/ Rcpp version 1.0.1 .

josherrickson commented 5 years ago

I can't replicate this.

> summary(as.numeric(match_on(Z ~ X1 + X2 + B, method = "euclidean")))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5724  2.1063  3.2750  3.3909  4.3627  7.0814 
> summary(as.numeric(euclid))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5724  2.1063  3.2750  3.3909  4.3627  7.0814 
> summary(as.numeric(match_on(Z ~ X1 + X2 + B, method = "euclidean") - euclid))
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-4.912e-07 -1.306e-08  3.954e-08  1.286e-07  2.650e-07  7.197e-07 

This is all of version on CRAN, on master, and on issue54-hinting branch. R 3.6.0.

benthestatistician commented 5 years ago

Odd. Something unusual about my setup? I see it on my desktop Mac, sessioninfo::session_info() ─ Session info ─────────────────────────────────────────────────────────────── setting value version R version 3.6.0 (2019-04-26) os macOS Mojave 10.14.5 system x86_64, darwin15.6.0 ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/Detroit date 2019-06-05

and also on my lunix box:

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────

setting value

version R version 3.4.3 (2017-11-30)

os Red Hat Enterprise Linux Workstation 7.6 (Maipo)

system x86_64, linux-gnu

ui X11

language (EN)

collate en_US.UTF-8

ctype en_US.UTF-8

tz America/New_York

date 2019-06-05

josherrickson commented 5 years ago

Do you have a package overwriting dist? Do toy examples work?

d <- data.frame(z = c(0,1), x = rnorm(2))
library(optmatch)
all.equal(as.numeric(match_on(z ~ x, data = d, method = "euclidean")), as.numeric(dist(d$x)))
benthestatistician commented 5 years ago

No and yes:

getAnywhere('dist')$where
[1] "package:stats"   "namespace:stats"
d <- data.frame(z = c(0,1), x = rnorm(2))
library(optmatch)
all.equal(as.numeric(match_on(z ~ x, data = d, method = "euclidean")), as.numeric(dist(d$x)))
## [1] TRUE
josherrickson commented 5 years ago

I can confirm that I get the expected results on peirce.

benthestatistician commented 5 years ago

Josh made the suggestion that I try this after starting w/ R --vanilla. Findings:

  1. On both peirce (Linux box running R 3.4.3) and blackwell2 (Mac running R 3.6.0), this test passes after invoking plain vanilla R.
  2. On either machine the test does not pass after invoking R in the ordinary way, w/o --vanilla.
  3. With or without the --vanilla option, sessioninfo::session_info() returns precisely the same things, on either machine. Mac version:
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.5        
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Detroit             
 date     2019-06-12                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version  date       lib source        
 abind         1.4-5    2016-07-21 [1] CRAN (R 3.6.0)
 assertthat    0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
 cli           1.1.0    2019-03-19 [1] CRAN (R 3.6.0)
 crayon        1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
 digest        0.6.18   2018-10-10 [1] CRAN (R 3.6.0)
 lattice       0.20-38  2018-11-04 [1] CRAN (R 3.6.0)
 magrittr      1.5      2014-11-22 [1] CRAN (R 3.6.0)
 Matrix        1.2-17   2019-03-22 [1] CRAN (R 3.6.0)
 optmatch    * 0.9-10   2018-07-12 [1] CRAN (R 3.6.0)
 R6            2.4.0    2019-02-14 [1] CRAN (R 3.6.0)
 Rcpp          1.0.1    2019-03-17 [1] CRAN (R 3.6.0)
 RItools       0.1-16   2018-06-19 [1] CRAN (R 3.6.0)
 rlang         0.3.4    2019-04-07 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
 SparseM       1.77     2017-04-23 [1] CRAN (R 3.6.0)
 survival    * 2.44-1.1 2019-04-01 [1] CRAN (R 3.6.0)
 svd           0.4.3    2019-04-05 [1] CRAN (R 3.6.0)
 testthat    * 2.1.1    2019-04-23 [1] CRAN (R 3.6.0)
 withr         2.1.2    2018-03-15 [1] CRAN (R 3.6.0)
 xtable        1.8-4    2019-04-21 [1] CRAN (R 3.6.0)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

(I'm not sure what this clue is telling us, but it seems a very good one. Thanks for the tip Josh!)

josherrickson commented 5 years ago

The next step I think would be to look into your .Rprofile.

benthestatistician commented 5 years ago

not much of interest in my .Rprofile, as it happens:

## Print this on start so I know it's loaded.
cat("Loading $HOME/.Rprofile ...\n")
.First <- function() {
  options(
    repos = c(CRAN = "https://cran.case.edu/"),
    digits=3L,
    deparse.max.lines = 2L)
}

...and that's it.

josherrickson commented 5 years ago

Do you save your environment and reload it when starting R? Perhaps move to a directory where you've never launched R from and start it, see if the problem persists.

It looks like --vanilla sets 4 things: --no-site-file, --no-init-file, --no-environ and --no-restore. Try each of those and see which one(s) fix it to narrow down the issue.

benthestatistician commented 5 years ago

The problem comes from my init file, specifically from the part of my .First() def setting the global "digits" option to "3L":

.First <- function() {
  options(
    repos = c(CRAN = "https://cran.case.edu/"),
    digits=3L,
    deparse.max.lines = 2L)
}

Surprisingly (to me), the digits option affects the calculations of either of stats::dist() and optmatch::match_on():


> all.equal(euclid_digits_set_to_3L, euclid_digits_not_set)

[1] "Mean relative difference: 0.676"

> all.equal(matchon_digits_set_to_3L, matchon_digits_not_set)

[1] "Mean relative difference: 0.518"

(When I put that line in my .Rprofile, I figured printing was the only thing that would be affected.)

No appetite for pinning it down any further, so I'll close out the issue after adjusting the tests a bit to avoid tripping this during development. (Unless you all complain, in which case I'll consider adjusting my .Rprofile instead.)

Thanks for another very nice troubleshooting tip, Josh!