markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

match_on.formula() problem w/ a single treatment group member #168

Closed benthestatistician closed 5 years ago

benthestatistician commented 5 years ago

Example:

 scores  <- 1:4
 z  <- c(1, rep(0,3))
 names(z)  <- names(scores)  <- letters[1:4]
(dat <- data.frame(z, scores))
##   z scores
## a 1      1
## b 0      2
## c 0      3
## d 0      4
match_on(z~ scores, data=dat)
## An object of class "DenseMatrix"
##         control
##treatment  b  c  d
##        a NA NA NA
##Slot "call":
## match_on(x = z ~ scores, data = dat)

This problem seems not to arise for match_on.numeric(), nor for match_on.formula() in cases with multiple treatment group members.

josherrickson commented 5 years ago

This is occurring because of this line inside compute_mahalanobis evaluating to NA due to sum(z) = 1.

https://github.com/markmfredrickson/optmatch/blob/4c998e30605bddee1b5604f9298be140e44579e1/R/match_on.R#L449

benthestatistician commented 5 years ago
  1. In the event that sum(z)==1, we could just do mt <- 0. Similarly if sum(!z)==1, do mc <- 0.
  2. (I suppose it's the cov(data[z, ,drop=FALSE]) that's evaluating to NA, with sum(z)-1 evaluating to 0.)
  3. This is one of those little issues that leads back to a bigger issue. Mahalanobis distance has to calculate a covariance using whatever data has been provided. In the use case bringing this bug to light, the user has chopped his data set into piece A & piece B before providing the two of them to match_on(). Given how we've got things set up, the Mahalanobis distances will have to be standardized against covariance matrix A and covariance matrix B, whereas one would have wanted to use a common covariance matrix for either calculation. I.e., one would have wanted to be able to provide an optional cov argument to match_on(), to be passed down to makedist() and then to compute_mahalanobis(). Alternatively, users expecting to be able to rbind()/cbind() together different match_on() results might be expected to provide common data= arguments in each of the calls to match_on(), specifying his different data "slices" using a subset= argument to match_on() (which does not currently exist). In this alternative the covariance would have to be calculated prior to filtering based on subset=, so I suppose the makedist() and compute_mahalanobis() function signatures would still have to be changed.

I propose to have that bigger issue be its own issue, one that we don't take on as a part of this maintenance release project. (I sense a connection to the broader topic of whether match_on() should be more insistent about receiving a data argument; cf. #134 .)