markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

factor misunderstood to be a numeric in match_on.formula #220

Closed benthestatistician closed 2 years ago

benthestatistician commented 2 years ago

@ngreifer reports:

When a factor variable is supplied as a covariate to match_on(), it seems to be converted to a numeric variable before the Mahalanobis distance is computed. This seems to be undesirable behavior unless the variable is an ordered factor, but it happens with unordered factors, too. Using the compute_smahal() function from your tests:

data("lalonde", package = "MatchIt")

#Using factor race
form1 <- treat ~ age + race
X1 <- model.matrix(form1, data = lalonde)[,-1]

d1a <- compute_smahal(lalonde$treat, X1)
d1b <- as.matrix(optmatch::match_on(form1, data = lalonde,
                                    method = "rank_mahalanobis"))
all.equal(d1a, d1b,
          check.attributes = FALSE)
#> [1] "Mean relative difference: 0.03273117"

#Using numeric race
form2 <- treat ~ age + as.numeric(race)
X2 <- model.matrix(form2, data = lalonde)[,-1]

d2a <- compute_smahal(lalonde$treat, X2)
d2b <- as.matrix(optmatch::match_on(form2, data = lalonde,
                                    method = "rank_mahalanobis"))
all.equal(d2a, d2b,
          check.attributes = FALSE)
#> [1] TRUE

Created on 2022-05-16 by the reprex package (v2.0.1)

Originally posted by @ngreifer in https://github.com/markmfredrickson/optmatch/issues/218#issuecomment-1128151151

josherrickson commented 2 years ago

Thanks again @ngreifer, for the report. While I agree something here appears wacky, I don't think this is as straightforward as factor being converted to numeric. From your example,

> d1a[1:3, 1:3]
       186      187      188
1 1.938646 2.132027 2.216650
2 2.297230 2.148774 2.121315
3 1.860987 1.969417 2.030250
> d1b[1:3, 1:3]
         control
treatment    PSID1    PSID2    PSID3
     NSW1 2.023665 2.211544 2.293857
     NSW2 2.186393 2.023600 1.992289
     NSW3 1.947646 2.053576 2.112662
> d2a[1:3, 1:3]
       186      187      188
1 2.021065 2.209886 2.292463
2 1.453701 1.171462 1.108072
3 1.944184 2.051181 2.110593
> d2b[1:3, 1:3]
         control
treatment    PSID1    PSID2    PSID3
     NSW1 2.021065 2.209886 2.292463
     NSW2 1.453701 1.171462 1.108072
     NSW3 1.944184 2.051181 2.110593

If we were simply casting race to numeric, I would expect the d1b results to be the same as d2a and d2b.

I looked into it a bit. One oddity is that at some point it appears race gets treated as ordinal, which it isn't.

I'll work further on this when I get a chance.

josherrickson commented 2 years ago

We can get identical results by passing the dummy variables manually:

> data("lalonde", package = "MatchIt")
> lalonde$racehisp <- lalonde$race == "hispan"
> lalonde$racewhite <- lalonde$race == "white"
> form1 <- treat ~ age + race
> X1 <- model.matrix(form1, data = lalonde)[,-1]
> d1a <- compute_smahal(lalonde$treat, X1)
> d1a[1:3, 1:3]
       186      187      188
1 1.938646 2.132027 2.216650
2 2.297230 2.148774 2.121315
3 1.860987 1.969417 2.030250
> d1b <- match_on(treat ~ age + racehisp + racewhite, data = lalonde, method = "rank_mahalanobis")
> d1b[1:3, 1:3]
         control
treatment    PSID1    PSID2    PSID3
     NSW1 1.938646 2.132027 2.216650
     NSW2 2.297230 2.148774 2.121315
     NSW3 1.860987 1.969417 2.030250

This stems from a decision made in #80, by passing factor variables to contr.poly we obtain possibly unexpected results.

Should we be doing this contrast coding on non-euclidean distances? Or should this be documented? Or an option?

benthestatistician commented 2 years ago

My take-aways -- pls correct if I'm mistaken --

  1. Turns out not to be the case that factors are unexpectedly being converted to numerics.
  2. It appears not to be the case that they're being converted to ordinals, either.
  3. It seemed that way because (regular) factors are being expanded into numeric features using a variant of contr.poly, which would ordinarily be applied (by model.matrix.default()) only to ordinals.
  4. This in turn is because at the time @markmfredrickson and I were working on #80, contr.poly() seemed to give results that better matched what the user would likely have wanted than did the stock contrasts function for unordered factors, contr.treatment(). In particular, per Mark's comment at the top of that issue, with contr.treatment() it was not the case that you get 0's for pairs of observations in the same category, but positive values otherwise.
  5. The computations struck @ngreifer as mistaken because they didn't match what he got by expanding the race factor in the usual model.matrix() way. But it turns out we weren't trying to match those results anyway -- they work poorly in general for matching distances, as Mark noted at the top of #80.

If there's a problem here, it would have to be one or both of:

a. the contr.poly() variant we're using gives answers that are inappropriate for factors in some other way (that the inappropriateness we fixed in #80). b. the contr.poly() variant we're using is poorly documented and/or confusing.

I'm willing to believe either, both or neither, but based on what's here currently I think it's probably (b) (and (b) only). I don't recall clearly what contr.poly() is doing, if I ever understood it; I'd be quite willing to believe that there's a better and simpler solution that we didn't explore at the time only because we had restricted ourselves to the contr.* functions already available in base R.

josherrickson commented 2 years ago

All five points are correct, Ben. I think we're definitely (b), and possibly (a). I see nothing in the documentation about factor variables being treated in any particular way; it seems to me that @ngreifer's assumption of how it would work is the most reasonable assumption for someone looking at the documentation to make. Mostly since that's the way factor variables are treated in formulas in almost all other situations in R.

I see our possible recourse as some combination of:

  1. Update documentation (in both match_on and fullmatch) to clarify how factor variables are treated (and perhaps refer users to exactMatch and antiExactMatch).
  2. A warning or message if users include a factor variable in match_on.formula pointing users to updated documentation.
  3. An additional argument to match_on to define handling of factor variables. (Specifically thinking of passing one of the contr.* functions or a function which mimics the contr.* signature.)

1 & 2 are easy fixes, perhaps we can do those and punt 3 down the road a bit until we can clarify how we want the defaults to behave.

ngreifer commented 2 years ago

A fix I use in matchit() is the following, which aims to do the same thing as contr.match_on():

model.matrix(formula, data = mf,
             contrasts.arg = lapply(Filter(is.factor, mf),
                                    function(x) contrasts(x, contrasts = FALSE)/sqrt(2)))

The has the effect of keeping all levels of a factor (i.e., not dropping the first level), and performing the addition 1/sqrt(2) scaling that is necessary for the contribution to the Euclidean distance between two units with different factor levels to be 1. I think this accomplishes what contr.match_on() aims to do but it might be a little more transparent since it is a bit unclear why you would want a polynomial contrast for a nominal factor. I also make sure all character variables are recoded as factors so they are correctly picked up by the Filter().

That said, I don't think this should affect results for Mahalanobis or robust Mahalanobis, because both methods are invariant to multiplicative scaling of any of the variables.

benthestatistician commented 2 years ago

Noah's simple solution definitely has appeal. I take it we're welcome to "borrow" it? Either way, thanks much for suggestion!

ngreifer commented 2 years ago

You are welcome to take, no need for attribution either.

josherrickson commented 2 years ago

Pushed up a fix. Passed existing tests, and added a new test explicitly for this scenario. Thanks @ngreifer.

benthestatistician commented 2 years ago

Thanks very much, Josh and Noah!

Re attribution, I didn't see a good place to note the contribution on a man page, so I was going to suggest an acknowledgment to Noah in the package NEWS. But then I saw Josh had beat me to that -- thanks, Josh. Anyway Noah, if you keep racking up great suggestions to us then we'll be forced to add you as a ctb. You've been warned.

benthestatistician commented 1 year ago

@ngreifer is it the case that you have additional suggestions for us with regard to factor handling, over and above those you conveyed last year (as discussed in this issue)? We incorporated one suggestion of yours into our match_on.formula(), here; if we missed something we're welcome to re-open the issue. Also welcome to additional, new suggestions on the same theme. I say this after @josherrickson shared an email comment from you as follows:

The MatchIt and optmatch computations for the Mahalanobis are exactly the same except for how factors are handled. I wanted for MatchIt to produce the same distance matrix whether the factor is supplied, the full set of dummies is supplied, or all but one category are supplied as dummies. In optmatch, it looks like you get slightly different results if the full set of dummies is included vs the factor variable itself, but quite different results if you include all levels vs. all but one level, and different results if you include the factor vs. all but one level.

I had thought that taking your earlier suggestion would bring the two packages into alignment on points like these, but perhaps I was mistaken?

ngreifer commented 1 year ago

Hi @benthestatistician, I believe our previous discussion was just about the Euclidean distance and how factor dummies were extracted (i.e., with an additional scaling factor). This discussion I had with Josh is about computing the Mahalanobis distance when categorical variables are involved, which doesn't involve the scaling we discussed above because the Mahalanobis distance is invariant to scaling. This issue is about computing and inverting the covariance matrix of the dataset once factor variables have been split into dummies.

My observation was that several ways of including a factor variable in the Mahalanobis distance formula (manually including all dummies, including all but one dummy, or supplying a factor that is split internally by match_on()) do not give the same results even though it might seem they should, given that all parameterizations contain the same information and therefore should yield the same distances between units.

We discovered this because I was having an issue where the non-full rank covariance matrix that included all dummies was being inappropriately inverted on M1 Macs, which caused a problem when the Mahalanobis distance computed was passed to fullmatch(). The reason optmatch doesn't have this problem is that factor variables don't seem to be fully expanded, so the covariance matrix is full rank and no problem occurs. My usual solution in MatchIt is to compute a generalized inverse, but this wasn't being triggered because the trigger was if there was an error in computing the usual inverse, which was not occurring on M1 Macs. I'm still investigating this problem.

Here is a demonstration of the inconsistency I'm seeing in optmatch:

library(MatchIt); library(optmatch)
data("lalonde", package = "MatchIt")
lalonde <- cobalt::splitfactor(lalonde, drop.first = FALSE,
                               replace = FALSE)

head(lalonde)
#>      treat age educ   race married nodegree re74 re75       re78 race_black
#> NSW1     1  37   11  black       1        1    0    0  9930.0460          1
#> NSW2     1  22    9 hispan       0        1    0    0  3595.8940          0
#> NSW3     1  30   12  black       0        0    0    0 24909.4500          1
#> NSW4     1  27   11  black       0        1    0    0  7506.1460          1
#> NSW5     1  33    8  black       0        1    0    0   289.7899          1
#> NSW6     1  22    9  black       0        1    0    0  4056.4940          1
#>      race_hispan race_white
#> NSW1           0          0
#> NSW2           1          0
#> NSW3           0          0
#> NSW4           0          0
#> NSW5           0          0
#> NSW6           0          0

form1 <- treat ~ age + educ + married + nodegree + re74 + re75 +
  race
form2 <- treat ~ age + educ + married + nodegree + re74 + re75 +
  race_black + race_hispan + race_white
form3 <- treat ~ age + educ + married + nodegree + re74 + re75 +
  race_black + race_hispan

# Using MatchIt:
m1 <- mahalanobis_dist(form1, lalonde)
m2 <- mahalanobis_dist(form2, lalonde)
m3 <- mahalanobis_dist(form3, lalonde)

all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE

# Using optmatch
o1 <- as.matrix(match_on(form1, data = lalonde))
o2 <- as.matrix(match_on(form2, data = lalonde))
o3 <- as.matrix(match_on(form3, data = lalonde))

all.equal(o1, o2, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.000753521"
all.equal(o1, o3, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.6562025"

all.equal(m1, o1, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.3962091"
all.equal(m2, o2, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.3962472"
all.equal(m3, o3, check.attributes = FALSE)
#> [1] TRUE

I don't know why o1 and o2 would differ at all, though the difference is slight. Perhaps this is related to numerical imprecision due to the scaling we discussed earlier in this issue. More concerning to me is the difference between o1 and o3 and the difference between o1 and m1.

benthestatistician commented 1 year ago

Thanks for the informative post, @ngreifer .

Optmatch does its factor expansion the same way for Euclidean and for Mahalanobis distance. In the manner that we took matchIt to be doing its factor expansion, following this issue. That was like so:

model.matrix(x, mf,
                              contrasts.arg =
                                lapply(Filter(is.factor, mf),
                                       function(x) {
                                         stats::contrasts(x, contrasts = FALSE)/sqrt(2)
                                       }))

I take it from what you're saying now that matchIt may be doing this when asked for a Euclidean distance, but it does something different for Mahalanobis? I'm not sure I'd want to introduce a branching step here, but it would be useful to know if this is indeed where our answers diverge.

(BTW we to first try to solve() a covariance matrix, then revert to a generalized inverse if that can't be done. The covariance we try to solve is a pooled covariance not the raw covariance. This stuff happens subsequent to factor expansion as above, inside of compute_mahanalobis().)

ngreifer commented 1 year ago

Everything you described is consistent with how MatchIt does it. I just wanted to alert you to a discrepancy that you might want to look into. I'm not sure what the cause of it is.

benthestatistician commented 1 year ago

Thanks, Noah! Ben

benthestatistician commented 1 year ago

I briefly looked into Noah's helpful example. Something that strikes me about it is that the model matrix based on Lalonde + form2 is much more singular than I had expected, with less than half of the singular values numerically nonzero:

table(nz)
#> nz
#> FALSE  TRUE 
#>    5     4 

This also happens if the covariance is calculated without pooling.

On the other hand, if we generalized-invert the correlation instead of the covariance matrix, we do get the expected number of numerically nonzero singular values (all but one).

corr_u  <- cor(data)
s_corr_u  <- svd(corr_u)
nz_corr_u <- (s_corr_u$d > sqrt(.Machine$double.eps) * s_corr_u$d[1])
table(nz_corr_u)
#> nz_corr_u
#> FALSE  TRUE 
#>     1     8 
##' Starting from the pooled covariance:
corr_p  <- cv / outer(sqrt(diag(cv)), sqrt(diag(cv)))
s_corr_p  <- svd(corr_p)
nz_corr_p <- (s_corr_p$d > sqrt(.Machine$double.eps) * s_corr_p$d[1])
table(nz_corr_p)
#> nz_corr_p
#> FALSE  TRUE 
#>     1     8 

So I conjecture our Mahalanobis results would be more like his in this case, and more in line with the general user expectations he's modeling, if we standardized the columns of the data matrix before calculating the covariance and its (generalized) inverse.

To consider: if we go this route, should the standardization use means and SDs or robust alternatives, such as trimmed means and MADs? #134

ngreifer commented 1 year ago

That's a good observation. I think the extreme difference in the scales of the predictors can cause numerical problems like these.