Closed benthestatistician closed 2 years ago
Thanks again @ngreifer, for the report. While I agree something here appears wacky, I don't think this is as straightforward as factor being converted to numeric. From your example,
> d1a[1:3, 1:3]
186 187 188
1 1.938646 2.132027 2.216650
2 2.297230 2.148774 2.121315
3 1.860987 1.969417 2.030250
> d1b[1:3, 1:3]
control
treatment PSID1 PSID2 PSID3
NSW1 2.023665 2.211544 2.293857
NSW2 2.186393 2.023600 1.992289
NSW3 1.947646 2.053576 2.112662
> d2a[1:3, 1:3]
186 187 188
1 2.021065 2.209886 2.292463
2 1.453701 1.171462 1.108072
3 1.944184 2.051181 2.110593
> d2b[1:3, 1:3]
control
treatment PSID1 PSID2 PSID3
NSW1 2.021065 2.209886 2.292463
NSW2 1.453701 1.171462 1.108072
NSW3 1.944184 2.051181 2.110593
If we were simply casting race
to numeric, I would expect the d1b
results to be the same as d2a
and d2b
.
I looked into it a bit. One oddity is that at some point it appears race
gets treated as ordinal, which it isn't.
I'll work further on this when I get a chance.
We can get identical results by passing the dummy variables manually:
> data("lalonde", package = "MatchIt")
> lalonde$racehisp <- lalonde$race == "hispan"
> lalonde$racewhite <- lalonde$race == "white"
> form1 <- treat ~ age + race
> X1 <- model.matrix(form1, data = lalonde)[,-1]
> d1a <- compute_smahal(lalonde$treat, X1)
> d1a[1:3, 1:3]
186 187 188
1 1.938646 2.132027 2.216650
2 2.297230 2.148774 2.121315
3 1.860987 1.969417 2.030250
> d1b <- match_on(treat ~ age + racehisp + racewhite, data = lalonde, method = "rank_mahalanobis")
> d1b[1:3, 1:3]
control
treatment PSID1 PSID2 PSID3
NSW1 1.938646 2.132027 2.216650
NSW2 2.297230 2.148774 2.121315
NSW3 1.860987 1.969417 2.030250
This stems from a decision made in #80, by passing factor variables to contr.poly
we obtain possibly unexpected results.
Should we be doing this contrast coding on non-euclidean distances? Or should this be documented? Or an option?
My take-aways -- pls correct if I'm mistaken --
contr.poly
, which would ordinarily be applied (by model.matrix.default()
) only to ordinals. contr.poly()
seemed to give results that better matched what the user would likely have wanted than did the stock contrasts function for unordered factors, contr.treatment()
. In particular, per Mark's comment at the top of that issue, with contr.treatment()
it was not the case that you get 0's for pairs of observations in the same category, but positive values otherwise.race
factor in the usual model.matrix()
way. But it turns out we weren't trying to match those results anyway -- they work poorly in general for matching distances, as Mark noted at the top of #80.If there's a problem here, it would have to be one or both of:
a. the contr.poly()
variant we're using gives answers that are inappropriate for factors in some other way (that the inappropriateness we fixed in #80).
b. the contr.poly()
variant we're using is poorly documented and/or confusing.
I'm willing to believe either, both or neither, but based on what's here currently I think it's probably (b) (and (b) only). I don't recall clearly what contr.poly()
is doing, if I ever understood it; I'd be quite willing to believe that there's a better and simpler solution that we didn't explore at the time only because we had restricted ourselves to the contr.*
functions already available in base R.
All five points are correct, Ben. I think we're definitely (b), and possibly (a). I see nothing in the documentation about factor variables being treated in any particular way; it seems to me that @ngreifer's assumption of how it would work is the most reasonable assumption for someone looking at the documentation to make. Mostly since that's the way factor variables are treated in formulas in almost all other situations in R.
I see our possible recourse as some combination of:
match_on
and fullmatch
) to clarify how factor variables are treated (and perhaps refer users to exactMatch
and antiExactMatch
).match_on.formula
pointing users to updated documentation.match_on
to define handling of factor variables. (Specifically thinking of passing one of the contr.*
functions or a function which mimics the contr.*
signature.)1 & 2 are easy fixes, perhaps we can do those and punt 3 down the road a bit until we can clarify how we want the defaults to behave.
A fix I use in matchit()
is the following, which aims to do the same thing as contr.match_on()
:
model.matrix(formula, data = mf,
contrasts.arg = lapply(Filter(is.factor, mf),
function(x) contrasts(x, contrasts = FALSE)/sqrt(2)))
The has the effect of keeping all levels of a factor (i.e., not dropping the first level), and performing the addition 1/sqrt(2)
scaling that is necessary for the contribution to the Euclidean distance between two units with different factor levels to be 1. I think this accomplishes what contr.match_on()
aims to do but it might be a little more transparent since it is a bit unclear why you would want a polynomial contrast for a nominal factor. I also make sure all character variables are recoded as factors so they are correctly picked up by the Filter()
.
That said, I don't think this should affect results for Mahalanobis or robust Mahalanobis, because both methods are invariant to multiplicative scaling of any of the variables.
Noah's simple solution definitely has appeal. I take it we're welcome to "borrow" it? Either way, thanks much for suggestion!
You are welcome to take, no need for attribution either.
Pushed up a fix. Passed existing tests, and added a new test explicitly for this scenario. Thanks @ngreifer.
Thanks very much, Josh and Noah!
Re attribution, I didn't see a good place to note the contribution on a man page, so I was going to suggest an acknowledgment to Noah in the package NEWS. But then I saw Josh had beat me to that -- thanks, Josh. Anyway Noah, if you keep racking up great suggestions to us then we'll be forced to add you as a ctb
. You've been warned.
@ngreifer is it the case that you have additional suggestions for us with regard to factor handling, over and above those you conveyed last year (as discussed in this issue)? We incorporated one suggestion of yours into our match_on.formula()
, here; if we missed something we're welcome to re-open the issue. Also welcome to additional, new suggestions on the same theme. I say this after @josherrickson shared an email comment from you as follows:
The MatchIt and optmatch computations for the Mahalanobis are exactly the same except for how factors are handled. I wanted for MatchIt to produce the same distance matrix whether the factor is supplied, the full set of dummies is supplied, or all but one category are supplied as dummies. In optmatch, it looks like you get slightly different results if the full set of dummies is included vs the factor variable itself, but quite different results if you include all levels vs. all but one level, and different results if you include the factor vs. all but one level.
I had thought that taking your earlier suggestion would bring the two packages into alignment on points like these, but perhaps I was mistaken?
Hi @benthestatistician, I believe our previous discussion was just about the Euclidean distance and how factor dummies were extracted (i.e., with an additional scaling factor). This discussion I had with Josh is about computing the Mahalanobis distance when categorical variables are involved, which doesn't involve the scaling we discussed above because the Mahalanobis distance is invariant to scaling. This issue is about computing and inverting the covariance matrix of the dataset once factor variables have been split into dummies.
My observation was that several ways of including a factor variable in the Mahalanobis distance formula (manually including all dummies, including all but one dummy, or supplying a factor that is split internally by match_on()
) do not give the same results even though it might seem they should, given that all parameterizations contain the same information and therefore should yield the same distances between units.
We discovered this because I was having an issue where the non-full rank covariance matrix that included all dummies was being inappropriately inverted on M1 Macs, which caused a problem when the Mahalanobis distance computed was passed to fullmatch()
. The reason optmatch
doesn't have this problem is that factor variables don't seem to be fully expanded, so the covariance matrix is full rank and no problem occurs. My usual solution in MatchIt
is to compute a generalized inverse, but this wasn't being triggered because the trigger was if there was an error in computing the usual inverse, which was not occurring on M1 Macs. I'm still investigating this problem.
Here is a demonstration of the inconsistency I'm seeing in optmatch
:
library(MatchIt); library(optmatch)
data("lalonde", package = "MatchIt")
lalonde <- cobalt::splitfactor(lalonde, drop.first = FALSE,
replace = FALSE)
head(lalonde)
#> treat age educ race married nodegree re74 re75 re78 race_black
#> NSW1 1 37 11 black 1 1 0 0 9930.0460 1
#> NSW2 1 22 9 hispan 0 1 0 0 3595.8940 0
#> NSW3 1 30 12 black 0 0 0 0 24909.4500 1
#> NSW4 1 27 11 black 0 1 0 0 7506.1460 1
#> NSW5 1 33 8 black 0 1 0 0 289.7899 1
#> NSW6 1 22 9 black 0 1 0 0 4056.4940 1
#> race_hispan race_white
#> NSW1 0 0
#> NSW2 1 0
#> NSW3 0 0
#> NSW4 0 0
#> NSW5 0 0
#> NSW6 0 0
form1 <- treat ~ age + educ + married + nodegree + re74 + re75 +
race
form2 <- treat ~ age + educ + married + nodegree + re74 + re75 +
race_black + race_hispan + race_white
form3 <- treat ~ age + educ + married + nodegree + re74 + re75 +
race_black + race_hispan
# Using MatchIt:
m1 <- mahalanobis_dist(form1, lalonde)
m2 <- mahalanobis_dist(form2, lalonde)
m3 <- mahalanobis_dist(form3, lalonde)
all.equal(m1, m2)
#> [1] TRUE
all.equal(m1, m3)
#> [1] TRUE
# Using optmatch
o1 <- as.matrix(match_on(form1, data = lalonde))
o2 <- as.matrix(match_on(form2, data = lalonde))
o3 <- as.matrix(match_on(form3, data = lalonde))
all.equal(o1, o2, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.000753521"
all.equal(o1, o3, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.6562025"
all.equal(m1, o1, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.3962091"
all.equal(m2, o2, check.attributes = FALSE)
#> [1] "Mean relative difference: 0.3962472"
all.equal(m3, o3, check.attributes = FALSE)
#> [1] TRUE
I don't know why o1
and o2
would differ at all, though the difference is slight. Perhaps this is related to numerical imprecision due to the scaling we discussed earlier in this issue. More concerning to me is the difference between o1
and o3
and the difference between o1
and m1
.
Thanks for the informative post, @ngreifer .
Optmatch does its factor expansion the same way for Euclidean and for Mahalanobis distance. In the manner that we took matchIt to be doing its factor expansion, following this issue. That was like so:
model.matrix(x, mf,
contrasts.arg =
lapply(Filter(is.factor, mf),
function(x) {
stats::contrasts(x, contrasts = FALSE)/sqrt(2)
}))
I take it from what you're saying now that matchIt may be doing this when asked for a Euclidean distance, but it does something different for Mahalanobis? I'm not sure I'd want to introduce a branching step here, but it would be useful to know if this is indeed where our answers diverge.
(BTW we to first try to solve()
a covariance matrix, then revert to a generalized inverse if that can't be done. The covariance we try to solve is a pooled covariance not the raw covariance. This stuff happens subsequent to factor expansion as above, inside of compute_mahanalobis()
.)
Everything you described is consistent with how MatchIt
does it. I just wanted to alert you to a discrepancy that you might want to look into. I'm not sure what the cause of it is.
Thanks, Noah! Ben
I briefly looked into Noah's helpful example. Something that strikes me about it is that the model matrix based on Lalonde + form2
is much more singular than I had expected, with less than half of the singular values numerically nonzero:
table(nz)
#> nz
#> FALSE TRUE
#> 5 4
This also happens if the covariance is calculated without pooling.
On the other hand, if we generalized-invert the correlation instead of the covariance matrix, we do get the expected number of numerically nonzero singular values (all but one).
corr_u <- cor(data)
s_corr_u <- svd(corr_u)
nz_corr_u <- (s_corr_u$d > sqrt(.Machine$double.eps) * s_corr_u$d[1])
table(nz_corr_u)
#> nz_corr_u
#> FALSE TRUE
#> 1 8
##' Starting from the pooled covariance:
corr_p <- cv / outer(sqrt(diag(cv)), sqrt(diag(cv)))
s_corr_p <- svd(corr_p)
nz_corr_p <- (s_corr_p$d > sqrt(.Machine$double.eps) * s_corr_p$d[1])
table(nz_corr_p)
#> nz_corr_p
#> FALSE TRUE
#> 1 8
So I conjecture our Mahalanobis results would be more like his in this case, and more in line with the general user expectations he's modeling, if we standardized the columns of the data matrix before calculating the covariance and its (generalized) inverse.
To consider: if we go this route, should the standardization use means and SDs or robust alternatives, such as trimmed means and MADs? #134
That's a good observation. I think the extreme difference in the scales of the predictors can cause numerical problems like these.
@ngreifer reports:
Created on 2022-05-16 by the reprex package (v2.0.1)
Originally posted by @ngreifer in https://github.com/markmfredrickson/optmatch/issues/218#issuecomment-1128151151