General matching - Githubissues

kamoors commented 1 year ago

Dear sir/madam,

I am investigating the combined effect of a drug and one nutrient in a cohort of 1280 people. However, I only have 77 patients / samples using the drug. I wanted to match these patients based on BMI, age, Gender, and general food intake. I wrote it like this:

m.out1 <- matchit(drug ~ BMI_mb + GENDER + AGE + food, data = data, method = "full", distance = "glm")

The balance seems ok.

Can I then go into my metabolomics data and run a glm like this?

glm(metabolite ~ drug * nutrient + BMI_mb + GENDER + AGE + food + batch + subclass, data = test_df, weights = mets_samples$weights)

Or am I only "allowed" to use the matched data with the functions described in the vignette?

kamoors commented 1 year ago

Additionally, if I want to do bootstrapping with my glm output, will the weights be a problem since not all samples from each subclass will be chosen?

ngreifer commented 1 year ago

This is primarily a statistical consulting question and not really about using MatchIt. I recommend you seek the services of a statistical consultant with experience in causal inference. I'll offer a few thoughts here, but I don't have the bandwidth to fully advise on your methodology, and to provide such advice would require a much longer conversation about your estimand, the meaning of your variables, the goals of the study, etc.

A few things that strike me as odd:

Why aren't you balancing on nutrient in the matching?
Why are you settling for okay balance instead of excellent balance?
Why didn't you extract the matched dataset using match.data()?
Why didn't you include the matching weights in the outcome model?
If you are using sampling weights, why didn't you include those in the call to matchit()?
Why are you using glm() and not lm() if you are fitting a linear model? If you are not fitting a linear model, why don't you have family specified?
Why are you including subclass as a predictor instead of using it to adjust the standard errors?
How are you computing the treatment effect from the model?
If you are bootstrapping, follow the bootstrapping instructions in the vignette on estimating effects. You need to bootstrap subclasses, not individuals. Why do you want to bootstrap?

My advice is to follow the advice that already exists. The vignettes provide very clear instructions on how to estimate treatment effects using best practices. Why do you want to deviate from these instructions? I recommend seeking a statistical consultant who can help you answer these questions if you don't know the answers to them. Matching is an advanced statistical method that requires expertise to do well, especially if you want to deviate from known best practices.
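
As a rough illustration of the workflow these questions point to (not code from the thread), here is a minimal sketch on simulated data. The data frame `dat`, its values, and the outcome model are assumptions made up for the example; only `matchit()`, `summary()`, and `match.data()` come from MatchIt. It assumes the optmatch package is installed, which `method = "full"` relies on.

```r
library(MatchIt)

# Simulated stand-in data; every value here is invented for illustration.
set.seed(1)
n <- 500
dat <- data.frame(
  BMI_mb   = rnorm(n, 26, 4),
  GENDER   = factor(sample(c("F", "M"), n, replace = TRUE)),
  AGE      = round(runif(n, 20, 75)),
  food     = rnorm(n),
  nutrient = rnorm(n)
)
# Drug use depends on covariates, so the raw groups are imbalanced.
p_drug <- plogis(-2.5 + 0.05 * (dat$BMI_mb - 26) +
                   0.02 * (dat$AGE - 45) + 0.3 * dat$nutrient)
dat$drug <- rbinom(n, 1, p_drug)
dat$metabolite <- 1 + 0.5 * dat$drug + 0.3 * dat$nutrient +
  0.02 * dat$AGE + rnorm(n)

# Match on the covariates AND on the nutrient (the moderator of interest).
m.out <- matchit(drug ~ BMI_mb + GENDER + AGE + food + nutrient,
                 data = dat, method = "full", distance = "glm")

# Inspect balance before going any further.
print(summary(m.out))

# Extract the matched dataset; it carries `weights` and `subclass` columns.
md <- match.data(m.out)

# Outcome model fitted with the matching weights. `subclass` is not a
# predictor here; it is kept for adjusting the standard errors later.
fit <- lm(metabolite ~ drug * nutrient + BMI_mb + GENDER + AGE + food,
          data = md, weights = weights)
```

The fitted model is only an intermediate step: turning it into a treatment effect estimate with appropriate standard errors is a separate step covered in the estimating effects vignette.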

kamoors commented 1 year ago

Thank you for answering so quickly! I think the questions you ask offer food for thought. Let me try to answer them so that maybe I can distill a solution to my problem (I'll answer in the quote below).

Just a quick note on the research. From the nutritional information, I have hundreds of variables, and I am only interested in the interaction of the drug (0 or 1) with this specific nutrient.

I have many different data types, but the overall question is what this interaction does to the gut microbiome (and subsequently the host). We are employing constraint-based modelling, community modelling, and metabolomics to answer this question.

> This is primarily a statistical consulting question and not really about using MatchIt. I recommend you seek the services of a statistical consultant with experience in causal inference. I'll offer a few thoughts here, but I don't have the bandwidth to fully advise on your methodology, and to provide such advice would require a much longer conversation about your estimand, the meaning of your variables, the goals of the study, etc.

> A few things that strike me as odd:

> Why aren't you balancing on nutrient in the matching?

I was considering doing that, but the idea behind matching is to equalize the "other" variables, right? If I include my nutrient of interest, wouldn't that nullify the differences?

> Why are you settling for okay balance instead of excellent balance?

> Why didn't you extract the matched dataset using match.data()?

The question only contains a small section of my code. I followed the vignette and tested multiple settings of matchit(). So, yes, I do use the output of match.data().

> Why didn't you include the matching weights in the outcome model?

I use the "weights" column from match.data() for the weights parameter of the glm.

> If you are using sampling weights, why didn't you include those in the call to matchit()?

> Why are you using glm() and not lm() if you are fitting a linear model? If you are not fitting a linear model, why don't you have family specified?

I have community modelling data where the residuals are non-normally distributed in various cases. This is just a carry-over from those analyses (since glm with 'gaussian' is just an lm).

> Why are you including subclass as a predictor instead of using it to adjust the standard errors?

This was the thing that I was very unsure about. I realize that doing so might introduce collinearity; that's why I'm asking the questions here :)

> How are you computing the treatment effect from the model?

The idea was to fit a model per metabolite / exchanged metabolite (from the metabolic modelling) to establish whether the use of the drug in combination with the nutrient has a specific effect on that metabolite / flux / bacterial species.

> If you are bootstrapping, follow the bootstrapping instructions in the vignette on estimating effects. You need to bootstrap subclasses, not individuals. Why do you want to bootstrap?

I found these instructions after posting. Thanks!

> My advice is to follow the advice that already exists. The vignettes provide very clear instructions on how to estimate treatment effects using best practices. Why do you want to deviate from these instructions? I recommend seeking a statistical consultant who can help you answer these questions if you don't know the answers to them. Matching is an advanced statistical method that requires expertise to do well, especially if you want to deviate from known best practices.

ngreifer commented 1 year ago

Some responses:

Matching on nutrient is meant to nullify the association between nutrient and drug, but that does not affect the relationships between nutrient and metabolite or between drug and metabolite, which are what you are studying. The estimating effects vignette provides instructions for moderation analysis, and balancing on the moderator is important. Ideally, you have balance within each level of the moderator (i.e., at each level of the moderator, treatment is independent of covariates). This usually involves including interactions between the moderator and the covariates.
Including subclass as a predictor doesn't (just) introduce collinearity; it also fails to preserve the estimand and can limit the degrees of freedom for your estimates, which makes it harder to detect effects. Where did you see this method used? It is not a standard practice (though in some cases it can be equivalent to using a cluster-robust SE).
Fitting a model is not enough to estimate a treatment effect; there are instructions in the vignette for extracting a treatment effect estimate from a model using g-computation. Using the coefficient on treatment in the outcome model as the treatment effect is not generally a valid method and it must be done with care. Matching is not meant to provide a good outcome model; it is meant to enable unbiased estimation of the treatment effect, which is done with g-computation.
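
To make the g-computation step concrete, here is a minimal sketch on simulated data, along the lines of what the estimating effects vignette describes. The data, variable values, and the simplified covariate set are assumptions for illustration only. It uses `avg_comparisons()` from the marginaleffects package and assumes optmatch is installed for `method = "full"`.

```r
library(MatchIt)
library(marginaleffects)

# Simulated stand-in data; every value here is invented for illustration.
set.seed(2)
n <- 500
dat <- data.frame(AGE = runif(n, 20, 75), nutrient = rnorm(n))
dat$drug <- rbinom(n, 1, plogis(-2 + 0.02 * (dat$AGE - 45) +
                                  0.3 * dat$nutrient))
dat$metabolite <- 1 + 0.5 * dat$drug + 0.3 * dat$nutrient +
  0.02 * dat$AGE + rnorm(n)

# Full matching (targets the ATT by default), then extract matched data.
m.out <- matchit(drug ~ AGE + nutrient, data = dat, method = "full")
md <- match.data(m.out)

# Weighted outcome model with treatment-by-covariate interactions.
fit <- lm(metabolite ~ drug * (AGE + nutrient),
          data = md, weights = weights)

# G-computation: average the predicted drug = 1 vs drug = 0 contrast over
# the treated units, with standard errors clustered on matching subclass.
att <- avg_comparisons(fit,
                       variables = "drug",
                       vcov = ~subclass,
                       newdata = subset(md, drug == 1),
                       wts = "weights")
print(att)
```

Here `subclass` enters only through `vcov = ~subclass`, not as a predictor, and the reported estimate is an average contrast of predictions rather than a single model coefficient.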

kamoors commented 1 year ago

The general issue that I had with this matching procedure is that I am basically looking for an effect of a combined variable, where drug use is 0 or 1, but the nutrient is a continuous variable. Since matchit looks at the treatment effect (which comes from 0 or 1), I wonder if I can even use this matching procedure for my purpose. Plus, this is all existing data that was created for a different purpose, so I have to make do with what I have...

Regarding including subclass, my rationale was that there might be some effects that only occur within a specific subclass. I thought that by including subclass I would diminish those effects, but clearly not.

Overall, the effects of the drug alone and of the nutrient alone have been studied before, but there is evidence of a specific interaction effect in certain cases. I want to know if we can also see this effect in the general population. So, we want to find observations that might be affected by the interaction.

ngreifer commented 1 year ago

It is possible to study the effect of two combined treatments, but you can't use matching for that. One approach is described in VanderWeele (2009). This is a very advanced problem and there has not been much work done on it. I would recommend collaborating with a methodologist who has expertise in causal inference to do this research.

kamoors commented 1 year ago

Ok, thank you for all your help!

kosukeimai / MatchIt

General matching #174