ngreifer / WeightIt

WeightIt: an R package for propensity score weighting
https://ngreifer.github.io/WeightIt/

Error in if (sum(weights[treat == tval1_0] > 0) < 1 || sum(weights[treat != : missing value where TRUE/FALSE needed #64

Open mitchellcameron123 opened 3 weeks ago

mitchellcameron123 commented 3 weeks ago

Hi,

I am trying to perform GBM weighting on the dataset attached below, which is a publicly available CSV file.

coffee_data.csv

My code is:

coffee_formula <- as.formula(as.factor(certified) ~ age_hh + agesq + nonfarmincome_access + logtotal_land + depratio + badweat + edu + gender + years_cofeproduction + access_credit)

mod <- weightit(coffee_formula, data = coffee_data, method = "gbm", estimand = "ATE", shrinkage = 0.1, interaction.depth = 1:2, bag.fraction = 0.8, criterion = "smd.mean", n.trees = 10000)

This results in the error: Error in if (sum(weights[treat == tval1_0] > 0) < 1 || sum(weights[treat != : missing value where TRUE/FALSE needed

I can make the error stop by choosing fewer trees or adjusting the other parameters, but there doesn't seem to be any pattern to when it appears. Stepping through with browser() and looking inside the col_w_smd() function, it appears that one of the weights is NaN, which seems to cause the error, though I am not sure why.

I am hoping for some guidance here because I have no idea what to do. Thank you very much.

ngreifer commented 2 weeks ago

Thank you so much for this report and sorry about the bug. This occurred because some propensity scores were estimated to be 0 or 1 in some trees of the GBM model, which yielded non-finite weights that were not correctly processed by col_w_smd(). This issue has been fixed in the development version, which you can install using

remotes::install_github("ngreifer/WeightIt")

My solution was to truncate extreme propensity scores and improve the robustness of the function for computing weights. In practice, any tree with such extreme propensity scores will not be chosen as optimal because the weights will be extreme.
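To see why extreme propensity scores break the weights, here is a toy illustration of the arithmetic (this is not WeightIt's internal code): ATE inverse-probability weights are 1/ps for treated units and 1/(1 - ps) for controls, so a propensity score of exactly 0 or 1 yields a non-finite weight.

```r
# Toy illustration (not WeightIt internals): a propensity score of
# exactly 0 or 1 produces a non-finite ATE weight.
ps    <- c(0.2, 0.5, 1)   # third unit has an extreme propensity score
treat <- c(1, 0, 0)
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
w
#> [1]   5   2 Inf
```

Downstream computations on such weights (means, standardized mean differences) can then produce NaN, which is what surfaced in col_w_smd().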

mitchellcameron123 commented 2 weeks ago

Dear Noah,

Thank you for your fast reply and for your involvement with this package and others. I am a master's student with a keen interest in causal inference. My project is a guide to causal inference using machine learning, and the features in cobalt, WeightIt, and MatchIt have been fantastic; the documentation is very informative.

I have also noticed an additional bug when tuning the model. I'm happy to open a new issue on GitHub if you like; it might be better described as a feature request.

It is hard to explain over email, so I have created an annotated R script that demonstrates the problem and two instances where it would arise. It should all be reproducible.

My gut feeling is that this is due to the random seed differing between the individual fits within the grid search. E.g., if I set the seed before fitting, spec 1 will use the freshly set seed, and spec 2 will continue the RNG stream from where spec 1 finished. Thus, depending on the order of the tuning parameters provided, the same specifications can have different outcomes because they use different portions of the stream. A short example of this is in the script as well.

Perhaps a seed argument could be passed to weightit that sets the seed for each gbm fit so that there is consistency, as in grf's causal_forest. Then for grid searches 1, 2, 3, etc., the same specification would produce the same result regardless of the order of the parameters.
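The seed-drift behaviour I mean can be sketched with plain random draws standing in for the GBM fits (a minimal illustration, not WeightIt code):

```r
# With a single set.seed() before a grid search, each successive fit
# consumes a different portion of the RNG stream.
set.seed(123)
draw_spec1 <- rnorm(1)  # "spec 1" uses the freshly set seed
draw_spec2 <- rnorm(1)  # "spec 2" continues the stream, so it differs

# Reseeding before each fit makes every spec start from the same state.
set.seed(123)
draw_spec2_reseeded <- rnorm(1)
identical(draw_spec1, draw_spec2_reseeded)
#> [1] TRUE
```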

Kind Regards, Mitchell Cameron

ngreifer commented 2 weeks ago

Hi Mitchell,

Feel free to send me whatever would be helpful. Results for GBM are only random if you set bag.fraction to something less than 1. The default and recommended behavior is to set it to 1, in which case there is no randomness unless cross-validation is used to select the tuning parameter. When there is no randomness, you don't need to set a seed at all and your problem is avoided.

mitchellcameron123 commented 2 weeks ago

Hi Noah,

Thanks again for your fast reply.

To clarify, what would you like me to send that might be helpful? Did my R script come through, or do you mean that I should post it on GitHub as well?

Why is it recommended not to lower bag.fraction? My anecdotal experience and the original McCaffrey et al. (2004) paper suggest that stochastic boosting does improve performance, although I would love to be corrected if the consensus has changed since 2004.

I do believe that a seed option would add value for consistency when using bag.fraction. For example, in an iterative tuning process, where multiple grid searches progressively narrow in on the optimal values, this consistency is quite important; without it, the local neighbourhood of the parameters is difficult to assess.

Kind regards, Mitchell



ngreifer commented 2 weeks ago

Hi Mitchell,

I didn't receive an attachment. I don't think email attachments work in GitHub issues; you need to upload the attachment directly into the issue on GitHub rather than including it in an email reply. But I think I understand the problem anyway: if the seed differs between runs with different tuning parameters, any differences in performance could be due either to the different samples drawn at each iteration or to the different values of the tuning parameters, so ordering the resulting specifications by performance doesn't accurately allow you to assess which specification is best in general. Holding the samples drawn at each iteration constant (i.e., by using the same seed each time the model is fit with a different tuning parameter specification) allows you to isolate the variability in performance due to the different specifications.

It's true that the original literature on GBM and McCaffrey et al. (2004) recommend using a bag fraction less than 1, but I don't think that advice applies anymore. It relies on the idea that a machine learning model should seek to avoid overfitting, but in propensity score analysis, overfitting is not a problem, because balance achieved in the training sample, not predictive performance in a test sample, is the criterion of interest. Overfitting severe enough to cause perfect separation is already controlled by other parameters, including the number of trees and the shrinkage parameter. So I don't think one needs to introduce additional randomness to prevent overfitting.

twang, the package developed by the team that wrote McCaffrey et al (2004) to implement the method, has always used a bag.fraction of 1 as the default, and it is not even named as a manipulable parameter in the user guide, suggesting that they recommend not changing it.

All that said, I'll take your suggestion on board. R has facilities to retain and reset a seed, so it's possible to ensure that each fit of the model uses the same seed.
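The mechanism in base R is the .Random.seed object, which can be snapshotted and restored so each fit sees the same random stream (a sketch of the mechanism, not WeightIt's actual implementation):

```r
# Snapshot and restore the RNG state so repeated "fits" draw the same
# random numbers.
set.seed(42)
saved_state <- .Random.seed   # snapshot the current RNG state

x1 <- runif(3)
.Random.seed <- saved_state   # restore the snapshot before the next draw
x2 <- runif(3)
identical(x1, x2)
#> [1] TRUE
```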

mitchellcameron123 commented 2 weeks ago

Hi,

Yes, I believe we are on the same page. I've attached the file anyway (probably redundant now) as a .txt file, since GitHub does not seem to allow .R files.

For Noah.txt

Thank you very much for all of your work on this package :)