py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Orthogonal / Double Machine Learning: highly imbalanced treatment labels and highly skewed outcome labels? #472

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi! Does anyone know of existing papers or methods that apply Double ML to highly imbalanced datasets, for example one where fewer than 1% of samples are treated and the other >99% are untreated? Another example would be a dataset with a continuous outcome where most values of Y are 0 and fewer than 1% of samples have a greater-than-zero Y.

Traditionally, in supervised machine learning problems such as fraud detection, people use techniques such as downsampling or upsampling the outcome to avoid fitting a trivial predictive model that assigns every sample to the majority class. However, this might introduce bias into Double ML's final-stage coefficients if we apply the same upsampling/downsampling in the first-stage ML models that predict T and Y. Are there any existing papers that address this?
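One way to see the concern is a minimal sketch in plain Python, under two simplifying assumptions that are not from the thread: untreated rows are downsampled at random (independently of X), and the first-stage model recovers the conditional treatment probability exactly on whatever data it is trained on. The function names and the numbers `true_p` and `beta` are hypothetical and are not part of EconML. Under those assumptions, a propensity learned on downsampled data is shifted upward, so the treatment residual T - E[T|X] used in the final stage is distorted unless the prediction is mapped back to the original scale.

```python
def downsampled_propensity(p, beta):
    """Propensity implied by the data after keeping only a fraction
    `beta` of the untreated rows (treated rows are all kept)."""
    return p / (p + beta * (1.0 - p))


def recalibrated_propensity(p_s, beta):
    """Invert the shift above: map a propensity learned on the
    downsampled data back to the original population scale."""
    return beta * p_s / (beta * p_s + 1.0 - p_s)


# Hypothetical values for illustration only.
true_p = 0.01  # 1% of samples are treated in the full data
beta = 0.05    # keep 5% of the untreated rows when downsampling

p_s = downsampled_propensity(true_p, beta)
p_back = recalibrated_propensity(p_s, beta)

# Treatment residual for a treated unit (T = 1) under each estimate.
print("residual with true propensity:        ", 1 - true_p)
print("residual with downsampled propensity: ", 1 - p_s)
print("residual after recalibration:         ", 1 - p_back)
```

The recalibration step only undoes the shift for this idealized random-downsampling case; it does not answer whether resampling the outcome model for Y is safe.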

yuxinchenNU commented 2 years ago

Was wondering if there is an update on this? The outcome variable is quite often skewed, but if a log transform is applied, the treatment effect produced by the framework will be on the log scale.
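For reference, a log-scale coefficient can be read as a multiplicative effect. The sketch below assumes a model of the form log(Y) = theta * T + g(X) + noise, with `theta` a hypothetical fitted coefficient; the helper names are illustrative and not part of EconML. Under that assumption, a one-unit increase in T multiplies Y by exp(theta). Note that this reading does not apply directly when Y contains zeros, since log(0) is undefined.

```python
import math


def multiplicative_effect(theta):
    """Factor by which Y is scaled per one-unit increase in T,
    given a coefficient estimated on log(Y)."""
    return math.exp(theta)


def relative_change(theta):
    """Same effect expressed as a relative change (0.10 means +10%)."""
    return math.expm1(theta)


theta = 0.05  # hypothetical log-scale estimate
print("multiplicative effect:", multiplicative_effect(theta))
print("relative change:      ", relative_change(theta))
```

For small `theta` the relative change is close to `theta` itself, which is why log-scale coefficients are often read as approximate percentage effects.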