Orthogonal / Double Machine Learning: highly imbalanced treatment labels and highly skewed outcome labels?

py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.


Hi! Does anyone know of existing papers or methods that apply Double ML to highly imbalanced datasets? One example is a dataset in which fewer than 1% of samples are treated and the remaining >99% are untreated. Another is a dataset with a continuous outcome in which most values of Y are 0 and fewer than 1% of samples have Y greater than zero.

Traditionally, in supervised machine learning problems such as fraud detection, people use techniques such as downsampling or upsampling the outcome classes to avoid fitting a trivial predictive model that assigns every sample to the majority class. However, applying such resampling in the first-stage ML models that predict T and Y might bias the coefficients that Double ML estimates in the final stage. Are there existing papers that have addressed this?
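To make the concern concrete, here is a minimal simulation sketch. It is not EconML's API: it is a hand-rolled, cross-fitted partialling-out estimator written in plain NumPy, with linear first-stage models, a single binary confounder, a treatment rate of about 1%, and a true effect of 2.0. All of these choices are assumptions made for illustration, and the helper names (`fit_ols`, `downsample_untreated`, `dml_estimate`) are hypothetical. It compares fitting the treatment model on the full training fold with fitting it on a fold where untreated samples are downsampled to match the treated count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed setup): binary confounder x, ~1% treated, true effect 2.0.
n = 200_000
true_theta = 2.0
x = rng.integers(0, 2, size=n).astype(float)
t = rng.binomial(1, 0.005 + 0.01 * x).astype(float)  # P(T=1 | x) is 0.5% or 1.5%
y = true_theta * t + x + rng.normal(size=n)


def design(x):
    """Design matrix with an intercept column."""
    return np.column_stack([np.ones_like(x), x])


def fit_ols(x, target):
    """Least-squares coefficients of target on [1, x]."""
    coef, *_ = np.linalg.lstsq(design(x), target, rcond=None)
    return coef


def downsample_untreated(idx, t, rng):
    """Keep all treated rows in idx and an equal number of random untreated rows."""
    treated = idx[t[idx] == 1]
    untreated = idx[t[idx] == 0]
    keep = rng.choice(untreated, size=len(treated), replace=False)
    return np.concatenate([treated, keep])


def dml_estimate(x, t, y, rng, downsample=False, n_folds=2):
    """Cross-fitted residual-on-residual estimate of the treatment effect."""
    folds = rng.permutation(len(x)) % n_folds
    res_t = np.empty_like(t)
    res_y = np.empty_like(y)
    for k in range(n_folds):
        train = np.flatnonzero(folds != k)
        test = np.flatnonzero(folds == k)
        # Optionally rebalance only the rows used to fit the treatment model.
        t_rows = downsample_untreated(train, t, rng) if downsample else train
        t_hat = design(x[test]) @ fit_ols(x[t_rows], t[t_rows])
        y_hat = design(x[test]) @ fit_ols(x[train], y[train])
        res_t[test] = t[test] - t_hat
        res_y[test] = y[test] - y_hat
    # Final stage: regress outcome residuals on treatment residuals (no intercept).
    return float(res_t @ res_y / (res_t @ res_t))


theta_full = dml_estimate(x, t, y, rng, downsample=False)
theta_ds = dml_estimate(x, t, y, rng, downsample=True)
print(f"full-data first stage:   {theta_full:.3f}")
print(f"downsampled first stage: {theta_ds:.3f}")
```

In this particular setup, the rebalanced treatment model predicts treatment probabilities far above the true ~1% rate, so the treatment residuals no longer average to zero and the final-stage coefficient is pulled strongly toward zero, while the full-data version should land near 2.0. This only illustrates the uncorrected case; it says nothing about resampling combined with a correction of the predicted probabilities.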

Orthogonal / Double Machine Learning: highly imbalanced treatment labels and highly skewed outcome labels? #472