scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Sample selection bias and up/down-sampling #540

Open rth opened 5 years ago

rth commented 5 years ago

It's a bit of an open-ended question. In my understanding, up/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed, e.g., by Zadrozny 2004.

For the imbalanced-learn use case, I gather that this is not an issue because the sample selection depends only on the target variable y, not on any of the features in X (which corresponds to case 2 on page 2 of the above-linked paper)?
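To make the "selection depends only on y" case concrete, here is a minimal sketch using only the Python standard library. It is a stand-in for random under-sampling, not imbalanced-learn's implementation; the synthetic data and the helper name `random_under_sample` are assumptions made for illustration. It shows that class-dependent sampling changes the class prior while leaving the within-class feature distribution roughly unchanged.

```python
import random
from statistics import mean

rng = random.Random(0)

# Synthetic imbalanced data (assumed for illustration):
# class 0 has x ~ N(0, 1), class 1 has x ~ N(2, 1), with a 90/10 split.
data = [(rng.gauss(0, 1), 0) for _ in range(9000)]
data += [(rng.gauss(2, 1), 1) for _ in range(1000)]


def random_under_sample(data, rng):
    """Hypothetical helper: keep as many rows per class as the rarest class has."""
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    n_min = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        # Selection inside a class is uniform, so it ignores the value of x.
        balanced.extend(rng.sample(rows, n_min))
    return balanced


balanced = random_under_sample(data, rng)

# The class prior P(y = 1) changes ...
prior_before = mean(y for _, y in data)
prior_after = mean(y for _, y in balanced)

# ... while the mean of x within class 0 stays approximately the same.
mean_x0_before = mean(x for x, y in data if y == 0)
mean_x0_after = mean(x for x, y in balanced if y == 0)
```

Because rows are dropped uniformly within each class, only P(y) is altered, which is the situation described as case 2 above.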

An orthogonal question: suppose we have a dataset with sample selection bias based on some feature in X (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of X does not match the real world distribution and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?

glemaitre commented 5 years ago

I would say yes. Then, we would need to think about the right module to do that.

In other words, the distribution of one of the columns of X does not match the real world distribution and we would like to compensate for it.

I have not looked at the paper yet, but is it related to importance sampling, in which you would like to sample the X column such that it follows a given "real-world" distribution?
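As a rough illustration of that importance-sampling idea, the sketch below resamples rows with weights equal to the target probability divided by the observed probability of a feature value. It assumes a single categorical feature and a known target distribution; the helper name `importance_resample` and the toy data are hypothetical, and nothing here is part of the imbalanced-learn API.

```python
import random
from collections import Counter


def importance_resample(rows, feature, target_dist, n_samples, seed=0):
    """Hypothetical helper: resample rows so `feature` follows `target_dist`."""
    # Observed (biased) frequency of each feature value.
    counts = Counter(row[feature] for row in rows)
    total = len(rows)
    # Importance weight = target probability / observed probability.
    weights = [
        target_dist.get(row[feature], 0.0) / (counts[row[feature]] / total)
        for row in rows
    ]
    rng = random.Random(seed)
    # Sampling with replacement, proportionally to the importance weights.
    return rng.choices(rows, weights=weights, k=n_samples)


# Toy biased sample: group "a" makes up 80% of the rows, while the assumed
# real-world share is 50%.
biased = [{"group": "a", "y": i % 2} for i in range(800)]
biased += [{"group": "b", "y": i % 2} for i in range(200)]

resampled = importance_resample(
    biased, "group", {"a": 0.5, "b": 0.5}, n_samples=10_000
)
share_a = sum(row["group"] == "a" for row in resampled) / len(resampled)
```

After resampling, `share_a` should be close to the target share of 0.5. A continuous feature would need a density-ratio estimate instead of the simple frequency counts used here.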

In the case of over-sampling, we could think of something similar, in which you estimate a distribution (or parameters such as covariances) from other datasets and use it in the rebalancing procedure. It would be a kind of data augmentation that uses knowledge from data instead of random generation.

I would be really interested in implementing this or helping with it.

glemaitre commented 5 years ago

We should include some of these in 1.X

glemaitre commented 5 years ago

@rth Did you see some of the methods in the literature? Probably we should look at the fairness papers.

rth commented 5 years ago

I have not really looked into this question since opening this issue in February.

chkoar commented 4 years ago

Well,

Probably we should look at the fairness papers.

Yes. There is a body of research on this subject. I think that even this problem is one of imbalance, so we can tackle it inside imbalanced-learn. API-wise, we will probably need some changes. I leave here one (of the many) relevant papers.