rth opened this issue 5 years ago
I would say yes. Then, we would need to think about the right module to do that.
> In other words, the distribution of one of the columns of X does not match the real-world distribution and we would like to compensate for it.
I did not look at the paper yet, but is it related to importance sampling, in which you would like to sample the column of X such that it follows a given "real-world" distribution?
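To make the importance-sampling reading concrete, here is a minimal sketch in plain NumPy (not an existing imbalanced-learn API): samples are reweighted so that a discrete column of X matches an assumed "real-world" marginal distribution. The toy data and `target_probs` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one discrete feature whose empirical distribution
# differs from the real-world one we want to match.
x = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])
target_probs = {0: 0.3, 1: 0.4, 2: 0.3}  # assumed real-world marginal

# Importance weight per sample = target density / empirical density.
values, counts = np.unique(x, return_counts=True)
empirical_probs = dict(zip(values, counts / len(x)))
weights = np.array([target_probs[v] / empirical_probs[v] for v in x])

# Resample with probability proportional to the weights; the
# resampled column now approximately follows `target_probs`.
idx = rng.choice(len(x), size=len(x), replace=True, p=weights / weights.sum())
print(np.bincount(x[idx]) / len(x))
```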
In the case of over-sampling, we could think about something similar, in which you estimate the distribution (or parameters such as covariances) from other datasets and use this in the rebalancing procedure. It would be a kind of data augmentation using knowledge from data instead of random generation.
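And a hedged sketch of the over-sampling variant: estimate the parameters of a distribution (here simply a Gaussian mean and covariance) from a hypothetical richer external dataset, and draw synthetic minority samples from it. `X_external` and `X_minority` are made-up arrays for illustration; nothing here is existing imbalanced-learn API.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a large external dataset and a small minority class.
X_external = rng.normal(loc=[1.0, -1.0], scale=0.5, size=(5000, 2))
X_minority = rng.normal(loc=[1.0, -1.0], scale=0.5, size=(30, 2))

# Estimate distribution parameters from the richer external data ...
mu = X_external.mean(axis=0)
cov = np.cov(X_external, rowvar=False)

# ... and use them to generate synthetic minority samples.
X_synthetic = rng.multivariate_normal(mu, cov, size=100)
X_augmented = np.vstack([X_minority, X_synthetic])
print(X_augmented.shape)  # (130, 2)
```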
I would be really interested in implementing such a feature, or in helping with it.
We should include some of these in 1.X.
@rth Did you see some of the methods in the literature? Probably we should look at the fairness papers.
I have not really looked into this question since opening this issue in February.
Well,

> Probably we should look at the fairness papers.
Yes. There is a body of research regarding this subject. I think this problem is itself an imbalance problem, so we can tackle it inside imbalanced-learn. API-wise we may probably need some changes. I leave here one (of the many) relevant papers.
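As a rough illustration of how the existing samplers might be bent toward the fairness setting (a sketch of a possible usage pattern, not a proposed or existing API): pass the sensitive attribute as the resampling target and let the real class labels ride along as an extra column of X.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)                     # real class labels
group = rng.choice([0, 1], size=200, p=[0.9, 0.1])   # sensitive attribute

# Stack y into X so it is carried through the resampling,
# then rebalance with respect to `group` rather than `y`.
Xy = np.column_stack([X, y])
X_res, group_res = RandomOverSampler(random_state=0).fit_resample(Xy, group)
X_bal, y_bal = X_res[:, :-1], X_res[:, -1].astype(int)
print(np.bincount(group_res))  # both groups now have equal counts
```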
It's a bit of an open-ended question. In my understanding, up/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed e.g. by Zadrozny 2004.

In the use case of imbalanced-learn, I gather that this is not an issue because the sample selection only happens depending on the target variable y, not on any of the features in X (which corresponds to case 2 on page 2 of the above-linked paper)?

An orthogonal question: assume we do have some dataset with sample selection bias based on some feature in X (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of X does not match the real-world distribution and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for it? Would something like this be in the scope of this project?
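For what it's worth, here is one hedged way the "case 3" compensation could be approximated with the current tools: bin the biased column of X and hand the bins to a sampler as if they were classes, with `real_world_probs` standing in for the (assumed known) true distribution. Nothing below is existing imbalanced-learn API for this use case; it only reuses `RandomUnderSampler` with a dict `sampling_strategy`.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(1000, 1)))         # a feature with biased sampling
bins = np.digitize(X[:, 0], [0.5, 1.5])        # 3 bins acting as "classes"

real_world_probs = np.array([0.2, 0.5, 0.3])   # assumed true bin proportions
n_target = 300                                 # size of the corrected sample
strategy = {b: int(p * n_target) for b, p in enumerate(real_world_probs)}

# Under-sample each bin down to its real-world share.
rus = RandomUnderSampler(sampling_strategy=strategy, random_state=0)
X_res, bins_res = rus.fit_resample(X, bins)
print(np.bincount(bins_res))  # -> [ 60 150  90]
```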