scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[ENH] Non-IID data distribution #931

Closed U-n-Own closed 2 years ago

U-n-Own commented 2 years ago

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Non-IID data often arise when training models on distributed devices: the data are unbalanced across the devices, and each device has a different distribution of labels as well. Federated Learning, for example, has plenty of such cases.
I want to propose an algorithm that takes some data and distributes it in a non-IID fashion. I had to do this for an experiment, but I didn't find any general algorithm that does it, so I'm proposing to create one. I don't know if this is the right place.

Describe alternatives you've considered

Ideally, we take the data and the labels, then distribute the data in one of two ways, or a mix of the two: unbalancing the amount of data in each sub-distribution, or unbalancing the labels in each sub-distribution.
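
To make the two options concrete, here is a minimal sketch of how such a split could look. It is not part of imblearn, and the function names (`dirichlet_label_partition`, `dirichlet_quantity_partition`) and the `alpha` parameter are hypothetical, chosen only for illustration. It assumes a Dirichlet-based scheme, which is one common way to simulate non-IID clients in Federated Learning experiments: a smaller `alpha` gives a more skewed split.

```python
import numpy as np


def dirichlet_label_partition(y, n_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with skewed label distributions.

    Hypothetical helper: for each class, the share sent to each client is
    drawn from Dirichlet(alpha). Smaller alpha means stronger label skew.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    client_indices = [[] for _ in range(n_clients)]
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Turn the proportions into cut points inside this class's indices.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]


def dirichlet_quantity_partition(n_samples, n_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with unequal sizes.

    Hypothetical helper: labels are ignored, only the number of samples
    per client is skewed.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    proportions = rng.dirichlet(alpha * np.ones(n_clients))
    cuts = (np.cumsum(proportions)[:-1] * n_samples).astype(int)
    return [np.sort(part) for part in np.split(idx, cuts)]


# Toy data: 3 classes with 100 samples each, split across 4 clients.
y = np.repeat(np.arange(3), 100)
label_parts = dirichlet_label_partition(y, n_clients=4, alpha=0.3)
size_parts = dirichlet_quantity_partition(len(y), n_clients=4, alpha=0.3)
```

Each returned array holds the sample indices for one client, so `X[label_parts[k]]` and `y[label_parts[k]]` would give client `k` its share. Some clients may end up with few or no samples when `alpha` is small.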

Additional context

hayesall commented 2 years ago

This question is a little too broad at the moment; we mainly focus on extensions and issues for imblearn here.

More-general Q&A forums for machine learning topics (e.g. https://stats.stackexchange.com/) might be a better fit for this.

Follow up in the future if there's a good way to approach this. #105 is also where notes on new methods are currently tracked.