scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License
6.81k stars 1.28k forks

PyTorch utilities sampler #424

Open glemaitre opened 6 years ago

glemaitre commented 6 years ago

We could add utilities for PyTorch. Basically, it should inherit from torch.utils.data.Sampler.

The implementation could look something like this:

from imblearn.under_sampling import RandomUnderSampler
from sklearn.base import clone
from sklearn.utils import check_random_state
# Assumed import path for this helper at the time of the comment.
from sklearn.utils.testing import set_random_state
from torch.utils.data import Sampler


class BalancedSampler(Sampler):

    def __init__(self, X, y, sampler=None, random_state=None):
        self.X = X
        self.y = y
        self.sampler = sampler
        self.random_state = random_state
        self._sample()

    def _sample(self):
        # 'return_indices' and 'fit_sample' belong to the imbalanced-learn
        # API available when this comment was written.
        random_state = check_random_state(self.random_state)
        if self.sampler is None:
            self.sampler_ = RandomUnderSampler(return_indices=True,
                                               random_state=random_state)
        else:
            if not hasattr(self.sampler, 'return_indices'):
                raise ValueError("'sampler' needs to return the indices of "
                                 "the samples selected. Provide a sampler "
                                 "which has an attribute 'return_indices'.")
            self.sampler_ = clone(self.sampler)
            self.sampler_.set_params(return_indices=True)
            set_random_state(self.sampler_, random_state)

        _, _, self.indices_ = self.sampler_.fit_sample(self.X, self.y)
        # shuffle the indices since the sampler packs them by class
        random_state.shuffle(self.indices_)

    def __iter__(self):
        return iter(self.indices_.tolist())

    def __len__(self):
        # number of indices yielded per pass, not the size of X
        return len(self.indices_)

chkoar commented 6 years ago

I can't help with this. I have never had the chance to play with PyTorch.

kaihhe commented 5 years ago

Is there any difference between resampling the data with the samplers before feeding it into the neural network and using the generators during training?

glemaitre commented 5 years ago

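Mainly memory usage: resampling up front materializes a resampled copy of the data, whereas a generator or sampler only selects indices batch by batch.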

mattbev commented 1 year ago

@glemaitre has any progress been made on this?

tuhinsharma121 commented 6 months ago

@jnothman @glemaitre Can I take it up if nobody is working on it?