pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Enable custom samplers for imbalanced datasets #8093

Open PierreQuinton opened 1 year ago

PierreQuinton commented 1 year ago

🚀 The feature

For each classification dataset with a balanced class distribution (MNIST, CIFAR-N, etc.), it would be very useful to provide a standard imbalanced version of the dataset. For a dataset with $n$ classes, define the imbalance factor $a\in [0,1]$; the proportion of class $i$ is then typically proportional to $a^{i/(n-1)}$, normalized so that the proportions sum to $1$. For $a=1$ this is uniform, and the smaller the imbalance factor, the more imbalanced the dataset is.
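The proportions described above can be computed in a few lines. This is only a minimal sketch: the function name `class_proportions` is hypothetical, and it assumes classes are indexed $i = 0, \dots, n-1$ so that the exponent runs from $0$ to $1$.

```python
def class_proportions(num_classes: int, imbalance: float) -> list[float]:
    """Return proportions p_i proportional to imbalance ** (i / (num_classes - 1)).

    Assumption: classes are indexed 0 .. num_classes - 1.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= imbalance <= 1.0:
        raise ValueError("imbalance must lie in [0, 1]")
    # Unnormalized weight of each class.
    raw = [imbalance ** (i / (num_classes - 1)) for i in range(num_classes)]
    # Normalize so that the proportions sum to 1.
    total = sum(raw)
    return [w / total for w in raw]


uniform = class_proportions(10, 1.0)     # every class gets the same share
long_tail = class_proportions(10, 0.01)  # first class is 100x the last one
```

With `imbalance=1.0` every class gets the same share; with `imbalance=0.01` the ratio between the first and last class is $1/a = 100$.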

I am not sure whether torchvision should provide the datasets themselves or a data loader that imbalances an existing dataset.

Motivation, pitch

Many papers have been published on the problem of training on an imbalanced dataset and testing on a balanced one; for instance, see this. As far as I know, there is no systematic way of generating such datasets for people using PyTorch. Here are a few very similar implementations that are not fully satisfying:

Such datasets seem to exist for TensorFlow; for instance, section 3 of the README of this repo provides links to download TFRecord datasets.

I feel like it would be a very nice feature for torchvision to either contain such datasets or make them easy to craft.

Alternatives

No response

Additional context

No response

cc @pmeier

NicolasHug commented 1 year ago

Hi @PierreQuinton,

It seems like what you need is a custom Sampler. IIUC, https://github.com/ufoym/imbalanced-dataset-sampler should be pretty close to what you're looking for?
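To make the sampler idea concrete, here is a rough sketch of a sampler that draws an imbalanced subset from a balanced, labeled dataset. It is not the implementation from the linked repository: the class name `ImbalancedSubsetSampler`, its parameters, and the rule "the class of rank $r$ keeps a fraction $a^{r/(n-1)}$ of its samples" are assumptions made for illustration. It is written in plain Python and only produces indices.

```python
import random
from collections import defaultdict
from typing import Iterator, Sequence


class ImbalancedSubsetSampler:
    """Hypothetical sampler yielding indices of an imbalanced subset.

    Assumption: the class of rank r (labels sorted ascending) keeps a
    fraction imbalance ** (r / (n - 1)) of its samples.
    """

    def __init__(self, labels: Sequence[int], imbalance: float, seed: int = 0) -> None:
        if not 0.0 < imbalance <= 1.0:
            raise ValueError("imbalance must lie in (0, 1]")
        # Group dataset indices by class label.
        by_class = defaultdict(list)
        for index, label in enumerate(labels):
            by_class[label].append(index)
        classes = sorted(by_class)
        n = len(classes)
        self._rng = random.Random(seed)
        self.indices: list[int] = []
        for rank, label in enumerate(classes):
            exponent = rank / (n - 1) if n > 1 else 0.0
            # Keep at least one sample per class.
            keep = max(1, round(len(by_class[label]) * imbalance ** exponent))
            self.indices.extend(self._rng.sample(by_class[label], keep))

    def __iter__(self) -> Iterator[int]:
        # Reshuffle the fixed subset on every pass (epoch).
        order = list(self.indices)
        self._rng.shuffle(order)
        return iter(order)

    def __len__(self) -> int:
        return len(self.indices)


# Toy balanced dataset: 10 classes with 100 samples each.
labels = [c for c in range(10) for _ in range(100)]
sampler = ImbalancedSubsetSampler(labels, imbalance=0.1)
```

Because the object is an iterable of indices with `__len__`, it should be usable as `DataLoader(dataset, sampler=sampler)` from `torch.utils.data`, with `labels` taken from the dataset's targets.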

PierreQuinton commented 1 year ago

@NicolasHug Thanks for your answer. Yes, this is exactly what I am looking for. I'm not sure whether you would like to add something similar to torch or close the issue; I leave it up to you.

NicolasHug commented 1 year ago

Thanks @PierreQuinton. I'll keep the issue open and rename it for clarity. Ultimately, what is needed to enable that is:

i) is definitely in scope for torchvision, and it is something we'd do if we ever restart our work on a dataset revamp (CC @pmeier). For ii), we can decide when the time comes, but I don't see why not.