mila-iqia / fuel

A data pipeline framework for machine learning
MIT License
867 stars 268 forks

Balanced sampling scheme #332

Open markusnagel opened 8 years ago

markusnagel commented 8 years ago

I created an iteration scheme that allows balanced sampling from different classes, groups, or clusters. This is relevant for classification tasks with class imbalance, which is common in real-world problems. If a dataset is very skewed, for example, the learning algorithm might ignore the small classes, especially in the early stages of training. Ways to overcome this are either to weight the underrepresented classes or to sample an equal number of examples from each class. This iteration scheme focuses on the latter and enables Fuel to do such equal sampling. It allows both subsampling (i.e. downsampling the overrepresented classes) and upsampling (i.e. sampling the underrepresented classes more often, with replacement). The number of samples per class can be specified manually. The scheme is not limited to classification: it can be used for any kind of groups that should be represented equally in the training set (e.g. clusters from a clustering step, to avoid too-similar examples in semi-supervised learning).
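To make the idea concrete, here is a minimal NumPy-only sketch of the index selection such a scheme could perform. It is not the code from this issue and does not use Fuel's API: the function name `balanced_indices` and its parameters `samples_per_class` and `seed` are hypothetical, chosen only for illustration. It assumes that the default number of samples per class is the size of the smallest class, and that sampling with replacement is used only when a class has fewer members than requested.

```python
import numpy as np


def balanced_indices(labels, samples_per_class=None, seed=0):
    """Return shuffled example indices with the same count for every class.

    Hypothetical helper for illustration only; not part of Fuel.
    """
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    if samples_per_class is None:
        # Assumed default: subsample every class to the smallest class size.
        samples_per_class = min(int(np.sum(labels == c)) for c in classes)

    chosen = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # Upsample with replacement only if the class is too small;
        # otherwise subsample without replacement.
        replace = len(members) < samples_per_class
        chosen.append(
            rng.choice(members, size=samples_per_class, replace=replace)
        )

    indices = np.concatenate(chosen)
    rng.shuffle(indices)  # mix the classes so batches are not grouped by class
    return indices


# Usage: 8 examples of class 0 and 2 of class 1, 4 samples drawn per class.
labels = [0] * 8 + [1] * 2
idx = balanced_indices(labels, samples_per_class=4)
per_class_counts = np.bincount(np.asarray(labels)[idx])
```

In this usage example class 0 is subsampled from 8 to 4 examples, while class 1 is upsampled from 2 to 4 with replacement, so `per_class_counts` holds 4 for both classes. A real iteration scheme would presumably regenerate these indices every epoch and hand them out in batches.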