lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Add new sampler: weighted sampler #1344

Closed marcoyang1998 closed 4 weeks ago

marcoyang1998 commented 1 month ago

Add a weighted sampler, where each cut's sampling probability is proportional to its weight. This is useful for unbalanced datasets, where some classes have very few samples. The weight for each cut should be computed by the user and passed to the sampler. It is similar to PyTorch's WeightedRandomSampler.

This sampler only works with eager manifests, since we need to perform sampling globally over all cuts.
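
As a rough illustration of the idea (not the sampler implemented in this PR), per-cut weighted sampling with replacement can be sketched with the standard library alone. The helper name `sample_cut_indices` and its parameters are hypothetical. The sketch assumes one weight per cut, all known up front, which is why a lazy manifest does not fit.

```python
import random
from typing import List, Sequence


def sample_cut_indices(
    weights: Sequence[float], num_samples: int, seed: int = 0
) -> List[int]:
    """Draw `num_samples` indices with replacement, P(i) proportional to weights[i].

    Hypothetical helper for illustration only; it mirrors the behaviour of
    torch.utils.data.WeightedRandomSampler with replacement=True.
    """
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and sum to a positive value")
    rng = random.Random(seed)
    # random.choices normalises the weights internally.
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Toy weights (made up): one weight per cut in an eager manifest.
cut_weights = [0.1, 0.1, 5.0, 0.1]
indices = sample_cut_indices(cut_weights, num_samples=8)
```

The returned indices would then be used to look up cuts in the in-memory manifest; a cut with a larger weight is expected to appear more often within an epoch.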

pzelasko commented 1 month ago

Thanks @marcoyang1998, I appreciate your work. I think you could achieve a similar outcome by splitting the CutSet into one subset per class, and then using CutSet.mux to get a CutSet that can be passed to any of the existing samplers. It would also work with lazy manifests and bucketing.

from lhotse import CutSet

# One CutSet per class and one sampling weight per class.
class_cutsets = [cuts_class0, cuts_class1, ...]
class_weights = [w_class0, w_class1, ...]
cuts = CutSet.mux(*class_cutsets, weights=class_weights)

marcoyang1998 commented 1 month ago

Hi Piotr,

Thanks for the CutSet.mux example. I thought about this before as well, and it should work very well if the samples under the same class share an equal weight. However, in my setup, which is a multi-class classification task (AudioSet), every sample has a unique sampling weight, so using CutSet.mux is impractical: it would need a separate subset for each cut.
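
To make the "unique weight per sample" point concrete, here is a minimal sketch of one way such weights could be derived. This is an assumption for illustration: the thread does not say how the weights are actually computed. Each cut carries several labels, and its weight is taken as the sum of the inverse frequencies of those labels. The function `per_cut_weights`, the cut IDs, and the labels are all hypothetical toy values.

```python
from collections import Counter
from typing import Dict, List


def per_cut_weights(labels_per_cut: Dict[str, List[str]]) -> Dict[str, float]:
    """Weight each cut by the summed inverse frequency of its labels.

    Hypothetical scheme for illustration; rare labels contribute more.
    """
    label_counts = Counter(
        label for labels in labels_per_cut.values() for label in labels
    )
    return {
        cut_id: sum(1.0 / label_counts[label] for label in labels)
        for cut_id, labels in labels_per_cut.items()
    }


# Toy data (made up): each cut has a different label combination.
toy_labels = {
    "cut-0": ["Speech"],
    "cut-1": ["Speech", "Music"],
    "cut-2": ["Speech", "Music", "Dog"],
}
weights = per_cut_weights(toy_labels)
```

Because the weight depends on the label combination, cuts rarely fall into a small number of equal-weight groups, which is the case where grouping by class and muxing stops being convenient.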