Closed — marcoyang1998 closed this 4 weeks ago
Thanks @marcoyang1998, I appreciate your work. I think you could achieve a similar outcome by splitting the cutset into per-class subsets and then using `CutSet.mux` to obtain a single cutset that can be passed to any of the existing samplers. It would also work with lazy manifests and bucketing.
```python
class_cutsets = [cuts_class0, cuts_class1, ...]
class_weights = [w_class0, w_class1, ...]
cuts = CutSet.mux(*class_cutsets, weights=class_weights)
```
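To illustrate what the weights mean here: `mux` repeatedly picks which sub-cutset supplies the next cut, with probability proportional to that cutset's weight, while preserving each source's internal order. A toy stand-in over plain Python lists (the `toy_mux` helper and the example data are illustrative, not Lhotse's actual implementation):

```python
import random

def toy_mux(sources, weights, rng=None):
    """Toy stand-in for CutSet.mux over plain lists: repeatedly pick a
    source with probability proportional to its weight and yield that
    source's next item, until every source is exhausted."""
    rng = rng or random.Random()
    iters = [iter(s) for s in sources]
    weights = list(weights)
    while any(w > 0 for w in weights):
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # Source i is exhausted; stop drawing from it.
            weights[i] = 0.0

# "Class 0" is drawn roughly twice as often as "class 1":
mixed = list(toy_mux([["a1", "a2"], ["b1"]], weights=[2.0, 1.0],
                     rng=random.Random(0)))
```

Note that the weight applies to the whole sub-cutset: items within one source keep their manifest order and cannot be individually reweighted.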
Hi Piotr,

Thanks for the `CutSet.mux` example. I had thought about this as well, and it should work very well when all samples within a class share the same weight. However, my setup is multi-class classification (AudioSet), where every sample has its own sampling weight, so `CutSet.mux` is impractical for this scenario.
Add a weighted sampler, where each cut's sampling probability is proportional to its weight. This is useful for imbalanced datasets, where some classes have very little data. The weight for each cut should be computed by the user and passed to the sampler. It is similar to PyTorch's `WeightedRandomSampler` (see here). This sampler only works with an eager manifest, since sampling must be performed globally.
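A minimal sketch of the sampling idea in plain Python (the `weighted_sample_indices` helper and its signature are illustrative, not the sampler's actual API):

```python
import random

def weighted_sample_indices(weights, num_samples, replacement=True, rng=None):
    """Draw indices with probability proportional to `weights`, similar in
    spirit to torch.utils.data.WeightedRandomSampler. Needs the full weight
    list up front, which is why an eager manifest is required."""
    rng = rng or random.Random()
    population = range(len(weights))
    if replacement:
        return rng.choices(population, weights=weights, k=num_samples)
    # Without replacement: draw one index at a time and zero out its weight
    # (simple O(n * k) approach, fine for a sketch).
    weights = list(weights)
    chosen = []
    for _ in range(num_samples):
        idx = rng.choices(population, weights=weights, k=1)[0]
        chosen.append(idx)
        weights[idx] = 0.0
    return chosen
```

In the sampler, the drawn indices would select cuts from the eager `CutSet`; e.g. weights could be set to the inverse frequency of each cut's class so that rare classes are sampled more often.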