lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Cut mixing issues #1312

Closed: osadj closed this issue 6 months ago

osadj commented 6 months ago

@pzelasko I recently started noticing reproducibility problems and accuracy regressions in my experiments, and I traced them to the recent changes in the mix method of CutSet. Previously, we eagerly used the sample method to select noise cuts to be mixed with the current cut. Starting in v1.19 (which was buggy and returned the original cut in addition to the augmented one), the new LazyCutMixer uses a very small shuffling buffer of 100, which causes the noise cuts to be sampled almost always from the same noise type (say I have 500 babble, 500 music, and 500 general non-speech samples in my noise CutSet). This is highly undesirable behavior; ideally we want to sample the noise set at random with replacement to ensure acoustic diversity during training.
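To see why a 100-item buffer over a type-grouped manifest behaves this way, here is a toy stand-in for buffered (approximate) shuffling. This is a simplified sketch, not the actual LazyCutMixer code; the noise-type counts mirror the hypothetical set above:

```python
import random

def buffered_shuffle(stream, buffer_size=100, rng=None):
    """Approximate shuffling with a fixed-size buffer: each yielded item
    is drawn at random from only the next `buffer_size` pending items."""
    rng = rng or random.Random(0)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # drain the remainder once the stream is exhausted
        yield buffer.pop(rng.randrange(len(buffer)))

# A type-grouped noise manifest: 500 babble, then 500 music, then 500
# general non-speech cuts.
noise = ["babble"] * 500 + ["music"] * 500 + ["nonspeech"] * 500
first_100 = list(buffered_shuffle(iter(noise)))[:100]
print(first_100.count("babble"))  # → 100: the early draws are all one type
```

Because a draw can only come from the current 100-item window, the first hundreds of samples are necessarily all babble, which is the correlation described above.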

While digging into this issue, I also noticed that the truncation logic and the subsequent while loop are not correct. Assume a batch of 3 samples with a max duration of 10 seconds (one audio is 3s long, one is 5s, and the longest is 10s). Currently, the code truncates the noise by the current cut's duration, while the while loop uses the target duration (i.e., the max duration in the batch) to keep sampling noise cuts until `duration` seconds of noise have been mixed in at a given SNR. This is unnecessary if the first sampled noise already covers the target duration, but because we truncate by the cut duration, the while loop still runs. Also, the random sub-region of the noise is only selected the first time we sample a noise cut, and not inside the while loop.
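A minimal sketch of the corrected logic described above (the function name and durations are hypothetical, not the actual lhotse implementation): truncate against the target duration, and re-draw the sub-region offset on every iteration:

```python
import random

def sample_noise_segments(noise_durations, target_duration, rng):
    """Keep drawing noise cuts until `target_duration` seconds are
    covered, and pick a fresh random sub-region of every drawn cut."""
    segments = []  # (noise_index, offset_in_noise, length) triples
    remaining = target_duration
    while remaining > 0:
        i = rng.randrange(len(noise_durations))
        noise_dur = noise_durations[i]
        take = min(noise_dur, remaining)
        # Random sub-region chosen on *every* draw, not just the first.
        offset = rng.uniform(0, noise_dur - take)
        segments.append((i, offset, take))
        remaining -= take
    return segments

# Three hypothetical noise cuts of 4s, 12s, and 6s; target of 10s.
# If the 12s cut is drawn first, a single segment covers everything
# and the loop exits immediately.
segs = sample_noise_segments([4.0, 12.0, 6.0], 10.0, random.Random(7))
```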

Can you please help address these? Thanks.

pzelasko commented 6 months ago

This is interesting, can you share a bit more about the scale of the degradation and whether fixing these issues helped recover the original performance?

> the new LazyCutMixer uses a very small shuffling buffer of 100 that causes the noise cuts to be sampled almost always from the same noise type (let's say I have 500 babble, 500 music, and 500 general non-speech samples in my noise CutSet).

You can either pre-shuffle the noise manifest lines, or call noise_cuts = noise_cuts.to_eager() to load them in memory if they're not too large (in that case .shuffle will perform true and not approximate shuffling). If the noise cuts dataset is very large I think pre-shuffling is the way to go. We could technically expose the shuffle buffer size option but I feel like it may just be too many options.
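For the pre-shuffling route, a stdlib-only one-off step along these lines would randomize the manifest line order on disk (the helper name and paths are illustrative, not part of lhotse):

```python
import gzip
import random

def preshuffle_manifest(src, dst, seed=0):
    """One-off preprocessing sketch: fully shuffle the lines of a
    gzipped JSONL manifest on disk so that a small streaming shuffle
    buffer no longer sees long same-type runs. Reads every line into
    memory, so it is meant for manifest text, not audio."""
    with gzip.open(src, "rt") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)  # true, full shuffle
    with gzip.open(dst, "wt") as f:
        f.writelines(lines)
```

After this, the mixer's approximate shuffle operates on an already randomized order, so the small buffer matters much less.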

> While I was digging into this issue, I also noticed that the logic for truncation and the subsequent while loop is not correct. Let's assume that we have a batch of 3 samples with a max duration of 10 seconds (one audio is 3s long, one is 5s long, and the longest is 10s long). Currently, in the code we truncate the noise by the current cut duration, while the while loop uses the target duration (i.e., the max duration in the batch) to keep sampling noise cuts until we mix duration amount of noise. This is unnecessary if the first sampled noise already supports the target duration, but because we truncate by the cut duration the while loop is still performed. Also, the random sub-region in the noise is only selected the first time we sample a noise and not in the while loop.

You have a point here, would you be willing to contribute a PR that fixes those issues?

osadj commented 6 months ago

Thanks @pzelasko. I see a 3-5% relative increase in WER depending on the test set. I can make the noise cut set eager, but the issue here is that repeat() returns a lazy, indefinite copy. Pre-shuffling helps to some extent, but LazyCutMixer still doesn't behave the same as eager sampling (i.e., cuts.sample()). I'm wondering if it's possible to restore the previous eager approach and let the user choose between the two.

> You have a point here, would you be willing to contribute a PR that fixes those issues?

Sure. Let me take a stab at this and submit a PR.

kamirdin commented 6 months ago

May I ask why we have to use

    mix_in_cuts = iter(self.mix_in_cuts.repeat().shuffle(rng=rng, buffer_size=100))
    to_mix = next(mix_in_cuts)

instead of

    to_mix = random.choice(self.mix_in_cuts)

pzelasko commented 6 months ago

@osadj @kamirdin I took both of your PRs, merged them into one, and fixed some subtle issues. See #1315

You were both correct in the issues you identified. In general we have to support both eager and lazy cutsets, and we have no way of knowing whether we can auto-convert lazy to eager (the data could be huge, infinite, or represent in-memory audio, any of which would cause an OOM on most systems). I tweaked your solutions accordingly; see #1315 for the details.
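One way to sketch the eager/lazy split described here (hypothetical code, not the merged fix: true sampling with replacement when the cuts are an in-memory sequence, an approximately shuffled endless stream when they are a lazy, re-iterable manifest):

```python
import random
from collections.abc import Sequence

def noise_sampler(noise_cuts, rng, buffer_size=100):
    """Hypothetical dispatch between the two storage modes."""
    if isinstance(noise_cuts, Sequence):
        while True:  # eager: random access with replacement
            yield noise_cuts[rng.randrange(len(noise_cuts))]
    else:
        buffer = []
        while True:  # lazy: re-open and stream the manifest indefinitely
            for item in noise_cuts:
                buffer.append(item)
                if len(buffer) >= buffer_size:
                    yield buffer.pop(rng.randrange(len(buffer)))
```

The lazy branch assumes the manifest is re-iterable (each pass over it starts from the beginning), and it never materializes more than `buffer_size` items, which is why it cannot offer the same uniformity as the eager branch.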

In addition, you can always split the noise cutset into several pieces/shards and load them as CutSet.from_files([p1, p2, ...], seed=...) (a good option for seed is "trng" if you feel paranoid about the randomness), which will re-shuffle the order of the shards each time you exhaust the dataset.
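The shard re-shuffling behavior can be sketched in plain Python (a toy stand-in; real code would read cuts from each shard path rather than yield the path itself):

```python
import random

def shard_stream(shard_paths, seed=0):
    """Toy sketch of shard-level re-shuffling: each time the dataset
    is exhausted, a new random shard order is drawn."""
    rng = random.Random(seed)
    epoch = 0
    while True:
        order = list(shard_paths)
        rng.shuffle(order)      # new shard order every pass
        for path in order:
            yield epoch, path   # real code would yield cuts read from `path`
        epoch += 1
```

Combined with per-shard shuffling inside each piece, this gives much better global mixing than a single large sequential manifest.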

osadj commented 6 months ago

Thanks @pzelasko. It looks good. I have a follow-up question about the cut transform for noise mixing (i.e., CutMix). Currently it takes a seed as input, which is inconsistent with other cut transforms like speed/volume perturbation and reverb, which take random number generators. I think it makes sense to use the same approach for CutMix: initialize a random number generator once at the beginning of training and pass it to CutMix, which then repeatedly calls cuts.mix for every batch. Although we made the LazyCutMixer stateful, CutMix instantiates a new instance for every batch, rendering the state ineffective. Is my understanding correct?
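The state problem described here can be demonstrated with two toy transforms (illustrative classes, not the actual CutMix API): re-seeding on every instantiation repeats the same draws each batch, while a shared rng created once keeps advancing.

```python
import random

class SeededTransform:
    """Re-seeds itself on every instantiation, mimicking the problem
    of constructing a fresh transform per batch from a fixed seed."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def snr(self):
        return self.rng.uniform(10, 20)

class RngTransform:
    """Takes a shared rng created once at training start."""
    def __init__(self, rng):
        self.rng = rng
    def snr(self):
        return self.rng.uniform(10, 20)

# A new seeded instance per batch keeps drawing the same value...
per_batch = [SeededTransform(seed=0).snr() for _ in range(3)]
assert per_batch[0] == per_batch[1] == per_batch[2]

# ...while a shared rng advances across batches.
shared = random.Random(0)
t = RngTransform(shared)
varied = [t.snr() for _ in range(3)]
assert len(set(varied)) == 3
```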

pzelasko commented 6 months ago

Great catch!! I forgot about this API. This PR should fix it: https://github.com/lhotse-speech/lhotse/pull/1316

pzelasko commented 6 months ago

Now that these PRs have been merged, could you confirm that the regressions are gone once you re-run your training? Thanks.

osadj commented 6 months ago

Confirmed! Thanks again for the help.