Closed osadj closed 6 months ago
This is interesting, can you share a bit more about the scale of the degradation and whether fixing these issues helped recover the original performance?
the new LazyCutMixer uses a very small shuffling buffer of 100 that causes the noise cuts to be sampled almost always from the same noise type (let's say I have 500 babble, 500 music, and 500 general non-speech samples in my noise CutSet).
You can either pre-shuffle the noise manifest lines, or call noise_cuts = noise_cuts.to_eager() to load them in memory if they're not too large (in that case .shuffle will perform true, not approximate, shuffling). If the noise cuts dataset is very large, I think pre-shuffling is the way to go. We could technically expose the shuffle buffer size option, but I feel like it may just be too many options.
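To make the failure mode concrete, here is a minimal, lhotse-free simulation of an approximate streaming shuffle with a fixed-size buffer (buffered_shuffle is illustrative only, not lhotse's implementation): with an ordered manifest of 500 babble / 500 music / 500 other cuts and a buffer of 100, the earliest draws can only come from the front of the manifest, so they are all babble.

```python
import random

def buffered_shuffle(items, buffer_size, rng):
    """Approximate streaming shuffle: fill a fixed-size buffer, then
    emit a random buffered element as each new item arrives.
    Illustrative only -- not lhotse's actual implementation."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer

# An ordered "noise manifest": 500 babble, then 500 music, then 500 other.
manifest = ["babble"] * 500 + ["music"] * 500 + ["other"] * 500

rng = random.Random(0)
shuffled = list(buffered_shuffle(manifest, buffer_size=100, rng=rng))
first_200 = shuffled[:200]
# With a buffer of only 100, the first 200 draws can only come from the
# first ~300 manifest entries, so they are all "babble":
print(first_200.count("babble"))  # 200
```

Pre-shuffling the manifest (or loading it eagerly) removes this ordering, which is why either workaround restores noise-type diversity.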
While I was digging into this issue, I also noticed that the logic for truncation and the subsequent while loop is not correct. Let's assume that we have a batch of 3 samples with a max duration of 10 seconds (one audio is 3s long, one is 5s long, and the longest is 10s long). Currently, the code truncates the noise by the current cut's duration, while the while loop uses the target duration (i.e., the max duration in the batch) to keep sampling noise cuts until we mix duration amount of noise. This is unnecessary if the first sampled noise already covers the target duration, but because we truncate by the cut duration, the while loop is still performed. Also, the random sub-region in the noise is only selected the first time we sample a noise cut, not in the while loop.
You have a point here, would you be willing to contribute a PR that fixes those issues?
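The corrected behavior described in the quoted report can be sketched as follows, under the assumption that the loop should target the batch's max duration and draw a fresh random sub-region on every iteration (sample_noise_spans is a hypothetical helper, not lhotse code):

```python
import random

def sample_noise_spans(noise_durations, target_duration, rng):
    """Hypothetical corrected logic: keep sampling noise cuts until the
    *target* duration (the longest cut in the batch) is covered,
    truncating only the excess, and drawing a fresh random sub-region
    on every iteration -- not just the first."""
    covered, spans = 0.0, []
    while covered < target_duration:
        noise_dur = rng.choice(noise_durations)
        take = min(noise_dur, target_duration - covered)
        offset = rng.uniform(0.0, noise_dur - take)  # fresh sub-region
        spans.append((offset, take))
        covered += take
    return spans

rng = random.Random(0)
spans = sample_noise_spans([12.0, 4.0, 7.0], target_duration=10.0, rng=rng)
print(sum(take for _, take in spans))  # 10.0: covers the target exactly

# When the first sampled noise already covers the target, a single pass
# suffices and no extra while-loop iterations are performed:
single = sample_noise_spans([12.0], target_duration=10.0, rng=rng)
print(len(single))  # 1
```

The key difference from the buggy behavior is that truncation uses the target (batch max) duration rather than the current cut's duration, so the loop exits as soon as enough noise is accumulated.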
Thanks @pzelasko. I see 3-5% relative increase in WER depending on the test set.
I can set the noise cut set to eager, but the issue here is the repeat that returns a lazy indefinite copy. Pre-shuffling helps to some extent, but LazyCutMixer still doesn't behave the same as with eager sampling (i.e., cuts.sample()). I'm wondering if it's possible to restore the previous eager approach and let the user choose between the two approaches.
You have a point here, would you be willing to contribute a PR that fixes those issues?
Sure. Let me take a stab at this and submit a PR.
May I ask why we have to use
mix_in_cuts = iter(self.mix_in_cuts.repeat().shuffle(rng=rng, buffer_size=100))
to_mix = next(mix_in_cuts)
instead of
to_mix = random.choice(self.mix_in_cuts)
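For context on why random.choice cannot simply replace the lazy machinery (as the next comment notes, both eager and lazy cutsets must be supported): random.choice requires a sequence with a known length and random access, which a lazy, potentially infinite dataset does not provide. A minimal plain-Python illustration:

```python
import random
from itertools import count

# random.choice needs a sequence with a known length and random access:
eager = [1, 2, 3]
picked = random.choice(eager)  # works

# A lazy/streaming dataset is just an iterator; it may be huge or even
# infinite, so it has no len() and random.choice cannot be applied:
lazy = count()  # stand-in for a lazy, potentially infinite CutSet
try:
    random.choice(lazy)
    lazy_choice_failed = False
except TypeError:
    lazy_choice_failed = True
print(picked in eager, lazy_choice_failed)  # True True
```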
@osadj @kamirdin I took both of your PRs, merged them into one, and fixed some subtle issues. See #1315
You were both correct in the issues you identified. In general, we have to support both eager and lazy cutsets, and we have no way of knowing if we can auto-convert lazy to eager (it could be huge, infinite, or represent in-memory audio data, all of which would just cause OOM on most systems). I tweaked your solutions so that now we have the following:
In addition, you can always split the noise cutset into several pieces/shards and load them as CutSet.from_files([p1, p2, ...], seed=...) (a good option for seed is "trng" if you feel paranoid about the randomness), which will re-shuffle the order of shards each time you exhaust the dataset.
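A small lhotse-free simulation of the shard-reshuffling behavior described above (shard_stream is an illustrative stand-in for what CutSet.from_files does, not the actual implementation):

```python
import random
from itertools import islice

def shard_stream(shards, seed):
    """Endless stream over shards; the shard order is re-drawn each
    time the whole dataset has been exhausted. Illustrative stand-in
    for CutSet.from_files([p1, p2, ...], seed=...), not lhotse code."""
    rng = random.Random(seed)
    while True:
        order = list(shards)
        rng.shuffle(order)
        for shard in order:
            yield from shard

shards = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
stream = shard_stream(shards, seed=0)
two_epochs = list(islice(stream, 12))
# Each 6-item "epoch" contains every cut exactly once, but the shard
# order can differ between epochs:
print(sorted(two_epochs[:6]) == sorted(two_epochs[6:]))  # True
```

Shard-level reshuffling complements the buffered shuffle: even with a small buffer, different epochs interleave the noise types in a different global order.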
Thanks @pzelasko. It looks good.
I have a follow-up question about the cut transform for noise mixing (i.e., CutMix). Currently, it takes a seed as input, which is inconsistent with other cut transforms like speed/volume perturb and reverb that take random number generators. I think it makes sense to use the same approach for CutMix, where we initialize a random number generator once at the beginning of training and pass it to CutMix, which then repeatedly calls cuts.mix for every batch. Although we made the LazyCutMixer stateful, CutMix instantiates a new instance for every batch, rendering the state ineffective.
Is my understanding correct?
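The suggested design can be sketched in plain Python: one shared random number generator is created at the start of training and handed to a long-lived transform object, so its state advances across batches (NoiseMixTransform is a hypothetical stand-in, not lhotse's actual CutMix API):

```python
import random

class NoiseMixTransform:
    """Hypothetical cut transform (not lhotse's actual CutMix API) that
    holds a shared RNG instead of re-seeding per batch, so its state
    advances across batches."""
    def __init__(self, rng):
        self.rng = rng  # shared, stateful generator

    def __call__(self, batch):
        # Each call advances the shared RNG; identical inputs still get
        # different noise draws on successive batches.
        return [(cut, self.rng.random()) for cut in batch]

rng = random.Random(42)          # initialized once at training start
transform = NoiseMixTransform(rng)
b1 = transform(["cut1", "cut2"])
b2 = transform(["cut1", "cut2"])
print(b1 != b2)  # True: state carried over between batches
```

With a fixed seed the whole draw sequence is still reproducible run-to-run, which is the point of passing an rng rather than re-seeding per batch.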
Great catch!! I forgot about this API. This PR should fix that https://github.com/lhotse-speech/lhotse/pull/1316
Now that these PRs have been merged, could you confirm that the regressions are gone once you re-run your training? Thanks.
Confirmed! Thanks again for the help.
@pzelasko I recently started noticing some reproducibility problems and accuracy regressions in my experiments, and I was able to track it down to the recent changes in the mix method of CutSet. Previously, we were eagerly using the sample method to select noise cuts to be mixed with the current cut; however, starting in v1.19 (which was buggy and returned the original in addition to the augmented cut), the new LazyCutMixer uses a very small shuffling buffer of 100 that causes the noise cuts to be sampled almost always from the same noise type (let's say I have 500 babble, 500 music, and 500 general non-speech samples in my noise CutSet). This is a highly undesirable behavior; ideally we want to sample the noise set at random with replacement to ensure acoustic diversity during training.
While I was digging into this issue, I also noticed that the logic for truncation and the subsequent while loop is not correct. Let's assume that we have a batch of 3 samples with a max duration of 10 seconds (one audio is 3s long, one is 5s long, and the longest is 10s long). Currently, the code truncates the noise by the current cut's duration, while the while loop uses the target duration (i.e., the max duration in the batch) to keep sampling noise cuts until we mix duration amount of noise at a given SNR. This is unnecessary if the first sampled noise already covers the target duration, but because we truncate by the cut duration, the while loop is still performed. Also, the random sub-region in the noise is only selected the first time we sample a noise cut, not in the while loop.
Can you please help address these? Thanks.