lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

DynamicBucketingSampler smallest bucket very uneven #943

Closed danpovey closed 1 year ago

danpovey commented 1 year ago

I have noticed that OOM problems when training our models tend to be related to very uneven-sized buckets, particularly ones involving shorter utterances. This can happen with DynamicBucketingSampler because the tails of the duration distribution can be long. What do you think is the correct fix for this?

pzelasko commented 1 year ago

What could help out of the box is increasing num_buckets (for better granularity in duration bins) and num_cuts_for_bins_estimate (for a more accurate distribution estimation).

We could try and design some heuristics like "max duration difference / ratio" but I'm concerned it could be difficult to tune to avoid under-utilizing the GPU.

Another option is extending the API to accept user-provided duration bins so that you could override the auto-estimated ones to account for long tailed distributions.

desh2608 commented 1 year ago

I have started adding a max_cuts constraint in addition to the max_duration recently to avoid OOM due to constant factors.

danpovey commented 1 year ago

I think a more principled solution that would not require too much tuning would be to make it possible for the total-duration constraint to be: the largest duration in the bucket, times the number of cuts in the bucket.

pzelasko commented 1 year ago

> I think a more principled solution that would not require too much tuning would be to make it possible for the total-duration constraint to be: the largest duration in the bucket, times the number of cuts in the bucket.

We already did that, unfortunately it seems like that's not enough https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/sampling/base.py#L361

The tricky thing is that for other kinds of models I'm observing the issue to be typically with the longest duration bucket rather than the shortest one (batches from it tend to claim more GPU memory so most of the time the GPU tends to be under-utilized). I'm not sure if there's a single heuristic that can handle all of these cases. Maybe the best thing would be some sort of auto-tuning function that takes your data and model, tries a few different settings and suggests the settings for you...

danpovey commented 1 year ago

> > I think a more principled solution that would not require too much tuning would be to make it possible for the total-duration constraint to be: the largest duration in the bucket, times the number of cuts in the bucket.
>
> We already did that, unfortunately it seems like that's not enough https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/sampling/base.py#L361

Are you sure you pointed to the correct line there? The line of code seems to be enforcing a max_cuts constraint rather than treating the max-duration as a limit on (num elements in batch * duration of longest element of batch). My proposed solution works equally well with long-duration buckets to solve the unevenness issue.

> The tricky thing is that for other kinds of models I'm observing the issue to be typically with the longest duration bucket rather than the shortest one (batches from it tend to claim more GPU memory so most of the time the GPU tends to be under-utilized). I'm not sure if there's a single heuristic that can handle all of these cases. Maybe the best thing would be some sort of auto-tuning function that takes your data and model, tries a few different settings and suggests the settings for you...

pzelasko commented 1 year ago

My bad -- I wanted to point to line 365 in the same file: https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/sampling/base.py#L365

It tracks the longest duration seen and the number of cuts, and decides whether max_duration is exceeded based on longest_seen (duration) * num_cuts. This is what you had in mind, right?
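
To make the check concrete, here is a minimal sketch of a constraint that compares longest_seen * num_cuts against max_duration. The class name PaddedDurationConstraint and its methods are hypothetical and only illustrate the idea; this is not the code in lhotse's base.py.

```python
from dataclasses import dataclass


@dataclass
class PaddedDurationConstraint:
    """Hypothetical illustration of the check described above (not lhotse's class)."""

    max_duration: float        # budget in seconds for one padded batch
    longest_seen: float = 0.0  # longest cut duration added so far
    num_cuts: int = 0          # number of cuts added so far

    def add(self, duration: float) -> None:
        # Track the longest cut and the cut count as cuts are added.
        self.longest_seen = max(self.longest_seen, duration)
        self.num_cuts += 1

    def padded_total(self) -> float:
        # After padding, every cut costs as much as the longest one.
        return self.longest_seen * self.num_cuts

    def exceeded(self) -> bool:
        return self.padded_total() > self.max_duration


constraint = PaddedDurationConstraint(max_duration=100.0)
durations = [2.0, 3.0, 2.5, 20.0, 4.0, 1.0]
for d in durations:
    constraint.add(d)

# The plain sum is only 32.5 s, but the padded total is 6 * 20.0 = 120.0 s.
print(sum(durations), constraint.padded_total(), constraint.exceeded())
```

The example shows why a plain sum of durations underestimates the cost of a batch that mixes short cuts with one long cut.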

danpovey commented 1 year ago

Oh I see. Yes, that looks right. Perhaps instead of waiting until the constraint is exceeded, what is needed is to ask, "if we added this cut, would the constraint be exceeded?", and if so, just output what we have? The issue could be if we accumulate many small cuts in a batch and then one much longer one comes along that makes the constraint badly violated.

pzelasko commented 1 year ago

This is also pretty much what's being done, you can see it here:

https://github.com/lhotse-speech/lhotse/blob/0335d5ca03830d83f5ac188a322596d2a7d60d5f/lhotse/dataset/sampling/dynamic.py#L315-L324

When a cut is about to exceed the constraint we keep it for the next batch instead, and return what we have collected so far.
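
A rough sketch of that look-ahead behaviour, under the assumption that the constraint is longest duration times number of cuts. The function batches_with_lookahead is a hypothetical name operating on plain durations; it is not the code at the link above.

```python
from typing import Iterable, Iterator, List


def batches_with_lookahead(
    durations: Iterable[float], max_duration: float
) -> Iterator[List[float]]:
    """Yield batches; a cut that would break the budget starts the next batch."""
    batch: List[float] = []
    longest = 0.0
    for d in durations:
        would_be_longest = max(longest, d)
        # Ask "would adding this cut exceed the constraint?" before adding it.
        if batch and would_be_longest * (len(batch) + 1) > max_duration:
            yield batch
            batch, longest = [], 0.0
        # A single cut longer than max_duration still gets a batch of its own.
        batch.append(d)
        longest = max(longest, d)
    if batch:
        yield batch


result = list(batches_with_lookahead([1.0, 1.0, 1.0, 1.0, 30.0, 2.0, 2.0], max_duration=40.0))
print(result)
```

With this check the many short cuts are emitted before the 30 s cut is added, so the long cut never inflates the padded size of a batch of short ones.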

If you're using BucketingSampler or DynamicBucketingSampler and the durations are varying too much, it means that the buckets are not "tight enough" for your needs. Since we estimate the bucket duration bounds so that each bucket gets an equal total duration, we can only control the "tightness" using the num_buckets parameter. But if it sounds interesting, we can try a different approach where the number of buckets is not fixed and instead we allow a max bucket duration width (e.g. for width=1, when the bucket is centered at duration D, we allow [D - 0.5s, D + 0.5s)). The buckets are created dynamically as data becomes available (we pre-load buffer_size cuts to get the buckets and then draw more as we go). The issue might be that if you don't have enough data, you will be getting partial mini-batches and end up underutilizing the GPU anyway, so you still have to tweak some sampling parameters (in that case that would be width and possibly buffer_size).
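
For illustration, a simplified sketch of estimating bucket bounds so that each bucket holds roughly equal total duration. The function estimate_duration_bins is a hypothetical stand-in written for this example, not necessarily how lhotse computes its bins.

```python
from bisect import bisect_right
from typing import List, Sequence


def estimate_duration_bins(durations: Sequence[float], num_buckets: int) -> List[float]:
    """Return up to num_buckets - 1 boundaries giving roughly equal total duration per bucket."""
    assert num_buckets > 1
    ordered = sorted(durations)
    per_bucket = sum(ordered) / num_buckets
    bins: List[float] = []
    running = 0.0
    for d in ordered:
        # Close the current bucket once it has collected its share of the total.
        if running > per_bucket:
            bins.append(d)
            running = 0.0
        running += d
    return bins


sample = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 4.0, 8.0]
bins = estimate_duration_bins(sample, num_buckets=2)

# A cut goes to the bucket whose bounds contain its duration.
short_bucket = bisect_right(bins, 4.0)
long_bucket = bisect_right(bins, 8.0)
print(bins, short_bucket, long_bucket)
```

In this toy sample the first bucket spans cuts from 1 s to 4 s, a 4x spread, which is the kind of loose bucket that more buckets or a max-width rule would tighten.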

danpovey commented 1 year ago

Maybe this was changed at some point and I am using an older version of Lhotse? My version is: '1.3.0.dev+git.4198446.clean'. If the behavior is as you say, then the uneven-sized buckets should just have too-small total duration, they should not be too large and use excess memory. The code looks like this:

    logging.info("Using DynamicBucketingSampler.")
    train_sampler = DynamicBucketingSampler(
        cuts_train,
        max_duration=self.args.max_duration,
        shuffle=self.args.shuffle,
        num_buckets=self.args.num_buckets,
        drop_last=self.args.drop_last,
    )

and num-buckets is 30.

danpovey commented 1 year ago

Sorry guys, it turns out I was inadvertently using a super-old version of lhotse (1.3), in my site-packages was a lhotse/ subdirectory that it was picking up in preference to the .egg files, I must have done a dev install or something. So this may not be a real issue.

After updating lhotse, it's now batches with long utterances in them that are causing OOMs, as expected because the memory usage has a quadratic part.

One way to model this kind of thing would be to provide to the sampler a duration at which the quadratic part of the memory usage can be assumed to be the same as the linear part. So when enforcing the max-duration constraint, we could take the effective duration to be duration + duration^2 / quadratic_duration. I imagine this would be of the order of 15 to 40 seconds for the types of attention models we are using.
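
As a quick sketch of that formula: the helper names and the concrete values of quadratic_duration and max_duration below are assumptions chosen only for illustration.

```python
def effective_duration(duration: float, quadratic_duration: float) -> float:
    """duration + duration^2 / quadratic_duration, as proposed above."""
    return duration + duration ** 2 / quadratic_duration


def cuts_that_fit(duration: float, max_duration: float, quadratic_duration: float) -> int:
    """How many equal-length cuts fit in the budget under the effective duration."""
    return int(max_duration // effective_duration(duration, quadratic_duration))


QUADRATIC_DURATION = 30.0  # assumed value, within the 15 to 40 s range mentioned above
MAX_DURATION = 600.0       # assumed batch budget in seconds

# At duration == quadratic_duration the quadratic part equals the linear part,
# so the effective duration is exactly twice the real one.
doubled = effective_duration(30.0, QUADRATIC_DURATION)

short_fit = cuts_that_fit(3.0, MAX_DURATION, QUADRATIC_DURATION)
long_fit = cuts_that_fit(30.0, MAX_DURATION, QUADRATIC_DURATION)
print(doubled, short_fit, long_fit)
```

Short cuts are barely penalized (3 s cuts go from 200 per batch to 181), while batches of 30 s cuts are halved from 20 to 10, which targets the long-utterance OOMs.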

pzelasko commented 1 year ago

> Sorry guys, it turns out I was inadvertently using a super-old version of lhotse (1.3), in my site-packages was a lhotse/ subdirectory that it was picking up in preference to the .egg files, I must have done a dev install or something. So this may not be a real issue.
>
> After updating lhotse, it's now batches with long utterances in them that are causing OOMs, as expected because the memory usage has a quadratic part.

Great! At least now I can rest assured there are no surprises in the current sampler implementation ;)

> One way to model this kind of thing would be to provide to the sampler a duration at which the quadratic part of the memory usage can be assumed to be the same as the linear part. So when enforcing the max-duration constraint, we could take the effective duration to be duration + duration^2 / quadratic_duration. I imagine this would be of the order of 15 to 40 seconds for the types of attention models we are using.

Makes sense, I'll make a proof of concept PR later.

pzelasko commented 1 year ago

@danpovey I have a working draft ready, but didn't try to train anything yet -- take a look if this is consistent with what you had in mind and try it out if you'd like, or I'll try later and circle back. https://github.com/lhotse-speech/lhotse/pull/950

pzelasko commented 1 year ago

Please see https://github.com/lhotse-speech/lhotse/pull/950#issuecomment-1407089168

It seems to work well!