pzelasko closed this 3 months ago
@lifeiteng care to try this one?
I tried this with the config below; training on LibriSpeech took the same time (21 minutes/epoch):
```python
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=200,
    shuffle=True,
    num_buckets=30,
    buffer_size=10000,
    shuffle_buffer_size=10000,
    drop_last=True,
    rank=0,
    world_size=1,
)
dataset = IterableDatasetWrapper(dataset=dataset, sampler=sampler)
dloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_size=None,
    persistent_workers=False,
    num_workers=8,
    pin_memory=False,
    worker_init_fn=make_worker_init_fn(
        rank=global_rank,
        world_size=world_size,
    ),
    prefetch_factor=40,
)
```
Interesting. How many GPUs? Can you also try increasing the buffer size to 50k? Otherwise maybe the batch duration is too low to notice a difference.
I observed a 10% speedup on a 2 GPU setup but need to investigate further.
8 A100 GPUs. I will try to increase the buffer size to 50K and report result tomorrow.
Aside from that, your max duration seems low for an A100. Try adding `quadratic_duration=15` to the sampler and you'll probably be able to increase max duration by 100-200 (but I'd expect you to be able to set it somewhere around at least 500-600 in the first place; are you using bf16/fp16?).
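For readers unfamiliar with the option: my understanding of `quadratic_duration=q` is that the sampler accounts each cut of duration `d` as `d + d²/q` when filling the `max_duration` budget, penalizing long cuts whose attention cost grows quadratically. The formula below is my reading of the docs, not code copied from Lhotse:

```python
def effective_duration(d: float, quadratic_duration: float) -> float:
    """Sketch of how quadratic_duration may penalize long cuts.

    A cut of duration d counts as d + d^2 / q toward max_duration,
    so buckets of long cuts produce smaller batches and you can raise
    max_duration without OOM-ing on the longest buckets.
    """
    return d + d * d / quadratic_duration

# With quadratic_duration=15, a 5 s cut costs about 6.7 s of the budget,
# while a 30 s cut costs 90 s, so batches of long cuts shrink quickly.
short_cost = effective_duration(5.0, 15.0)
long_cost = effective_duration(30.0, 15.0)
```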
> are you using bf16/fp16?
FP32 now
> Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max duration by 100-200
```python
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=300,
    shuffle=True,
    num_buckets=30,
    buffer_size=50000,
    shuffle_buffer_size=10000,
    quadratic_duration=15,
    drop_last=True,
    rank=0,
    world_size=1,
)
```
It took 31 minutes per epoch with this config.
Mmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, as it has to read more data into memory. Try precomputing duration bins by running `estimate_duration_bins` on your data and providing the output to the sampler's `duration_bins` argument. You can then try reverting the buffer size to the original setting and compare with and without sync buffers again.
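The idea behind precomputing duration bins is to pick bucket boundaries as roughly equal-mass quantiles of the cut duration distribution, so buckets fill evenly from the first step instead of the sampler having to observe the distribution at startup. A minimal pure-Python sketch of that idea (not Lhotse's actual implementation):

```python
def estimate_bins(durations: list[float], num_buckets: int) -> list[float]:
    """Pick num_buckets - 1 boundaries so each bucket holds ~equal cut mass.

    Sort the durations and take equally spaced quantiles as boundaries;
    with well-fitted bins, every bucket receives a similar share of cuts
    and yields mini-batches at a similar rate.
    """
    xs = sorted(durations)
    n = len(xs)
    return [xs[(i * n) // num_buckets] for i in range(1, num_buckets)]

# Example: uniform durations 1..100 s with 4 buckets gives boundaries
# near the 25th/50th/75th percentiles.
bins = estimate_bins([float(i) for i in range(1, 101)], num_buckets=4)
```

In practice you would run the real `estimate_duration_bins` on your training cuts once and pass the resulting list to the sampler's `duration_bins` argument.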
Just pushed a version that is better tested and supports both map-style and iterable-style datasets.
> Try precomputing duration bins by running `estimate_duration_bins` on your data and providing the output to the sampler's `duration_bins` argument.
| duration | bucket_buffer_size | sync_buffer | mins/epoch | estimate_duration_bins |
|---|---|---|---|---|
| 200 | 10k | False | 21 | False |
| 200 | 10k | True | 21 | False |
| 200 | 10k | False | 21 | True |
| 200 | 10k | True | 21 | True |
| 300 | 50k | False | 18 | False |
| 300 | 50k | True | 17 | False |
| 300 | 50k | False | 16 | True |
| 300 | 50k | True | 16 | True |
I've tested this change more thoroughly and I'm now confident it helps with the training speed. When training NeMo FastConformer RNNT+CTC ASR on a ~20k hours dataset with 16 GPUs I observed a 13% faster training step time when bucket selection is synchronized, everything else being the same. The validation WER of both runs is very close on the same number of steps, but the convergence is actually a bit quicker when we consider validation WER vs total training time.
On a separate experiment with 2 GPUs I observed an 8% speedup. I expect the speedup to grow with the size of distributed training, as the probability of hitting the slowest bucket on each training step grows with the number of GPUs. The speedup is also likely model-dependent (the bigger the variance of processing time per sequence length bucket, the greater the speedup).
Follow-up to #863 and #1309
This version seems to work as intended: it consistently picks the same buckets on each DDP rank. It depends on good `duration_bins` initialization (i.e., it has to be estimated on the actual training data to fit the duration distribution well) and a large enough `buffer_size`, so that all buckets are filled enough to yield at least one mini-batch most of the time. If it hits a non-ready bucket, it tries again with its neighbors. I'm determining what kind of speedup can be expected from this; I also need to add proper tests, and if I find it's good enough, I'll probably make it the default.