pzelasko closed this 3 months ago
@lifeiteng care to try this one?
I tried this with the config below; training on LibriSpeech took the same time (21 minutes/epoch):
```python
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=200,
    shuffle=True,
    num_buckets=30,
    buffer_size=10000,
    shuffle_buffer_size=10000,
    drop_last=True,
    rank=0,
    world_size=1,
)
dataset = IterableDatasetWrapper(dataset=dataset, sampler=sampler)
dloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_size=None,
    persistent_workers=False,
    num_workers=8,
    pin_memory=False,
    worker_init_fn=make_worker_init_fn(
        rank=global_rank,
        world_size=world_size,
    ),
    prefetch_factor=40,
)
```
Interesting. How many GPUs? Can you also try increasing the buffer size to 50k? Otherwise maybe the batch duration is too low to notice a difference.
I observed a 10% speedup on a 2 GPU setup but need to investigate further.
8 A100 GPUs. I will try to increase the buffer size to 50K and report result tomorrow.
Aside from that, your max duration seems low for an A100. Try adding `quadratic_duration=15` to the sampler and you'll probably be able to increase max duration by 100-200 (but I'd expect you to be able to set it somewhere around at least 500-600 in the first place; are you using bf16/fp16?).
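For readers unfamiliar with the option: my understanding of `quadratic_duration=q` is that the sampler accounts each cut of duration `d` as `d + d²/q` when filling the `max_duration` budget, penalizing long cuts whose attention cost grows quadratically. The formula below is my reading of the docs, not code copied from Lhotse:

```python
def effective_duration(d: float, quadratic_duration: float) -> float:
    """Sketch of how quadratic_duration may penalize long cuts.

    A cut of duration d counts as d + d^2 / q toward max_duration,
    so buckets of long cuts produce smaller batches and you can raise
    max_duration without OOM-ing on the longest buckets.
    """
    return d + d * d / quadratic_duration

# With quadratic_duration=15, a 5 s cut costs about 6.7 s of the budget,
# while a 30 s cut costs 90 s, so batches of long cuts shrink quickly.
short_cost = effective_duration(5.0, 15.0)
long_cost = effective_duration(30.0, 15.0)
```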
> are you using bf16/fp16?
FP32 now
> Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max duration by 100-200
```python
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=300,
    shuffle=True,
    num_buckets=30,
    buffer_size=50000,
    shuffle_buffer_size=10000,
    quadratic_duration=15,
    drop_last=True,
    rank=0,
    world_size=1,
)
```
It took 31 minutes per epoch with this config.
Mmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, as it has to read more data into memory. Try precomputing duration bins by running `estimate_duration_bins` on your data and providing the output to the sampler's `duration_bins` argument. You can then try reverting the buffer size to the original setting and compare with and without sync buffers again.
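The idea behind precomputing duration bins is to pick bucket boundaries as roughly equal-mass quantiles of the cut duration distribution, so buckets fill evenly from the first step instead of the sampler having to observe the distribution at startup. A minimal pure-Python sketch of that idea (not Lhotse's actual implementation):

```python
def estimate_bins(durations: list[float], num_buckets: int) -> list[float]:
    """Pick num_buckets - 1 boundaries so each bucket holds ~equal cut mass.

    Sort the durations and take equally spaced quantiles as boundaries;
    with well-fitted bins, every bucket receives a similar share of cuts
    and yields mini-batches at a similar rate.
    """
    xs = sorted(durations)
    n = len(xs)
    return [xs[(i * n) // num_buckets] for i in range(1, num_buckets)]

# Example: uniform durations 1..100 s with 4 buckets gives boundaries
# near the 25th/50th/75th percentiles.
bins = estimate_bins([float(i) for i in range(1, 101)], num_buckets=4)
```

In practice you would run the real `estimate_duration_bins` on your training cuts once and pass the resulting list to the sampler's `duration_bins` argument.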
Just pushed a version that is better tested and supports both map-style and iterable-style datasets.
> Try precomputing duration bins by running `estimate_duration_bins` on your data and providing the output to the sampler's `duration_bins` argument.
| duration | bucket_buffer_size | sync_buffer | mins/epoch | estimate_duration_bins |
|---|---|---|---|---|
| 200 | 10k | False | 21 | False |
| 200 | 10k | True | 21 | False |
| 200 | 10k | False | 21 | True |
| 200 | 10k | True | 21 | True |
| 300 | 50k | False | 18 | False |
| 300 | 50k | True | 17 | False |
| 300 | 50k | False | 16 | True |
| 300 | 50k | True | 16 | True |
I've tested this change more thoroughly and I'm now confident it helps with the training speed. When training NeMo FastConformer RNNT+CTC ASR on a ~20k hours dataset with 16 GPUs I observed a 13% faster training step time when bucket selection is synchronized, everything else being the same. The validation WER of both runs is very close on the same number of steps, but the convergence is actually a bit quicker when we consider validation WER vs total training time.
On a separate experiment with 2 GPUs I observed an 8% speedup. I expect the speedup to grow with the size of distributed training, as the probability of hitting the slowest bucket on each training step grows with the number of GPUs. The speedup is also likely model-dependent (the bigger the variance of processing time per sequence length bucket, the greater the speedup).
Follow-up to #863 and #1309
This version seems to work as intended: it consistently picks the same buckets on each DDP rank. It depends on good `duration_bins` initialization (i.e., it has to be estimated on the actual training data to fit the duration distribution well) and a large enough `buffer_size`, so that all buckets are filled enough to yield at least one mini-batch most of the time. If it hits a non-ready bucket, it tries again with its neighbors. I'm determining what kind of speedup can be expected from this; I also need to add proper tests, and if I find it's good enough, I'll probably make it the default.