NCCL error when speed perturbation is enabled

Training with speed perturbation enabled sometimes fails with a CUDA device-side assert, followed by an NCCL error:

/tmp/pip-req-build-l8enafal/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<signed long, IndexType>, int, int, IndexType, signed long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [1,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/pip-req-build-l8enafal/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<signed long, IndexType>, int, int, IndexType, signed long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [1,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (hundreds of similar lines) ...
/tmp/pip-req-build-l8enafal/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<signed long, IndexType>, int, int, IndexType, signed long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [3,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "train.py", line 422, in <module>
main(args)
File "train.py", line 385, in main
args=args)
File "train.py", line 205, in train
t_predictions_t = greedy_decoder.decode(t_audio_signal_t, t_a_sig_length_t)
File "/workspace/jasper/decoders.py", line 84, in decode
sentence = self._greedy_decode(inseq, logitlen)
File "/workspace/jasper/decoders.py", line 115, in _greedy_decode
k = k.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
training_batch_WER: 1.0
Prediction: dd
Reference: who lying on her back was scratching his nose
Loss@Step: 24 ::::::: 821.5457763671875
Step time: 3.136981248855591 seconds
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
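
For anyone triaging this: the assertion `srcIndex < srcSelectDimSize` comes from an index-select kernel, which suggests that some index (for example a label fed to a lookup table) falls outside the valid range. Below is a minimal sketch of a host-side bounds check, assuming the indices can be copied to the CPU as a flat list of ints. The helper `find_out_of_range` and the example values are hypothetical and not part of the repository.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(indices: Iterable[int], size: int) -> List[Tuple[int, int]]:
    """Return (position, value) pairs for every index outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


# Hypothetical example: a lookup table with 29 rows and one bad index (29).
bad = find_out_of_range([0, 5, 28, 29], size=29)
print(bad)  # [(3, 29)]
```

With a PyTorch tensor, the input could be produced with `tensor.flatten().tolist()` just before the suspect lookup. A non-empty result would point at the offending batch.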

To reproduce on commit https://github.com/ryanleary/mlperf-rnnt-ref/commit/a1112ec:

sed -i 's|speed_perturbation = .*|speed_perturbation = true|' configs/rnnt.toml
python3 -m multiproc --nproc_per_node=8 train.py --batch_size=8 --eval_batch_size=2 --num_epochs=100 --output_dir=/results --model_toml=configs/rnnt.toml --lr=0.011 --seed=6 --optimizer=adam --dataset_dir=/datasets/LibriSpeech --val_manifest=/datasets/LibriSpeech/librispeech-dev-clean-wav.json --train_manifest=/datasets/LibriSpeech/librispeech-train-clean-100-wav.json,/datasets/LibriSpeech/librispeech-train-clean-360-wav.json,/datasets/LibriSpeech/librispeech-train-other-500-wav.json --weight_decay=0.001 --save_freq=10 --eval_freq=1000 --train_freq=25 --gradient_accumulation_steps=1 --fp16 --cudnn --tb_path /home/samgd/logs/rnnt/repro/full_tr_spec_drop_spd/LR0.011_BS8_adam_ACC1_a1112ec/18:56:411575655001 2>&1 | sed '/UserWarning/d'
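
Because CUDA kernels launch asynchronously, the Python traceback above may point at the first synchronizing call (`k.item()`) rather than the operation that actually triggered the assert. Below is a minimal sketch of forcing synchronous launches, assuming it is placed at the very top of the entry script before CUDA is initialized. Exporting the variable in the shell before the command should have the same effect.

```python
import os

# Must be set before CUDA is initialized (safest: before importing torch).
# Kernels then run synchronously, so the traceback should point at the
# failing operation, at the cost of slower training.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```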

NCCL error when speed perturbation #8