Closed mwawrzos closed 4 years ago
Training with speed perturbation enabled sometimes raises an error:
/tmp/pip-req-build-l8enafal/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<signed long, IndexType>, int, int, IndexType, signed long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [1,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/pip-req-build-l8enafal/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<signed long, IndexType>, int, int, IndexType, signed long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [1,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (hundreds of similar lines) ...
/tmp/pip-req-build-l8enafal/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<signed long, IndexType>, int, int, IndexType, signed long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [3,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "train.py", line 422, in <module>
    main(args)
  File "train.py", line 385, in main
    args=args)
  File "train.py", line 205, in train
    t_predictions_t = greedy_decoder.decode(t_audio_signal_t, t_a_sig_length_t)
  File "/workspace/jasper/decoders.py", line 84, in decode
    sentence = self._greedy_decode(inseq, logitlen)
  File "/workspace/jasper/decoders.py", line 115, in _greedy_decode
    k = k.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
training_batch_WER: 1.0
Prediction: dd
Reference: who lying on her back was scratching his nose
Loss@Step: 24 ::::::: 821.5457763671875
Step time: 3.136981248855591 seconds
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
Repro on commit https://github.com/ryanleary/mlperf-rnnt-ref/commit/a1112ec:
sed -i 's|speed_perturbation = .*|speed_perturbation = true|' configs/rnnt.toml
python3 -m multiproc --nproc_per_node=8 train.py \
  --batch_size=8 \
  --eval_batch_size=2 \
  --num_epochs=100 \
  --output_dir=/results \
  --model_toml=configs/rnnt.toml \
  --lr=0.011 \
  --seed=6 \
  --optimizer=adam \
  --dataset_dir=/datasets/LibriSpeech \
  --val_manifest=/datasets/LibriSpeech/librispeech-dev-clean-wav.json \
  --train_manifest=/datasets/LibriSpeech/librispeech-train-clean-100-wav.json,/datasets/LibriSpeech/librispeech-train-clean-360-wav.json,/datasets/LibriSpeech/librispeech-train-other-500-wav.json \
  --weight_decay=0.001 \
  --save_freq=10 \
  --eval_freq=1000 \
  --train_freq=25 \
  --gradient_accumulation_steps=1 \
  --fp16 \
  --cudnn \
  --tb_path /home/samgd/logs/rnnt/repro/full_tr_spec_drop_spd/LR0.011_BS8_adam_ACC1_a1112ec/18:56:411575655001 \
  2>&1 | sed '/UserWarning/d'
The model was incorrectly constructed here: https://github.com/ryanleary/mlperf-rnnt-ref/commit/c25fff6be370317d0c43fcc7303bbfa8eb405319#diff-a343f24f3d4069dfa06ca129b1ca8d0eR253
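For anyone hitting the same assertion: `srcIndex < srcSelectDimSize` fails when an index passed to an index-select is not smaller than the size of the dimension being selected from. Because CUDA kernels run asynchronously, the Python traceback can point at a later synchronizing call (here `k.item()`) and not at the lookup that actually failed. Below is a rough sketch of a host-side check that finds offending indices before they reach the GPU. The helper name `find_out_of_range`, the sample batch, and the table size of 29 are all made up for illustration; nothing here is taken from the repository.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(
    indices: Iterable[Iterable[int]], dim_size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, col, value) for every index outside 0 <= value < dim_size."""
    bad = []
    for r, row in enumerate(indices):
        for c, value in enumerate(row):
            if not 0 <= value < dim_size:
                bad.append((r, c, value))
    return bad


# Hypothetical batch of ids looked up in a hypothetical table with 29 rows.
batch = [[3, 28, 0], [29, 5, 1]]
print(find_out_of_range(batch, dim_size=29))  # [(1, 0, 29)]
```

With PyTorch tensors, the same helper can be fed `tensor.tolist()` for a 2-D integer tensor. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` for a debugging run should also make the error surface at the call that launched the failing kernel.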