ryanleary / mlperf-rnnt-ref


CUDNN warnings when BatchNormalization is used #6

Closed mwawrzos closed 4 years ago

mwawrzos commented 4 years ago

The warning is a problem when a full training run is executed, as it inflates the log to roughly 400 MB instead of 2 MB.

The warning proposes a fix: "To compact weights again call flatten_parameters()." I am not sure yet where to call it.

Full warning below:

../aten/src/ATen/native/cudnn/RNN.cpp:1278: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
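For reference, a minimal sketch of where the call could go, assuming the model wraps a standard torch.nn.LSTM. The Encoder class, layer sizes, and batch_first layout below are illustrative only, not the repo's actual modules:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Hypothetical LSTM wrapper for illustration; sizes are placeholders.
    def __init__(self, input_size=240, hidden_size=1024, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

    def forward(self, x):
        # Re-compact the cuDNN weight buffer before the call, as the warning
        # suggests. It is a no-op when the weights are already contiguous and
        # silences the per-step warning otherwise.
        self.lstm.flatten_parameters()
        output, _ = self.lstm(x)
        return output
```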

Repro for commit https://github.com/ryanleary/mlperf-rnnt-ref/commit/4082f086ec4834886cceb927dbb1454eca44c68d:

train.py --batch_size=16 --eval_batch_size=4 --num_epochs=1000 --output_dir=/results --model_toml=configs/rnnt_bn.toml --lr=0.02 --seed=6 --optimizer=novograd --dataset_dir=/datasets/LibriSpeech --val_manifest=/datasets/LibriSpeech/librispeech-dev-clean-wav.json --train_manifest=/datasets/LibriSpeech/librispeech-train-clean-100-wav.json,/datasets/LibriSpeech/librispeech-train-clean-360-wav.json,/datasets/LibriSpeech/librispeech-train-other-500-wav.json --weight_decay=0.001 --save_freq=10 --eval_freq=1000 --train_freq=25 --gradient_accumulation_steps=4 --fp16 --cudnn
mwawrzos commented 4 years ago

According to a post on StackOverflow, batch normalization will probably not be used.