Closed by svenha 6 years ago
I also had to decrease num-chunk-per-minibatch from 288 to 96 in run-chain.sh (144 was still too big). With these two changes, run-chain.sh finished after 10 days (tdnn_f alone took 6 of them).
But it's worth it:
%WER 11.67 [ 20564 / 176256, 3497 ins, 2716 del, 14351 sub ] tdnn_sp/decode_test/wer_9_0.5
%WER 13.27 [ 23387 / 176256, 3544 ins, 3313 del, 16530 sub ] tdnn_250/decode_test/wer_8_1.0
%WER 8.60 [ 15166 / 176256, 2617 ins, 2104 del, 10445 sub ] tdnn_f/decode_test/wer_10_1.0
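For anyone comparing these lines: the percentage is just (ins + del + sub) divided by the number of reference words. Here is a small sketch that recomputes it from the counts; `wer_from_counts` is a helper name I made up for illustration, not part of Kaldi.

```shell
# Recompute %WER from the counts in a Kaldi scoring line:
#   WER = 100 * (ins + del + sub) / reference_words
# wer_from_counts is a hypothetical helper, not a Kaldi tool.
wer_from_counts() {
  # args: ins del sub ref_words
  awk -v i="$1" -v d="$2" -v s="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", 100 * (i + d + s) / n }'
}

wer_from_counts 3497 2716 14351 176256   # tdnn_sp  -> 11.67
wer_from_counts 3544 3313 16530 176256   # tdnn_250 -> 13.27
wer_from_counts 2617 2104 10445 176256   # tdnn_f   -> 8.60
```

All three results match the %WER values reported above, so the counts and percentages are consistent.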
I am using an NVIDIA GTX 1080 Ti GPU for my model builds:
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
I retrained a German model, but it runs out of GPU RAM in:
nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/nnet3_chain/tdnn_250/cache.1 --write-cache=exp/nnet3_chain/tdnn_250/cache.2 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=1 "nnet3-am-copy --raw=true --learning-rate=0.000998730064726 --scale=0.980025398705 exp/nnet3_chain/tdnn_250/1.mdl - |" exp/nnet3_chain/tdnn_250/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=2 ark:exp/nnet3_chain/tdnn_250/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=1 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=512 ark:- ark:- |" exp/nnet3_chain/tdnn_250/2.1.raw
I have 4 GB of GPU RAM, which is used only by Kaldi, in exclusive mode. This was enough back in June. What are your experiences or recommendations?
The error message contains this:
ERROR (nnet3-chain-train[5.5.95~1-4bdb]:AllocateNewRegion():cu-allocator.cc:513) Failed to allocate a memory region of 356515840 bytes. Possibly smaller minibatch size would help. Memory info: free:190M, used:3844M, total:4035M, free/total:0.0473123
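To put the numbers in that message side by side, here is a rough sketch that pulls the requested size and the free memory out of the log line and converts the request to the same unit. It assumes the "M" in the memory info means MiB, and the variable names are only for illustration.

```shell
# The relevant part of the error line quoted above.
log='Failed to allocate a memory region of 356515840 bytes. Possibly smaller minibatch size would help. Memory info: free:190M, used:3844M, total:4035M, free/total:0.0473123'

# Extract the requested size (bytes) and the free memory (assumed to be MiB).
req_bytes=$(printf '%s\n' "$log" | sed -n 's/.*memory region of \([0-9][0-9]*\) bytes.*/\1/p')
free_mb=$(printf '%s\n' "$log" | sed -n 's/.*free:\([0-9][0-9]*\)M.*/\1/p')

# Convert the request to MiB; 356515840 is an exact multiple of 1024*1024.
req_mb=$((req_bytes / 1024 / 1024))

echo "requested: ${req_mb}M, free: ${free_mb}M"   # requested: 340M, free: 190M
```

Under that assumption the failed request (340M) is well above what was still free (190M), which fits the hint in the message about a smaller minibatch size.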
So, should I retry with --minibatch-size=384 (instead of 512)? With this value the step completes, but I would probably have to rerun all steps to be sure.
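The trial-and-error behind this (512 failed, 384 worked) could be sketched as a small loop that tries candidate sizes from largest to smallest and stops at the first one that succeeds. Both `try_sizes` and `fake_train_step` are hypothetical names for illustration. In a real run the stand-in would be replaced by the actual training command with the size passed to --minibatch-size.

```shell
# try_sizes CMD SIZE...: run CMD with each size in turn and print the
# first size for which CMD exits successfully. Hypothetical helper.
try_sizes() {
  cmd=$1; shift
  for size in "$@"; do
    if "$cmd" "$size"; then
      echo "$size"
      return 0
    fi
  done
  return 1   # no candidate size worked
}

# Stand-in for one training step: pretends that anything above 384
# runs out of GPU memory. This is an assumption for the demo only.
fake_train_step() { [ "$1" -le 384 ]; }

try_sizes fake_train_step 512 384 256   # prints 384
```

A size that gets one step through does not guarantee that later iterations fit, so this only finds a starting point.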