kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Out of memory when running egs/wsj/s5/run.sh #4819

Closed: quancs closed this issue 1 year ago

quancs commented 1 year ago

I am using one V100 GPU with 32 GB of memory, and it seems that 32 GB is not enough for the training. How can I reduce the memory used by the training process?

nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2_online_cmn/tdnn1i_sp/cache.33 --read-cache=exp/chain2_online_cmn/tdnn1i_sp/cache.32 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.25 --srand=0 'nnet3-copy --learning-rate=0.0013510133026588835 exp/chain2_online_cmn/tdnn1i_sp/32.raw - |' exp/chain2_online_cmn/tdnn1i_sp/egs/misc 'ark:nnet3-chain-copy-egs  --frame-shift=0 scp:exp/chain2_online_cmn/tdnn1i_sp/egs/train.4.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=32 ark:- ark:- | nnet3-chain-merge-egs  --minibatch-size=128,64 ark:- ark:-|' exp/chain2_online_cmn/tdnn1i_sp/33.2.raw
WARNING (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 8 GPUs
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): Tesla V100-PCIE-32GB        free:28113M, used:4397M, total:32510M, free/total:0.864752
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(1): Tesla V100-PCIE-32GB        free:28118M, used:4392M, total:32510M, free/total:0.864906
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(2): Tesla V100-PCIE-32GB        free:31661M, used:849M, total:32510M, free/total:0.973886
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(3): Tesla V100-PCIE-32GB        free:31661M, used:849M, total:32510M, free/total:0.973886
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(4): Tesla V100-PCIE-32GB        free:31661M, used:849M, total:32510M, free/total:0.973886
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(5): Tesla V100-PCIE-32GB        free:31657M, used:853M, total:32510M, free/total:0.973762
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(6): Tesla V100-PCIE-32GB        free:31657M, used:853M, total:32510M, free/total:0.973762
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(7): Tesla V100-PCIE-32GB        free:31619M, used:891M, total:32510M, free/total:0.972594
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:501) Device: 2, mem_ratio: 0.973886
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuId():cu-device.cc:382) Trying to select device: 2
LOG (nnet3-chain-train2[5.5.0~1-be22]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 2 free mem ratio: 0.973886
LOG (nnet3-chain-train2[5.5.0~1-be22]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [2]: Tesla V100-PCIE-32GB free:31466M, used:1044M, total:32510M, free/total:0.967887 version 7.0
nnet3-copy --learning-rate=0.0013510133026588835 exp/chain2_online_cmn/tdnn1i_sp/32.raw -
LOG (nnet3-chain-train2[5.5.0~1-be22]:PrintMemoryUsage():cu-allocator.cc:340) Memory usage: 0/0 bytes currently allocated/total-held; 0/0 blocks currently allocated/free; largest free/allocated block sizes are 0/0; time taken total/cudaMalloc is 0/0.0356741, synchronized the GPU 0 times out of 0 frees; device memory info: free:15732M, used:16778M, total:32510M, free/total:0.483924maximum allocated: 0current allocated: 0
ERROR (nnet3-chain-train2[5.5.0~1-be22]:AllocateNewRegion():cu-allocator.cc:491) Failed to allocate a memory region of 16498294784 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:31466M, used:1044M, total:32510M, free/total:0.967887 CUDA error: 'out of memory'

[ Stack-Trace: ]
/opt/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7fc938804722]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x559ad140bfe3]
/opt/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMemoryAllocator::AllocateNewRegion(unsigned long)+0x46f) [0x7fc938f98cd3]
/opt/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMemoryAllocator::MallocPitch(unsigned long, unsigned long, unsigned long*)+0x4b4) [0x7fc938f995d0]
/opt/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrix<float>::Resize(int, int, kaldi::MatrixResizeType, kaldi::MatrixStrideType)+0x187) [0x7fc938f56239]
/opt/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrix<float>::Swap(kaldi::Matrix<float>*)+0x6e) [0x7fc938f57622]
/opt/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrix<float>::Read(std::istream&, bool)+0x5b) [0x7fc938f577c7]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::FixedAffineComponent::Read(std::istream&, bool)+0xae) [0x7fc93a5aa1ec]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Component::ReadNew(std::istream&, bool)+0xc4) [0x7fc93a5956ac]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Nnet::Read(std::istream&, bool)+0xca3) [0x7fc93a620db1]
nnet3-chain-train2(main+0x5d1) [0x559ad140b53b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fc9378dbc87]
nnet3-chain-train2(_start+0x2a) [0x559ad140ae8a]

WARNING (nnet3-chain-train2[5.5.0~1-be22]:Close():kaldi-io.cc:515) Pipe nnet3-copy --learning-rate=0.0013510133026588835 exp/chain2_online_cmn/tdnn1i_sp/32.raw - | had nonzero return status 36096
kaldi::KaldiFatalError
# Accounting: time=10 threads=1
# Ended (code 255) at Wed Jan 18 10:17:28 UTC 2023, elapsed time 10 seconds
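
The WARNING at the top of the log ("Not in compute-exclusive mode"), together with the PrintMemoryUsage line (used:16778M on a device that showed only 1044M used at selection time), suggests that another job grabbed roughly 16 GB on GPU 2 between device selection and the first large allocation. In other words, the failure appears to come from sharing the GPU rather than from the model genuinely needing more than 32 GB. A quick way to check for this kind of sharing (these are standard nvidia-smi queries, not part of the recipe) is:

# Show compute mode and current memory usage per GPU
# ("Default" compute mode means several processes may share one device).
nvidia-smi --query-gpu=index,name,compute_mode,memory.used,memory.total --format=csv

# The plain invocation also lists every compute process and its memory
# in the Processes table at the bottom of the output.
nvidia-smi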
quancs commented 1 year ago

Solution: switch the GPUs to exclusive mode (nvidia-smi -c 3) and use the option --use-gpu=wait with scripts like steps/nnet3/chain/train.py.
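
A minimal sketch of applying that fix, assuming root access for nvidia-smi and that your recipe forwards the --use-gpu option to the training binaries (the chain2 scripts used in this recipe may expose it under a slightly different option name than steps/nnet3/chain/train.py, so check the script's options; the "..." stands for the rest of your usual arguments):

# Put every GPU into EXCLUSIVE_PROCESS compute mode so at most one process can
# hold a context per device; add "-i <index>" to restrict this to a single GPU.
sudo nvidia-smi -c 3

# Re-run training with --use-gpu=wait so jobs queue for a free device instead
# of aborting when all GPUs are busy (option name as given in the error message).
steps/nnet3/chain/train.py --use-gpu=wait ...

Note that exclusive mode does not reduce the memory the training itself needs; it prevents other jobs from landing on the same GPU, which is what caused the allocation failure here, and --use-gpu=wait makes jobs wait for a free device instead of crashing.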