modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate #902

Closed. OKORKO closed this issue 7 months ago.

OKORKO commented 1 year ago

OS: Ubuntu 20.04
Python/C++ Version: Python 3.9
Package Version: torch-1.13.1+cu117, torchaudio-0.13.1+cu117, modelscope-1.8., funasr-0.7.4 (pip list)
Model: speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Command: python finetune.py
Details: Fine-tuning the Paraformer-large model on a single machine with 4 GPUs, with batch_bins set to only 50. Training starts without problems, but after a while the GPUs run out of memory. The error log is below. I have already added os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128" together with:

```python
if hasattr(torch.cuda, 'empty_cache'):
    torch.cuda.empty_cache()

trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
```
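(A side note on the allocator setting: PYTORCH_CUDA_ALLOC_CONF is only read when the CUDA caching allocator initializes, so it has to be in the environment before the first CUDA allocation, ideally before torch is imported. Below is a minimal sketch of that ordering; it illustrates placement only and is not a confirmed fix for this issue.)

```python
import os

# Set the allocator config before torch initializes CUDA; if it is set after
# the first allocation, max_split_size_mb may be ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the allocator config is in place

if torch.cuda.is_available():
    # empty_cache() returns cached, unused blocks to the driver; it does not
    # free memory still held by live tensors.
    torch.cuda.empty_cache()
```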

```
[chushaobo] 2023-08-29 13:35:38,658 (build_trainer:733) INFO: 1epoch:train:17551-17600batch:17600num_updates: iter_time=1.497e-04, forward_time=0.158, loss_att=0.256, acc=0.808, loss_pre=0.076, loss=0.332, backward_time=0.618, optim_step_time=0.097, optim0_lr0=2.929e-05, train_time=1.072
[chushaobo] 2023-08-29 13:36:28,588 (build_trainer:733) INFO: 1epoch:train:17601-17650batch:17650num_updates: iter_time=1.484e-04, forward_time=0.161, loss_att=0.253, acc=0.819, loss_pre=0.092, loss=0.345, backward_time=0.629, optim_step_time=0.099, optim0_lr0=2.938e-05, train_time=0.999
[chushaobo] 2023-08-29 13:37:18,080 (build_trainer:733) INFO: 1epoch:train:17651-17700batch:17700num_updates: iter_time=1.495e-04, forward_time=0.153, loss_att=0.283, acc=0.761, loss_pre=0.091, loss=0.375, backward_time=0.625, optim_step_time=0.097, optim0_lr0=2.946e-05, train_time=0.990
[chushaobo] 2023-08-29 13:38:07,060 (build_trainer:733) INFO: 1epoch:train:17701-17750batch:17750num_updates: iter_time=1.418e-04, forward_time=0.151, loss_att=0.336, acc=0.802, loss_pre=0.083, loss=0.419, backward_time=0.618, optim_step_time=0.096, optim0_lr0=2.954e-05, train_time=0.980
[chushaobo] 2023-08-29 13:38:56,473 (build_trainer:733) INFO: 1epoch:train:17751-17800batch:17800num_updates: iter_time=1.426e-04, forward_time=0.157, loss_att=0.276, acc=0.803, loss_pre=0.069, loss=0.345, backward_time=0.625, optim_step_time=0.098, optim0_lr0=2.963e-05, train_time=0.988
Traceback (most recent call last):
  File "/home/chushaobo/project/FunASR/mytrain.py", line 107, in <module>
    modelscope_finetune(params)
  File "/home/chushaobo/project/FunASR/mytrain.py", line 88, in modelscope_finetune
    trainer.train()
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/modelscope/trainers/audio/asr_trainer.py", line 168, in train
    self.trainer.run()
  File "/home/chushaobo/project/FunASR/funasr/build_utils/build_trainer.py", line 266, in run
    all_steps_are_invalid, max_update_stop = self.train_one_epoch(
  File "/home/chushaobo/project/FunASR/funasr/build_utils/build_trainer.py", line 564, in train_one_epoch
    retval = model(batch)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1003, in _run_ddp_forward
    return module_to_run(*inputs, **kwargs)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chushaobo/project/FunASR/funasr/models/e2e_asr_paraformer.py", line 183, in forward
    encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
  File "/home/chushaobo/project/FunASR/funasr/models/e2e_asr_paraformer.py", line 325, in encode
    encoder_out, encoder_out_lens, _ = self.encoder(feats, feats_lengths)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chushaobo/project/FunASR/funasr/models/encoder/sanm_encoder.py", line 337, in forward
    encoder_outs = self.encoders(xs_pad, masks)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chushaobo/project/FunASR/funasr/modules/repeat.py", line 32, in forward
    args = m(*args)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chushaobo/project/FunASR/funasr/models/encoder/sanm_encoder.py", line 101, in forward
    self.self_attn(x, mask, mask_shfit_chunk=mask_shfit_chunk, mask_att_chunk_encoder=mask_att_chunk_encoder)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chushaobo/project/FunASR/funasr/modules/attention.py", line 455, in forward
    scores = torch.matmul(q_h, k_h.transpose(-2, -1))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 10.76 GiB total capacity; 9.21 GiB already allocated; 5.81 MiB free; 9.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 493326 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 493327 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 493328 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 493325) of binary: /home/chushaobo/anaconda3/envs/funasr/bin/python
Traceback (most recent call last):
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chushaobo/anaconda3/envs/funasr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
mytrain.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-29_13:39:43
  host      : chushaobo
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 493325)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
hnluo commented 1 year ago

Try running egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/finetune.py with dataset_type = "large" and batch_bins = 60000.
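For reference, a rough sketch of how those two settings could be passed through the same build_trainer call used in the original report. Only dataset_type and batch_bins come from the suggestion above; the model id and the data_dir / work_dir fields are illustrative placeholders, not the confirmed contents of finetune.py.

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer

# Assumed kwargs layout modeled on the egs_modelscope finetune example;
# every field except dataset_type and batch_bins is a placeholder.
kwargs = dict(
    model="speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",  # model name as in the issue; the ModelScope hub id may need a namespace prefix
    data_dir="./example_data",   # hypothetical directory with the prepared train/dev data
    work_dir="./checkpoint",     # hypothetical output directory for fine-tuned checkpoints
    dataset_type="large",        # as suggested above
    batch_bins=60000,            # as suggested above, instead of batch_bins=50
)

trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
```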