Single-node multi-GPU training: memory is allocated only on GPU 0, the other GPUs show no memory usage

modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.

https://www.funasr.com


Single-node multi-GPU training: memory is allocated only on GPU 0, the other GPUs show no memory usage #615

Closed JianweiSun007 closed 1 year ago

JianweiSun007 commented 1 year ago

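As the title says: I ran single-node multi-GPU training with the example in FunASR/egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/, using the command CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node 2 --use_env finetune.py. It failed with a GPU memory error. While debugging, I found that all of the memory usage was concentrated on GPU 0 and the other GPU was not being used. What causes this?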
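A general PyTorch note that may be relevant here (it is not confirmed as the cause in this thread): with --use_env, torch.distributed.launch passes the per-process GPU index through the LOCAL_RANK environment variable instead of a --local_rank argument, and torchrun does the same. If a training script never selects a device from that value, every process defaults to the first visible GPU. The sketch below shows one way to read LOCAL_RANK and bind the process to its own GPU. The helper names local_rank_from_env and bind_process_to_gpu are hypothetical and are not part of FunASR or finetune.py.

```python
import os


def local_rank_from_env(env=None):
    """Read the per-process GPU index that the launcher exports as LOCAL_RANK.

    Hypothetical helper for illustration. Falls back to 0 when LOCAL_RANK is
    not set (for example, when the script is run without a launcher).
    """
    env = os.environ if env is None else env
    rank = int(env.get("LOCAL_RANK", "0"))

    # Device indices are relative to CUDA_VISIBLE_DEVICES, so with
    # CUDA_VISIBLE_DEVICES=1,2 the valid local ranks are 0 and 1.
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible:
        n_visible = len([d for d in visible.split(",") if d.strip()])
        if rank >= n_visible:
            raise ValueError(
                f"LOCAL_RANK={rank} but only {n_visible} GPU(s) are visible"
            )
    return rank


def bind_process_to_gpu():
    """Make this process use its own GPU before any CUDA tensors are created."""
    import torch  # imported lazily so the helper above works without PyTorch

    rank = local_rank_from_env()
    torch.cuda.set_device(rank)
    return torch.device("cuda", rank)
```

Calling bind_process_to_gpu() early in the script, before the model is moved to the GPU or the process group is initialized, is the usual place for this step. Printing the returned device from each process is also a quick way to check whether both ranks really end up on the same card.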

hnluo commented 1 year ago

Please ask your question in the following format:
OS: [e.g. linux]
Python/C++ Version:
Package Version: pytorch, torchaudio, modelscope, funasr version (pip list)
Model:
Command:
Details:
Error log:
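The OS, Python version and package version fields of this template can be filled in with a short script instead of reading through pip list by hand. This is a minimal sketch using only the standard library. The package list is an assumption taken from the template, with "torch" used as the distribution name for PyTorch.

```python
import platform
import sys
from importlib import metadata

# Distribution names assumed from the template above; adjust as needed.
PACKAGES = ("torch", "torchaudio", "modelscope", "funasr")


def collect_report(packages=PACKAGES):
    """Return a text report of the OS, Python version and package versions."""
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python Version: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(collect_report())
```

The Model, Command, Details and Error log fields still need to be written by hand.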

apple2333cream commented 7 months ago

I ran into this problem too. Switching to torchrun made no difference: everything still ends up on GPU 0. Has anyone managed to solve this?