modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Problem with multi-GPU distributed training #1145

Closed xyx361100238 closed 7 months ago

xyx361100238 commented 10 months ago

Hi, I'm using the officially provided script (finetune.py):

import os

from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer

from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.modelscope_param import modelscope_args


def modelscope_finetune(params):
    if not os.path.exists(params.output_dir):
        os.makedirs(params.output_dir, exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params.data_path)
    kwargs = dict(
        model=params.model,
        data_dir=ds_dict,
        dataset_type=params.dataset_type,
        work_dir=params.output_dir,
        batch_bins=params.batch_bins,
        max_epoch=params.max_epoch,
        lr=params.lr,
        mate_params=params.param_dict)
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch")
    params.output_dir = "./tmp"          # model save path
    params.data_path = "001-train_test"  # data path
    params.dataset_type = "small"        # use "small" for small datasets; use "large" if the data exceeds 1000 hours
    params.batch_bins = 200              # batch size; with dataset_type="small" the unit is fbank feature frames, with dataset_type="large" it is milliseconds
    params.max_epoch = 20                # maximum number of training epochs
    params.lr = 0.00005                  # learning rate
    init_param = []                      # initial model path; by default the modelscope model is loaded for initialization, e.g. ["checkpoint/20epoch.pb"]
    freeze_param = []                    # model parameters to freeze, e.g. ["encoder"]
    ignore_init_mismatch = True          # whether to ignore mismatches when initializing model parameters
    use_lora = False                     # whether to fine-tune with LoRA
    params.param_dict = {"init_param": init_param, "freeze_param": freeze_param, "ignore_init_mismatch": ignore_init_mismatch}
    if use_lora:
        enable_lora = True
        lora_bias = "all"
        lora_params = {"lora_list": ['q', 'v'], "lora_rank": 8, "lora_alpha": 16, "lora_dropout": 0.1}
        lora_config = {"enable_lora": enable_lora, "lora_bias": lora_bias, "lora_params": lora_params}
        params.param_dict.update(lora_config)

    modelscope_finetune(params)

Running on a single GPU works fine: python3 finetune.py

But following the tutorial with CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node 2 finetune.py (due to version issues, the command I actually ran was CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 finetune.py), all of the model loading is concentrated on GPU 0, which leads to OOM. How can this be solved? Thanks.
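Not part of the original report, but a minimal diagnostic sketch (assuming the trainer ultimately relies on torch.distributed; the file name check_ranks.py is hypothetical). Launching it with torchrun --nproc_per_node 2 check_ranks.py should print a different CUDA device per rank; if every rank reports cuda:0, the processes are not being pinned to their local GPU, which matches the OOM symptom described above.

# check_ranks.py -- diagnostic sketch, not from the original thread
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun for each process
    torch.cuda.set_device(local_rank)                   # without this, every rank defaults to GPU 0
    dist.init_process_group(backend="nccl")             # reads MASTER_ADDR/PORT etc. from the torchrun env
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{torch.cuda.current_device()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()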

xyx361100238 commented 10 months ago

Hi, running distributed training with the latest refactored code, it still ends up running on a single card.

dsh54054 commented 7 months ago

> Hi, running distributed training with the latest refactored code, it still ends up running on a single card.

Hi, have you solved this problem? I'm running into the same issue now and would like to know the solution.

xyx361100238 commented 7 months ago

Not solved. The community doesn't pay much attention to this problem; support is mainly focused on model deployment and applications. @LauraGPT

LauraGPT commented 7 months ago

https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/paraformer

xyx361100238 commented 7 months ago

The problem now is not that distributed training is concentrated on a single card; rather, once training does start, the output shown is wrong (training goes wrong): [screenshot attached] Related to issue #1273. @LauraGPT

Zomun commented 7 months ago

My installation says the funasr.datasets.ms_dataset module doesn't exist. Which FunASR version are you using for fine-tuning?

xyx361100238 commented 7 months ago

Install the latest version (torch 2.0 or above is required), then install funasr and modelscope as described on the project homepage. The wrong output I showed above went back to normal after I increased the amount of training data.
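A quick environment check along the lines of this advice (a hedged sketch, not from the thread; the file name env_check.py is hypothetical and it only prints versions and visible GPUs):

# env_check.py -- sketch for verifying the setup suggested above
from importlib.metadata import version

import torch

print("torch      :", torch.__version__)         # the comment above asks for >= 2.0
print("funasr     :", version("funasr"))
print("modelscope :", version("modelscope"))
print("CUDA GPUs  :", torch.cuda.device_count())  # should match CUDA_VISIBLE_DEVICES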