modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com
Other

Fine-tuning automatically deletes ep files, so the required ep files cannot be found after fine-tuning finishes #1668

Open bird-9 opened 2 months ago

bird-9 commented 2 months ago

🐛 Bug

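Fine-tuning automatically deletes ep checkpoint files, so the ep files needed afterwards cannot be found once fine-tuning finishes.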

Code sample

Training parameters:

torchrun \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node ${gpu_num} \
../../../funasr/bin/train.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.batch_size=40000 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=8 \
++train_conf.max_epoch=100 \
++train_conf.log_interval=1 \
++train_conf.resume=false \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}

Error log:

Looking in the outputs directory at the end, only 20 ep files were kept, which leads to "Checkpoint file not found".

[2024-04-26 07:54:21,218][root][INFO] - Update best acc: 0.1071, /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.best
[2024-04-26 07:54:21,220][root][INFO] - Delete: /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep80  (note: during training it deletes some ep files)
[2024-04-26 07:54:21,367][root][INFO] - rank: 0, time_escaped_epoch: 0.014 hours, estimated to finish 100 epoch: 0.000 hours

average_checkpoints: ['/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep0', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep1', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep2', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep3', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep4', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep5', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep6', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep7', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep8', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep9']
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep0 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep1 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep2 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep3 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep4 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep5 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep6 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep7 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep8 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep9 not found.
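
To compare what the averaging step asks for with what is actually left on disk, a small diagnostic along the following lines may help. It is only a sketch: it assumes checkpoints follow the `model.pt.ep<N>` naming seen in the log above, and the helper names `existing_epochs` and `missing_epochs` are hypothetical, not part of FunASR.

```python
import re
from pathlib import Path

# Assumption: per-epoch checkpoints are named "model.pt.ep<N>", as in the log above.
_EP_PATTERN = re.compile(r"^model\.pt\.ep(\d+)$")


def existing_epochs(output_dir):
    """Return the sorted epoch numbers that still have a checkpoint file on disk."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    epochs = []
    for path in root.iterdir():
        match = _EP_PATTERN.match(path.name)
        if match and path.is_file():
            epochs.append(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(output_dir, wanted):
    """Return the epochs in `wanted` whose checkpoint file is not on disk."""
    present = set(existing_epochs(output_dir))
    return [ep for ep in wanted if ep not in present]
```

For the run above, `missing_epochs(output_dir, range(10))` would be expected to list every epoch from 0 to 9, since the log reports ep0 through ep9 as not found, while `existing_epochs(output_dir)` shows which epochs survived the deletions.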

Expected behavior

Environment

image

chenmiaotian commented 2 months ago

I ran into this too. The files were there at first, but then they were deleted. Have you solved it on your side?

Checkpoint file ./outputs/model.pt.ep1 not found.
Checkpoint file ./outputs/model.pt.ep2 not found.
Checkpoint file ./outputs/model.pt.ep3 not found.
Checkpoint file ./outputs/model.pt.ep4 not found.
Checkpoint file ./outputs/model.pt.ep5 not found.
Checkpoint file ./outputs/model.pt.ep6 not found.
Checkpoint file ./outputs/model.pt.ep7 not found.
Checkpoint file ./outputs/model.pt.ep8 not found.
Checkpoint file ./outputs/model.pt.ep9 not found.
Checkpoint file ./outputs/model.pt.ep10 not found.
Error executing job with overrides: ['++model=iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=data/train.jsonl', '++valid_data_set_list=data/val.jsonl', '++dataset_conf.batch_size=20000', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=false', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
  File "/mnt/workspace/FunASR/funasr/bin/train.py", line 250, in <module>
    main_hydra()
  File "/opt/conda/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/mnt/workspace/FunASR/funasr/bin/train.py", line 51, in main_hydra
    main(**kwargs)
  File "/mnt/workspace/FunASR/funasr/bin/train.py", line 244, in main
    average_checkpoints(trainer.output_dir, trainer.avg_nbest_model)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/workspace/FunASR/funasr/train_utils/average_nbest_models.py", line 65, in average_checkpoints
    raise RuntimeError("No checkpoints found for averaging.")
RuntimeError: No checkpoints found for averaging.

LauraGPT commented 1 month ago

Try setting ++train_conf.keep_nbest_models equal to ++train_conf.avg_nbest_model.
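
The condition suggested here can be checked on the override list before launching a long run. The sketch below is illustrative only: `parse_overrides` and `nbest_settings_match` are hypothetical helpers, not FunASR or Hydra functions, and they simply parse `++key=value` strings like the ones shown in this thread. A later reply in this thread reports the same error even with equal values, so passing this check does not guarantee that averaging will succeed.

```python
def parse_overrides(args):
    """Turn ['++a.b=1', ...] into {'a.b': '1', ...}; values stay strings."""
    parsed = {}
    for arg in args:
        # Strip the leading '+' characters, then split on the first '='.
        key, sep, value = arg.lstrip("+").partition("=")
        if sep:
            parsed[key] = value
    return parsed


def nbest_settings_match(args):
    """True if keep_nbest_models and avg_nbest_model are both set and equal."""
    cfg = parse_overrides(args)
    keep = cfg.get("train_conf.keep_nbest_models")
    avg = cfg.get("train_conf.avg_nbest_model")
    if keep is None or avg is None:
        return False
    return int(keep) == int(avg)


# Values taken from the commands in this thread: 20 vs. 10 does not match.
overrides = [
    "++train_conf.max_epoch=50",
    "++train_conf.keep_nbest_models=20",
    "++train_conf.avg_nbest_model=10",
]
print(nbest_settings_match(overrides))
```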

chenmiaotian commented 1 month ago

> try to keep ++train_conf.keep_nbest_models equals ++train_conf.avg_nbest_model.

Do the three parameters below have to be set to the same value? I tried the settings below and still got the same error as above:
max_epoch=50
keep_nbest_models=20
avg_nbest_model=20