modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
4.04k stars 357 forks source link

Help needed: fine-tuning fails with NotImplementedError: data_seed requires Accelerate version `accelerate` >= 1.1.0. This is not supported and we recommend you to update your version #2277

Closed VeitchG closed 4 days ago

VeitchG commented 1 week ago

Help needed: when fine-tuning I hit NotImplementedError: data_seed requires Accelerate version accelerate >= 1.1.0. This is not supported and we recommend you to update your version. However, the latest accelerate release available right now is only 1.0.1.
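The mismatch described above can be sketched as a plain version comparison. This is a minimal illustration, not the check that transformers itself runs: the helper names `parse_release` and `meets_minimum` are hypothetical, and the parser assumes simple dotted numeric release strings like the two quoted in the report.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as '1.0.1' into (1, 0, 1).

    Assumption: the string contains only integers separated by dots
    (no 'dev', 'rc', or local suffixes).
    """
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, required: str) -> bool:
    """Return True when `installed` is at least `required`."""
    return parse_release(installed) >= parse_release(required)


# Values from the report: newest release the poster could install vs. the
# minimum named in the error message.
print(meets_minimum("1.0.1", "1.1.0"))  # False
```

With those two values the requirement cannot be met by upgrading accelerate alone, which is why the replies turn to the transformers version instead.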

VeitchG commented 1 week ago

Running the script produces the errors below; could someone please take a look?

run sh: /root/miniconda3/envs/Qwen2-VL/bin/python -m torch.distributed.run --nproc_per_node 2 /root/Qwen2-VL/ms-swift/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path qwen/Qwen2-VL-7B-Instruct --sft_type lora --dataset test1.jsonl#20000 --deepspeed default-zero3

WARNING:__main__:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO:swift] Successfully registered /root/Qwen2-VL/ms-swift/swift/llm/data/dataset_info.json
/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[ERROR:swift] import vllm_utils error: Invalid version: 'dev'
[INFO:swift] No LMDeploy installed, if you are using LMDeploy, you will get ImportError: cannot import name 'prepare_lmdeploy_engine_template' from 'swift.llm'
[ERROR:swift] import vllm_utils error: Invalid version: 'dev'
[INFO:swift] Start time of running main: 2024-10-17 19:17:59.587550
[INFO:swift] Setting template_type: qwen2-vl
[INFO:swift] Using deepspeed: {'fp16': {'enabled': 'auto', 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 16, 'hysteresis': 2, 'min_loss_scale': 1}, 'bf16': {'enabled': 'auto'}, 'optimizer': {'type': 'AdamW', 'params': {'lr': 'auto', 'betas': 'auto', 'eps': 'auto', 'weight_decay': 'auto'}}, 'scheduler': {'type': 'WarmupCosineLR', 'params': {'total_num_steps': 'auto', 'warmup_num_steps': 'auto'}}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none', 'pin_memory': True}, 'offload_param': {'device': 'none', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', 'stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_max_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping': 'auto', 'steps_per_print': 2000, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'wall_clock_breakdown': False}
[INFO:swift] Setting args.lazy_tokenize: True
[INFO:swift] Setting args.dataloader_num_workers: 1
[2024-10-17 19:17:59,804] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:17:59,935] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:18:00,460] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-17 19:18:00,460] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
rank0: Traceback (most recent call last):
rank0:   File "/root/Qwen2-VL/ms-swift/swift/cli/sft.py", line 5, in <module>

rank0:   File "/root/Qwen2-VL/ms-swift/swift/utils/run_utils.py", line 22, in x_main
rank0:     args, remaining_argv = parse_args(args_class, argv)
rank0:   File "/root/Qwen2-VL/ms-swift/swift/utils/utils.py", line 131, in parse_args
rank0:     args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
rank0:   File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/transformers/hf_argparser.py", line 352, in parse_args_into_dataclasses
rank0:     obj = dtype(**inputs)
rank0:   File "<string>", line 215, in __init__
rank0:   File "/root/Qwen2-VL/ms-swift/swift/llm/utils/argument.py", line 1151, in __post_init__
rank0:   File "/root/Qwen2-VL/ms-swift/swift/llm/utils/argument.py", line 1203, in _init_training_args
rank0:     training_args = training_args_cls(
rank0:   File "<string>", line 144, in __init__
rank0:   File "/root/Qwen2-VL/ms-swift/swift/trainers/arguments.py", line 39, in __post_init__
rank0:   File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/transformers/training_args.py", line 2083, in __post_init__
rank0:     raise NotImplementedError(
rank0: NotImplementedError: data_seed requires Accelerate version accelerate >= 1.1.0. This is not supported and we recommend you to update your version.
[2024-10-17 19:18:00,609] [INFO] [comm.py:652:init_distributed] cdb=None
rank1: Traceback (most recent call last):
rank1:   File "/root/Qwen2-VL/ms-swift/swift/cli/sft.py", line 5, in <module>

rank1:   File "/root/Qwen2-VL/ms-swift/swift/utils/run_utils.py", line 22, in x_main
rank1:     args, remaining_argv = parse_args(args_class, argv)
rank1:   File "/root/Qwen2-VL/ms-swift/swift/utils/utils.py", line 131, in parse_args
rank1:     args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
rank1:   File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/transformers/hf_argparser.py", line 352, in parse_args_into_dataclasses
rank1:     obj = dtype(**inputs)
rank1:   File "<string>", line 215, in __init__
rank1:   File "/root/Qwen2-VL/ms-swift/swift/llm/utils/argument.py", line 1151, in __post_init__
rank1:   File "/root/Qwen2-VL/ms-swift/swift/llm/utils/argument.py", line 1203, in _init_training_args
rank1:     training_args = training_args_cls(
rank1:   File "<string>", line 144, in __init__
rank1:   File "/root/Qwen2-VL/ms-swift/swift/trainers/arguments.py", line 39, in __post_init__
rank1:   File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/transformers/training_args.py", line 2083, in __post_init__
rank1:     raise NotImplementedError(
rank1: NotImplementedError: data_seed requires Accelerate version accelerate >= 1.1.0. This is not supported and we recommend you to update your version.
W1017 19:18:01.449000 140032046114624 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 43833 closing signal SIGTERM
E1017 19:18:01.563000 140032046114624 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 43832) of binary: /root/miniconda3/envs/Qwen2-VL/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/Qwen2-VL/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/Qwen2-VL/ms-swift/swift/cli/sft.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-17_19:18:01
  host      : instance-mmufurso
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 43832)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Echo0125 commented 1 week ago

Your transformers version is too new; 4.52.2 works fine.

lemonzjk commented 5 days ago

There is no 4.52.2 either; the newest release is 4.46.0.
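Since the replies disagree about which releases exist, it may help to report exactly what is installed in the failing environment. The following is a small sketch using only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the two package names are the ones discussed in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the discussion above.
    for name, ver in installed_versions(["transformers", "accelerate"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it with the same interpreter that launches training (here `/root/miniconda3/envs/Qwen2-VL/bin/python`) so that the versions reported are the ones the traceback refers to.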

Jintao-Huang commented 5 days ago

https://github.com/modelscope/ms-swift/issues/2339