hiyouga / ChatGLM-Efficient-Tuning

Fine-tuning ChatGLM-6B with PEFT | Efficient ChatGLM fine-tuning based on PEFT
Apache License 2.0

Multi-GPU fine-tuning error: Exception: Could not find the transformer layer class to wrap in the model. #378

Closed: lrx1213 closed this issue 1 year ago

lrx1213 commented 1 year ago

FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
Traceback (most recent call last):
  File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/train_bash.py", line 24, in <module>
    main()
  File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/train_bash.py", line 11, in main
    run_sft(model_args, data_args, training_args, finetuning_args)
  File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/glmtuner/tuner/sft/workflow.py", line 61, in run_sft
    train_result = trainer.train()
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1349, in prepare_model
    self.state.fsdp_plugin.set_auto_wrap_policy(model)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 924, in set_auto_wrap_policy
    raise Exception("Could not find the transformer layer class to wrap in the model.")
Exception: Could not find the transformer layer class to wrap in the model.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5350) of binary: /usr/local/anaconda3/envs/chatglm_et/bin/python
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/chatglm_et/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/commands/launch.py", line 966, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train_bash.py FAILED

Failures:
[1]:
  time       : 2023-08-02_14:18:05
  host       : nb-lizongshang-lzs-im2.ea-app-headless-service.techdata-gxjs-yyjs-cd.svc.bcc-lf2.jd.local
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 5351)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-08-02_14:18:05
  host       : nb-lizongshang-lzs-im2.ea-app-headless-service.techdata-gxjs-yyjs-cd.svc.bcc-lf2.jd.local
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 5350)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
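
For context on what the exception means: the traceback ends in accelerate's FullyShardedDataParallelPlugin.set_auto_wrap_policy, which, for the TRANSFORMER_BASED_WRAP auto-wrap policy, looks up the configured transformer layer class name in the loaded model and raises this exception when no module class matches. Below is a minimal sketch, not the repository's own fix, of building the same transformer-based wrap policy explicitly for ChatGLM-6B; it assumes the v1 THUDM/chatglm-6b modeling code, where the decoder blocks (class GLMBlock) live under model.transformer.layers.

# A minimal sketch (assumption: v1 THUDM/chatglm-6b layout, block class GLMBlock):
# build FSDP's transformer-based auto-wrap policy explicitly instead of relying
# on the name lookup that raises "Could not find the transformer layer class to
# wrap in the model."
import functools

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModel

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# Read the block class off the loaded model rather than importing it,
# since ChatGLM ships its modeling code via trust_remote_code.
glm_block_cls = type(model.transformer.layers[0])
print(glm_block_cls.__name__)  # expected to print: GLMBlock

fsdp_plugin = FullyShardedDataParallelPlugin(
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={glm_block_cls},
    ),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
# Under `accelerate launch` on multiple GPUs with FSDP enabled,
# accelerator.prepare(model) would then shard each GLMBlock.

If the stock name-based lookup is kept instead, setting fsdp_transformer_layer_cls_to_wrap to GLMBlock in the accelerate FSDP config (surfaced to set_auto_wrap_policy through the FSDP_TRANSFORMER_CLS_TO_WRAP environment variable, at least in accelerate versions around the one shown in this traceback) should satisfy the same check, again assuming GLMBlock is the block class of the checkpoint being fine-tuned.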