FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
Traceback (most recent call last):
Traceback (most recent call last):
File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/train_bash.py", line 24, in <module>
File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/train_bash.py", line 24, in <module>
main()
File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/train_bash.py", line 11, in main
main()
File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/train_bash.py", line 11, in main
run_sft(model_args, data_args, training_args, finetuning_args)
File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/glmtuner/tuner/sft/workflow.py", line 61, in run_sft
run_sft(model_args, data_args, training_args, finetuning_args)
File "/media/cfs/lizongshang/work/deep_learning/llm/ChatGLM-ETuning/src/glmtuner/tuner/sft/workflow.py", line 61, in run_sft
train_result = trainer.train()
train_result = trainer.train()
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
return inner_training_loop(
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
result = tuple(
result = tuple(
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1349, in prepare_model
self.state.fsdp_plugin.set_auto_wrap_policy(model)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 924, in set_auto_wrap_policy
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/accelerator.py", line 1349, in prepare_model
self.state.fsdp_plugin.set_auto_wrap_policy(model)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 924, in set_auto_wrap_policy
raise Exception("Could not find the transformer layer class to wrap in the model.")
Exception: Could not find the transformer layer class to wrap in the model.
raise Exception("Could not find the transformer layer class to wrap in the model.")
Exception: Could not find the transformer layer class to wrap in the model.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5350) of binary: /usr/local/anaconda3/envs/chatglm_et/bin/python
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/chatglm_et/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/commands/launch.py", line 966, in launch_command
multi_gpu_launcher(args)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/anaconda3/envs/chatglm_et/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train_bash.py FAILED
Failures:
[1]:
time : 2023-08-02_14:18:05
host : nb-lizongshang-lzs-im2.ea-app-headless-service.techdata-gxjs-yyjs-cd.svc.bcc-lf2.jd.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 5351)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-08-02_14:18:05
host : nb-lizongshang-lzs-im2.ea-app-headless-service.techdata-gxjs-yyjs-cd.svc.bcc-lf2.jd.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5350)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html