hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Error when running examples/train_lora/llama3_lora_predict.yaml on a fine-tuned GLM-4-9B-Chat #5447

Open Twilightsh opened 2 months ago

Twilightsh commented 2 months ago

Reminder

System Info

Reproduction

Parameters

model

model_name_or_path: model/glm-4-9b-chat
adapter_name_or_path: saves/glm49bchat/term/lora/sft/test

method

stage: sft
do_predict: true
finetuning_type: lora

dataset

eval_dataset: identity,alpaca_en_demo
template: glm4
cutoff_len: 1024
max_samples: 50
overwrite_cache: False
preprocessing_num_workers: 16

output

output_dir: score/glm49bchat/term/lora/sft/test
overwrite_output_dir: true

eval

per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000
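For reference, the settings above form one flat YAML file that is passed to the LLaMA-Factory CLI. A minimal sketch that parses such a file and echoes the fields most relevant to this report (the filename glm4_lora_predict.yaml is hypothetical, and PyYAML is assumed to be installed):

import yaml  # PyYAML, assumed available in the same environment

# Hypothetical filename for the prediction config shown above.
with open("glm4_lora_predict.yaml") as f:
    cfg = yaml.safe_load(f)

# Echo the settings most relevant to this report.
for key in ("model_name_or_path", "adapter_name_or_path", "template",
            "do_predict", "predict_with_generate"):
    print(f"{key} = {cfg.get(key)}")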

Error message

Loading checkpoint shards:  10%|████▌ | 1/10 [00:00<00:06, 1.47it/s]
09/16/2024 12:38:24 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
Loading checkpoint shards: 100%|████████████████████████████████████████████| 10/10 [00:06<00:00, 1.46it/s]
[INFO|modeling_utils.py:4507] 2024-09-16 12:38:30,387 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[INFO|modeling_utils.py:4515] 2024-09-16 12:38:30,387 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /home/sunhao/glm-4-9b-chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:991] 2024-09-16 12:38:30,390 >> loading configuration file /home/sunhao/glm-4-9b-chat/generation_config.json
[INFO|configuration_utils.py:1038] 2024-09-16 12:38:30,391 >> Generate config GenerationConfig {
  "do_sample": true,
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "max_length": 128000,
  "pad_token_id": 151329,
  "temperature": 0.8,
  "top_p": 0.8
}

09/16/2024 12:38:30 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
09/16/2024 12:38:30 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
09/16/2024 12:38:30 - INFO - llamafactory.model.adapter - Loaded adapter(s): saves/glm49bchat/term/lora/sft/test
09/16/2024 12:38:30 - INFO - llamafactory.model.loader - all params: 9,399,951,360
Loading checkpoint shards:  90%|████████████████████████████████████████▌ | 9/10 [00:06<00:00, 1.44it/s]
[INFO|trainer.py:3819] 2024-09-16 12:38:31,279 >> ***** Running Prediction *****
[INFO|trainer.py:3821] 2024-09-16 12:38:31,279 >>   Num examples = 100
[INFO|trainer.py:3824] 2024-09-16 12:38:31,279 >>   Batch size = 1
Loading checkpoint shards: 100%|████████████████████████████████████████████| 10/10 [00:06<00:00, 1.44it/s]
09/16/2024 12:38:31 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
09/16/2024 12:38:32 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
09/16/2024 12:38:32 - INFO - llamafactory.model.adapter - Loaded adapter(s): saves/glm49bchat/term/lora/sft/test
09/16/2024 12:38:32 - INFO - llamafactory.model.loader - all params: 9,399,951,360
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/sunhao/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:   File "/home/sunhao/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:   File "/home/sunhao/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/home/sunhao/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 117, in run_sft
[rank0]:     predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 244, in predict
[rank0]:     return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/transformers/trainer.py", line 3744, in predict
[rank0]:     output = eval_loop(
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/transformers/trainer.py", line 3857, in evaluation_loop
[rank0]:     losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank0]:   File "/home/sunhao/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 104, in prediction_step
[rank0]:     loss, generated_tokens, _ = super().prediction_step(  # ignore the returned labels (may be truncated)
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 310, in prediction_step
[rank0]:     generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/transformers/generation/utils.py", line 2024, in generate
[rank0]:     result = self._sample(
[rank0]:   File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/transformers/generation/utils.py", line 3032, in _sample
[rank0]:     model_kwargs = self._update_model_kwargs_for_generation(
[rank0]:   File "/home/sunhao/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 812, in _update_model_kwargs_for_generation
[rank0]:     model_kwargs["past_key_values"] = self._extract_past_from_model_output(
[rank0]: TypeError: GenerationMixin._extract_past_from_model_output() got an unexpected keyword argument 'standardize_cache_format'
W0916 12:38:33.499000 140091173263168 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1022548 closing signal SIGTERM
E0916 12:38:33.914000 140091173263168 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1022547) of binary: /home/sunhao/anaconda3/envs/LLaMA-Factory/bin/python
Traceback (most recent call last):
  File "/home/sunhao/anaconda3/envs/LLaMA-Factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sunhao/anaconda3/envs/LLaMA-Factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/sunhao/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-16_12:38:33
  host      : lab402
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1022547)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Expected behavior

No response

Others

No response
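For context: the TypeError is raised because the model's remote code (modeling_chatglm.py, the frame at line 812 above) passes the keyword standardize_cache_format to GenerationMixin._extract_past_from_model_output(), which the transformers build installed in this environment evidently no longer accepts. A minimal check of the installed API, assuming only that the same transformers environment is importable, might look like:

import inspect

import transformers
from transformers.generation.utils import GenerationMixin

print("transformers version:", transformers.__version__)

# The helper the GLM-4 remote code calls; even newer releases may drop it entirely.
fn = getattr(GenerationMixin, "_extract_past_from_model_output", None)
if fn is None:
    print("_extract_past_from_model_output is not defined in this transformers build")
else:
    params = inspect.signature(fn).parameters
    # False here means the call in modeling_chatglm.py will raise the TypeError above.
    print("accepts standardize_cache_format:", "standardize_cache_format" in params)

If the keyword is indeed missing from the installed signature, the directions commonly reported for this error are either running with a transformers release that still accepts standardize_cache_format or pulling an updated glm-4-9b-chat checkpoint whose modeling_chatglm.py no longer passes it; neither is confirmed here as the fix for this specific setup.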
2551120365 commented 1 month ago

I'm running into the same problem. Have you managed to solve it?