hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Errors while fine-tuning internlm2-chat-20b with QLoRA #3798

Open a1exyu opened 4 months ago

a1exyu commented 4 months ago

Reminder

Reproduction

CUDA_VISIBLE_DEVICES=1 llamafactory-cli example/...... Below is the yaml file:

### model
model_name_or_path: /home/ybh/ybh/models/internlm2-chat-20b
quantization_bit: 4

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: wqkv

### dataset
dataset: text_classification_coarse
template: intern2
cutoff_len: 6144
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/ybh/ybh/nlpcc/LLaMA-Factory/saves/internlm2-chat-20b/qlora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 10

Expected behavior

No response

System Info

[INFO|trainer.py:2048] 2024-05-18 00:07:10,006 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-18 00:07:10,006 >> Num examples = 122
[INFO|trainer.py:2050] 2024-05-18 00:07:10,006 >> Num Epochs = 5
[INFO|trainer.py:2051] 2024-05-18 00:07:10,006 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2054] 2024-05-18 00:07:10,006 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2055] 2024-05-18 00:07:10,006 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-18 00:07:10,006 >> Total optimization steps = 75
[INFO|trainer.py:2057] 2024-05-18 00:07:10,007 >> Number of trainable parameters = 2,621,440
  0%|          | 0/75 [00:00<?, ?it/s]
/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Traceback (most recent call last):
  File "/home/ybh/miniconda3/envs/nlpcc/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/cli.py", line 65, in main
    run_exp()
  File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/accelerate/accelerator.py", line 2121, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
  0%|          | 0/75 [00:00<?, ?it/s]

Others

I used LoRA to fine-tune internlm-chat-7b instead, and this error did not happen.
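
The two warnings in the log above ("the use_reentrant parameter should be passed explicitly" and "None of the inputs have requires_grad=True") are the usual hint that gradient checkpointing is running over a quantized, frozen base model whose checkpointed inputs carry no gradient. A minimal sketch of the common workaround in plain transformers/peft code follows; the model path and lora_target are taken from the config above, everything else is an assumption and this is not a fix confirmed in this thread:

# Minimal sketch (assumes transformers, peft and bitsandbytes are installed).
# Shows the usual workaround for "element 0 of tensors does not require grad":
# let the embedding outputs require grad and use non-reentrant checkpointing.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "/home/ybh/ybh/models/internlm2-chat-20b",  # path from the issue; adjust as needed
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# 1) make the (frozen, quantized) embedding outputs require grad so the
#    checkpointed blocks have something to backpropagate through
model.enable_input_require_grads()

# 2) state the checkpointing variant explicitly instead of relying on the old default;
#    custom remote-code models may not forward this kwarg
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

model = get_peft_model(
    model,
    LoraConfig(r=8, target_modules=["wqkv"], task_type="CAUSAL_LM"),
)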

gabriel-peracio commented 4 months ago

Yes, same here, though in my case I tried it with internlm2-20b (base, non-chat)

The same configuration, but applied to internlm2-7b, appears to work (I did not allow it to conclude as I am not interested in that model)

hiahia121 commented 1 week ago

I hit the same error with telechat-12b.

09/25/2024 07:48:54 - WARNING - llamafactory.model.model_utils.checkpointing - You are using the old GC format, some features (e.g. BAdam) will be invalid.
09/25/2024 07:48:54 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 07:48:54 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
09/25/2024 07:48:54 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 07:48:54 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 07:48:54 - INFO - llamafactory.model.model_utils.misc - Found linear modules: key_value,down_proj,dense,query,gate_proj,up_proj
09/25/2024 07:48:56 - INFO - llamafactory.model.loader - trainable params: 18677760 || all params: 7218696192 || trainable%: 0.2587
[INFO|trainer.py:648] 2024-09-25 07:48:56,887 >> Using auto half precision backend
[INFO|trainer.py:2134] 2024-09-25 07:48:57,289 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-09-25 07:48:57,290 >> Num examples = 4,999
[INFO|trainer.py:2136] 2024-09-25 07:48:57,290 >> Num Epochs = 1
[INFO|trainer.py:2137] 2024-09-25 07:48:57,290 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2024-09-25 07:48:57,290 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2141] 2024-09-25 07:48:57,290 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-09-25 07:48:57,290 >> Total optimization steps = 624
[INFO|trainer.py:2143] 2024-09-25 07:48:57,293 >> Number of trainable parameters = 18,677,760
  0%|          | 0/624 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/app/src/llamafactory/cli.py", line 96, in main
    run_exp()
  File "/app/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/app/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3363, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1577, in forward
    return self.base_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 188, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7b/modeling_telechat.py", line 799, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7b/modeling_telechat.py", line 709, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 417, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 460, in checkpoint
    raise ValueError(
ValueError: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
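
The ValueError at the end of this trace is torch refusing to pick a use_reentrant default because the model's remote code (modeling_telechat.py) calls torch.utils.checkpoint.checkpoint without it. A small sketch of what passing it explicitly looks like; the block and tensors below are placeholders rather than the actual telechat layers, and this is not a maintainer-confirmed fix:

# Sketch of the change torch is asking for: pass use_reentrant explicitly when
# calling torch.utils.checkpoint.checkpoint (e.g. in the cached modeling_telechat.py).
# `block` and `hidden_states` are placeholders for the model's real layer call.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Linear(16, 16)                                # stand-in for a transformer block
hidden_states = torch.randn(2, 16, requires_grad=True)   # stand-in for the layer input

# Before (what the traceback shows, relying on the old implicit default):
#   outputs = torch.utils.checkpoint.checkpoint(block, hidden_states)
# After: state the variant explicitly; non-reentrant is the recommended one.
outputs = checkpoint(block, hidden_states, use_reentrant=False)
outputs.sum().backward()                                 # gradients flow as usual

On a recent transformers, the same request can also be made with model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False}), though models using the old GC format (as the warning at the top of this log notes) may not forward it to their own checkpoint calls.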