hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Error when running the examples: RuntimeError: 'weight' must be 2-D #3947

Closed: macqueen09 closed this issue 3 months ago

macqueen09 commented 3 months ago
> The PPO stage does not support DeepSpeed; only Accelerate is supported.

What should the Accelerate launch command look like, and can I specify which GPUs to use?

Originally posted by @yuye2133 in https://github.com/hiyouga/LLaMA-Factory/issues/831#issuecomment-1709987193

The default command

CUDA_VISIBLE_DEVICES=7,6,3,4,5,2 llamafactory-cli train examples/lora_multi_gpu/llama3_lora_sft_ds.yaml

also fails:

[rank0]: File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank0]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: RuntimeError: 'weight' must be 2-D

It looks like the same error as the PPO-does-not-support-DeepSpeed case. How do I switch to Accelerate?
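
For reference on the quoted question: with Accelerate you choose GPUs through CUDA_VISIBLE_DEVICES and pass a matching --num_processes to accelerate launch. A minimal sketch, where the src/train.py entry point and the YAML path are assumptions rather than verified against this repository version:

# Hedged sketch: make two GPUs visible and start two processes.
# src/train.py and the YAML path are assumptions; adjust to your checkout.
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --multi_gpu --num_processes 2 \
    src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml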

macqueen09 commented 3 months ago

transformers 4.39.0, torch 2.3.0, llamafactory 0.7.2.dev0, flash-attn 2.5.8, deepspeed 0.14.0, accelerate 0.29.1, CUDA 12.1

macqueen09 commented 3 months ago

CUDA_VISIBLE_DEVICES=1 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml (the default single-GPU LoRA command) works fine; every multi-GPU run with DeepSpeed fails with this error.

hiyouga commented 3 months ago

https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#supervised-fine-tuning-with-accelerate-on-single-node
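
Roughly, that linked example drives the same training YAML through accelerate launch with a prepared Accelerate config file. A sketch of that style of launch; the config and script paths below are assumptions, use the files actually named in the linked README:

# Hedged sketch of a single-node multi-GPU launch via an Accelerate config file.
# examples/accelerate/single_config.yaml and src/train.py are assumptions.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file examples/accelerate/single_config.yaml \
    src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml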

macqueen09 commented 3 months ago

@hiyouga Thanks a lot, but I tried the examples there and each one fails in a different way.

CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/full_multi_gpu/llama3_full_sft.yaml

raises the error in the title.

CUDA_VISIBLE_DEVICES=4,1,2,3,6,5 llamafactory-cli train examples/lora_multi_gpu/llama3_lora_sft.yaml

fine-tunes llama3_8B_instruct (Llama-3 8B) and runs out of GPU memory even on six 80 GB cards, so I suspect there is some other configuration problem.

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:32<00:00,  8.01s/it]
[INFO|modeling_utils.py:4024] 2024-05-28 21:11:25,755 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4032] 2024-05-28 21:11:25,755 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /supercloud/llm-code/mkl/dataset/llama3/llama3_8B_instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:881] 2024-05-28 21:11:25,757 >> loading configuration file /supercloud/llm-code/mkl/dataset/llama3/llama3_8B_instruct/generation_config.json
[INFO|configuration_utils.py:928] 2024-05-28 21:11:25,758 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

05/28/2024 21:11:25 - INFO - llamafactory.model.utils.checkpointing - Gradient checkpointing enabled.
05/28/2024 21:11:25 - INFO - llamafactory.model.utils.attention - Using torch SDPA for faster training and inference.
05/28/2024 21:11:25 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
05/28/2024 21:11:25 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
05/28/2024 21:11:26 - INFO - llamafactory.model.loader - trainable params: 3407872 || all params: 8033669120 || trainable%: 0.0424
/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:607] 2024-05-28 21:11:27,298 >> Using auto half precision backend
[INFO|trainer.py:1969] 2024-05-28 21:11:27,493 >> ***** Running training *****
[INFO|trainer.py:1970] 2024-05-28 21:11:27,493 >>   Num examples = 981
[INFO|trainer.py:1971] 2024-05-28 21:11:27,493 >>   Num Epochs = 3
[INFO|trainer.py:1972] 2024-05-28 21:11:27,493 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1974] 2024-05-28 21:11:27,493 >>   Training with DataParallel so batch size has been adjusted to: 6
[INFO|trainer.py:1975] 2024-05-28 21:11:27,493 >>   Total train batch size (w. parallel, distributed & accumulation) = 12
[INFO|trainer.py:1976] 2024-05-28 21:11:27,493 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1977] 2024-05-28 21:11:27,493 >>   Total optimization steps = 246
[INFO|trainer.py:1978] 2024-05-28 21:11:27,494 >>   Number of trainable parameters = 3,407,872
  0%|                                                                                                                             | 0/246 [00:00<?, ?it/s]
/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "/opt/anaconda3/envs/lavis/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/supercloud/llm-code/mkl/project/LLaMA-Factory/src/llamafactory/cli.py", line 65, in main
    run_exp()
  File "/supercloud/llm-code/mkl/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/supercloud/llm-code/mkl/project/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/transformers/trainer.py", line 3045, in training_step
    self.accelerator.backward(loss)
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/accelerate/accelerator.py", line 2013, in backward
    loss.backward(**kwargs)
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/autograd/function.py", line 301, in apply
    return user_fn(self, *args)
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 320, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/opt/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 
  0%|                                                                                                                             | 0/246 [00:31<?, ?it/s
macqueen09 commented 3 months ago

With /supercloud/llm-code/mkl/project/LLaMA-Factory/examples/lora_multi_gpu/single_node.sh, multi-GPU training via Accelerate works fine, but the examples in the link you gave do hit the error above and the out-of-memory problem. For the latter, I tried other deepspeed JSON configs and dropped the batch size to 1, and it is still not enough; even with 7 x 80 GB GPUs it still runs out of memory.
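
For the out-of-memory case, a few hedged knobs worth trying in the training YAML before relaunching; the field names are the usual LLaMA-Factory/transformers training arguments, and the DeepSpeed config path is an assumption about what ships under examples/deepspeed/:

# Hedged memory-saving suggestions (edit the training YAML, then relaunch):
#   per_device_train_batch_size: 1
#   cutoff_len: 1024                                          # shorter sequences, smaller activations
#   deepspeed: examples/deepspeed/ds_z3_offload_config.json   # ZeRO-3 with CPU offload; path is an assumption
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 llamafactory-cli train examples/lora_multi_gpu/llama3_lora_sft_ds.yaml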

hiyouga commented 3 months ago

Update the code.

hxtkyne commented 3 months ago

> Update the code.

I have already updated to the latest code and trained with the ds.yaml config from the docs; the error is still there.

sunzhufeng12345 commented 3 months ago

Same here: I have updated to the latest code, but running the multi-GPU fine-tuning example provided still raises this error.

coranholmes commented 3 months ago

> With /supercloud/llm-code/mkl/project/LLaMA-Factory/examples/lora_multi_gpu/single_node.sh, multi-GPU training via Accelerate works fine, but the examples in the link you gave do hit the error above and the out-of-memory problem. For the latter, I tried other deepspeed JSON configs and dropped the batch size to 1, and it is still not enough; even with 7 x 80 GB GPUs it still runs out of memory.

I ran into the same problem; pulling the latest code did not help either. Did you manage to solve it?

macqueen09 commented 3 months ago

Pulled the latest code and it still fails.

@hxtkyne Did you find a solution? With the latest code, none of the examples work for me either; only by switching to the .sh file in the corresponding directory and using the torchrun launch written in it (sketched below) can I fine-tune on multiple GPUs.
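
For anyone landing here, a rough sketch of a torchrun-style multi-GPU launch like the one in those shell scripts; the entry script and YAML path are assumptions, check the .sh file shipped with the examples for the exact invocation:

# Hedged sketch of a torchrun launch on one node with four GPUs.
# src/train.py is an assumption about the entry point used by the scripts.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 \
    src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml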

yetzi1975 commented 2 months ago

I hit the 'weight' must be 2-D error because I had not run pip install -e ".[torch,metrics]" during installation. Once I ran it, the error went away.
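
A quick way to apply and sanity-check that fix from the repository root; the extras follow the comment above, and the pip output format may vary:

# Reinstall LLaMA-Factory in editable mode with the extras mentioned above.
pip install -e ".[torch,metrics]"
# Confirm the interpreter now resolves the editable checkout.
pip show llamafactory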