hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

ValueError: Trying to set a tensor of shape torch.Size([197002752]) in "weight" (which has shape torch.Size([128256, 3072])), this look incorrect. #5596

Open amankumarhal opened 1 week ago

amankumarhal commented 1 week ago

Reminder

System Info

Reproduction

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("./saves/llama3.2_domain/pretrain")
model = AutoModelForCausalLM.from_pretrained(
    "./saves/llama3.2_domain/pretrain", device_map="auto"
)

Expected behavior

Model should load and provide output similar to base llama3.2 models.
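
For reference, a minimal sketch of the kind of check that should succeed once loading works (the prompt string is illustrative only; the paths are the ones from the reproduction above):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./saves/llama3.2_domain/pretrain")
model = AutoModelForCausalLM.from_pretrained(
    "./saves/llama3.2_domain/pretrain", device_map="auto"
)

# Generate a short continuation to confirm the checkpoint behaves like the base model.
inputs = tokenizer("Domain-specific prompt goes here", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))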

More details

I have pretrained llama3.2 1b and llama3.2 3b models on domain-specific data. Once training finished, I encountered the following error when loading the model for inference: "ValueError: Trying to set a tensor of shape torch.Size([197002752]) in "weight" (which has shape torch.Size([128256, 3072])), this look incorrect."

Below is my yaml file (examples/train_full/llama3.2_small_full.yaml), which I ran using:

CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file examples/accelerate/fsdp_config.yaml src/train.py examples/train_full/llama3.2_small_full.yaml

### model
model_name_or_path: meta-llama/Llama-3.2-3B

### method
stage: pt
do_train: true
finetuning_type: full

### dataset
dataset: domain
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: saves/llama3.2_domain/pretrain
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 1800
gradient_checkpointing: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 1000

It would be great if someone could help me understand what's wrong with my approach.

Others

No response

Haruka1307 commented 6 days ago

I met the same error, and I guess FSDP may have caused it: FSDP "flattens" embed_tokens.weight and doesn't restore it when saving the model. I saw another issue on accelerate, https://github.com/huggingface/accelerate/issues/2374, but unluckily it doesn't give a solution...
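
If that guess is right, the usual fix on the saving side is to gather a full (unflattened) state dict from the FSDP-wrapped model before calling save_pretrained. Below is a minimal, generic sketch using the accelerate API; it is not the exact code path LLaMA-Factory uses, just an illustration of the idea:

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # FSDP plugin configured via examples/accelerate/fsdp_config.yaml
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = accelerator.prepare(model)  # wraps the model in FSDP

# ... training loop ...

accelerator.wait_for_everyone()
# Gather the full, unflattened parameters from all FSDP shards before saving,
# so tensors such as embed_tokens.weight keep their 2-D shape in the checkpoint
# instead of being stored as a flat 1-D buffer.
state_dict = accelerator.get_state_dict(model)
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "saves/llama3.2_domain/pretrain",
    state_dict=state_dict,
    save_function=accelerator.save,
)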

Haruka1307 commented 4 days ago

How about downgrading Accelerate to 0.30.0?

amankumarhal commented 4 days ago

@Haruka1307 Thank you for your comments. I can confirm that downgrading llamafactory to 0.8.3 and accelerate to 0.30.1 worked!!
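
For anyone hitting the same error: the pins that worked here were accelerate 0.30.1 and LLaMA-Factory 0.8.3. One way to apply them, assuming LLaMA-Factory is installed from a source checkout (the v0.8.3 tag name is an assumption):

pip install "accelerate==0.30.1"
git checkout v0.8.3 && pip install -e .  # run inside the LLaMA-Factory checkout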