intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Could not use SFT Trainer in qlora_finetuning.py #12356

Open shungyantham opened 5 days ago

shungyantham commented 5 days ago

I have installed trl<0.12.0 to run qlora_finetune.py in the QLoRA/trl-example, but it requires transformers 4.46.2, which causes the error below.

[screenshot: incorrect transformers version error]

So I downgraded trl from 0.11.4 to 0.9.6 and I got another padding error.

[screenshot: padding error]
qiyuangong commented 4 days ago

> I have installed trl<0.12.0 to run qlora_finetune.py in the QLoRA/trl-example, but it requires transformers 4.46.2, which causes the error below. [screenshot: incorrect transformers version error]
>
> So I downgraded trl from 0.11.4 to 0.9.6 and I got another padding error. [screenshot: padding error]

These errors are caused by a transformers version mismatch. Can you downgrade transformers to 4.36.0?

    pip install transformers==4.36.0 datasets
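
To confirm the downgrade took effect before rerunning the script, a quick sanity check can help. This is just a sketch; the expected trl version in the comments assumes the 0.9.6 pin discussed in this thread.

    # Quick check that the pinned versions are the ones actually imported.
    import transformers
    import trl

    print("transformers:", transformers.__version__)  # expected 4.36.0 per the suggestion above
    print("trl:", trl.__version__)                     # expected 0.9.6 per this thread
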
shungyantham commented 4 days ago

Hi, I also downgraded transformers to 4.36.0 when I downgraded trl to 0.9.6, and I got this error:

shungyantham commented 4 days ago

https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/finetune/xpu/Dockerfile

I built this Dockerfile and then manually ran pip install trl==0.9.6 in the Docker container. Then I ran qlora_finetune.py in LLM_Finetuning/QLoRA/trl-example. Is there anything I missed?

qiyuangong commented 4 days ago

> https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/finetune/xpu/Dockerfile
>
> I built this Dockerfile and then manually ran pip install trl==0.9.6 in the Docker container. Then I ran qlora_finetune.py in LLM_Finetuning/QLoRA/trl-example. Is there anything I missed?

Hi @shungyantham, we have reproduced this issue in our local environment.

Please modify qlora_finetune.py at Line 91: pass data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True) to SFTTrainer.

The code should look like this:

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_data,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=4,
            gradient_accumulation_steps=1,
            warmup_steps=20,
            max_steps=200,
            learning_rate=2e-5,
            save_steps=100,
            bf16=True,  # bf16 is more stable in training
            logging_steps=20,
            output_dir="outputs",
            optim="adamw_hf", # paged_adamw_8bit is not supported yet
            gradient_checkpointing=True, # can further reduce memory but slower
        ),
        dataset_text_field="instruction",
        # pad each batch to a common length (a multiple of 8) so variable-length examples can be stacked
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
        ),
    )
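
For context, the collator added above pads every example in a batch to a common length (rounded up to a multiple of 8), which is what lets variable-length instruction samples be stacked into a single tensor. Below is a minimal standalone sketch of that behavior, separate from qlora_finetune.py; the gpt2 tokenizer and the toy sentences are placeholders chosen only because they download without authentication, not what the script actually uses.

    # Standalone illustration of what DataCollatorForSeq2Seq does with these settings.
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

    collator = transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    )

    # Two tokenized examples of different lengths, as the dataset mapping step would produce.
    features = [
        tokenizer("Explain quantization in one sentence."),
        tokenizer("Hi"),
    ]

    batch = collator(features)
    # Both sequences are now padded to the same length (a multiple of 8),
    # so they can be batched together during training.
    print(batch["input_ids"].shape)
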
qiyuangong commented 4 days ago

https://github.com/intel-analytics/ipex-llm/pull/12368

shungyantham commented 1 day ago

Hi @qiyuangong, I have faced another issue after adding the padding collator to the Trainer:

[screenshot of the new error]
qiyuangong commented 1 day ago

> Hi @qiyuangong, I have faced another issue after adding the padding collator to the Trainer. [screenshot of the new error]

Please provide your transformers and trl versions, as well as your finetune.py.
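
For reference, a quick way to gather the requested version information in one go; the distribution names below are assumed to be the standard PyPI ones.

    # Print package versions to paste into the issue reply.
    from importlib.metadata import PackageNotFoundError, version

    for pkg in ("transformers", "trl", "torch", "ipex-llm"):
        try:
            print(pkg, version(pkg))
        except PackageNotFoundError:
            print(pkg, "not installed")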