intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Suggestions for speeding up Alpaca Qlora training on CPU? #9428

Open tsantra opened 9 months ago

tsantra commented 9 months ago

Hi,

I have ported the Alpaca QLoRA code from the GPU example to CPU. I am using Sapphire Rapids for training.

These are my code changes:

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # load_in_low_bit="nf4",  # GPU example default; per the QLoRA paper, "nf4" could yield better model quality than "int4"
    load_in_low_bit="sym_int4",   # using this for CPU
    optimize_model=False,
    # torch_dtype=torch.bfloat16,
    torch_dtype=torch.float16,    # using this for CPU
    # device_map=device_map,
    modules_to_not_convert=["lm_head"],
)
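
For context, the snippet above presumably relies on bigdl-llm's drop-in AutoModelForCausalLM (which accepts load_in_low_bit); the imports below are an assumption, since they are not shown in the post:

    import torch
    # Assumed imports (not shown in the original post): bigdl-llm's drop-in
    # replacement for the Hugging Face AutoModelForCausalLM.
    from bigdl.llm.transformers import AutoModelForCausalLM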

Using default values in code:

# training hyperparams
batch_size: int = 128,
micro_batch_size: int = 2,     # default to be 2, limited by GPU memory
num_epochs: int = 3,
learning_rate: float = 3e-5,   # default to be 3e-5 to avoid divergence
cutoff_len: int = 256,
val_set_size: int = 2000,

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=0.03,
        # warmup_steps=100,
        max_grad_norm=0.3,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        bf16=True,  # to keep training more stable
        logging_steps=1,
        optim="adamw_torch",
        evaluation_strategy="steps" if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=100 if val_set_size > 0 else None,
        save_steps=100,
        output_dir=output_dir,
        save_total_limit=100,
        load_best_model_at_end=True if val_set_size > 0 else False,
        # ddp_find_unused_parameters=False if ddp else None,
        group_by_length=group_by_length,
        report_to="wandb" if use_wandb else None,
        run_name=wandb_run_name if use_wandb else None,
        # gradient_checkpointing=gradient_checkpointing,  # commented out for CPU
        # ddp_backend="ccl",      # commented out for CPU
        # deepspeed=deepspeed,    # commented out for CPU
    ),
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)

Training has been running without any errors. However, are there any settings I should apply for better speed on CPU? Or, in general, do you have any guidance for training on CPU?

jason-dai commented 9 months ago

See the CPU QLoRA finetuning example here.
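
For reference, a minimal sketch of that CPU QLoRA flow, assuming the bigdl-llm QLoRA API of that period (bigdl.llm.transformers.qlora); the base model name and LoRA settings below are illustrative and may differ from the linked example:

    import torch
    from peft import LoraConfig
    from bigdl.llm.transformers import AutoModelForCausalLM
    from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training

    # Load the base model with 4-bit weights on CPU (model name is illustrative).
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        load_in_low_bit="sym_int4",
        optimize_model=False,
        torch_dtype=torch.bfloat16,
        modules_to_not_convert=["lm_head"],
    )
    model = prepare_model_for_kbit_training(model)

    # Wrap the quantized model with LoRA adapters (settings are illustrative).
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    # ...then train with transformers.Trainer as in the snippet above.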

tsantra commented 9 months ago

@jason-dai Thank you for replying. I am trying to train Alpaca QLoRA on CPU; I modified the GPU Alpaca code to run on CPU. It is running without any errors so far, but it is slow. Apart from using bigdl-llm-init for speed-up on CPU, would you suggest any other changes to speed up the Alpaca code without affecting training stability? I am using the default hyperparameters as set in the code.

Using default values in code:

# training hyperparams
batch_size: int = 128,
micro_batch_size: int = 2,
num_epochs: int = 3,
learning_rate: float = 3e-5,   # default to be 3e-5 to avoid divergence
cutoff_len: int = 256,
val_set_size: int = 2000,

gradient_accumulation_steps = batch_size / micro_batch_size = 64

glorysdj commented 9 months ago

Hi @tsantra, apart from using bigdl-llm-init, you may also try tuning the batch size to a smaller value.
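
A minimal sketch of one way to read that suggestion, assuming the hyperparameter names used earlier in this thread (the values here are illustrative): shrink micro_batch_size and raise gradient_accumulation_steps so the effective batch size stays at 128 while the memory footprint of each optimizer step drops. (bigdl-llm-init itself is typically sourced in the shell before launching training to set CPU-related environment variables.)

    import transformers

    # Illustrative values, assuming the hyperparameter names from the Alpaca script above.
    batch_size = 128                 # effective batch size, unchanged
    micro_batch_size = 1             # smaller per-step batch on CPU (was 2)
    gradient_accumulation_steps = batch_size // micro_batch_size  # 128

    training_args = transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        bf16=True,                   # Sapphire Rapids supports bfloat16 natively
        output_dir="./outputs",      # hypothetical path
    )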