intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Suggestions for speeding up Alpaca Qlora training on CPU? #9428

Open tsantra opened 9 months ago

tsantra commented 9 months ago

Hi,

I have ported the Alpaca QLoRA code from the GPU example to CPU. I am using Sapphire Rapids for training.

These are my code changes:

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # load_in_low_bit="nf4",  # GPU example default; per the QLoRA paper, "nf4" could yield better model quality than "int4"
    load_in_low_bit="sym_int4",   # using this for CPU
    optimize_model=False,
    # torch_dtype=torch.bfloat16,
    torch_dtype=torch.float16,    # using this for CPU
    # device_map=device_map,
    modules_to_not_convert=["lm_head"],
)
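
For context, the snippet above presumably relies on bigdl-llm's drop-in AutoModelForCausalLM (which accepts load_in_low_bit); the imports below are an assumption, since they are not shown in the post:

    import torch
    # Assumed imports (not shown in the original post): bigdl-llm's drop-in
    # replacement for the Hugging Face AutoModelForCausalLM.
    from bigdl.llm.transformers import AutoModelForCausalLM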

Using default values in code:

# training hyperparams
batch_size: int = 128,
micro_batch_size: int = 2,     # default to be 2, limited by GPU memory
num_epochs: int = 3,
learning_rate: float = 3e-5,   # default to be 3e-5 to avoid divergence
cutoff_len: int = 256,
val_set_size: int = 2000,

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=0.03,
        # warmup_steps=100,
        max_grad_norm=0.3,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        bf16=True,  # to keep training more stable
        logging_steps=1,
        optim="adamw_torch",
        evaluation_strategy="steps" if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=100 if val_set_size > 0 else None,
        save_steps=100,
        output_dir=output_dir,
        save_total_limit=100,
        load_best_model_at_end=True if val_set_size > 0 else False,
        # ddp_find_unused_parameters=False if ddp else None,
        group_by_length=group_by_length,
        report_to="wandb" if use_wandb else None,
        run_name=wandb_run_name if use_wandb else None,
        # gradient_checkpointing=gradient_checkpointing,  # commented out for CPU
        # ddp_backend="ccl",      # commented out for CPU
        # deepspeed=deepspeed,    # commented out for CPU
    ),
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)

Training has been running without any errors. However, are there any settings I should apply for better speed on CPU? Or, in general, do you have any guidance for training on CPU?

jason-dai commented 9 months ago

See the CPU QLoRA finetuning example here.
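
For reference, a minimal sketch of that CPU QLoRA flow, assuming the bigdl-llm QLoRA API of that period (bigdl.llm.transformers.qlora); the base model name and LoRA settings below are illustrative and may differ from the linked example:

    import torch
    from peft import LoraConfig
    from bigdl.llm.transformers import AutoModelForCausalLM
    from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training

    # Load the base model with 4-bit weights on CPU (model name is illustrative).
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        load_in_low_bit="sym_int4",
        optimize_model=False,
        torch_dtype=torch.bfloat16,
        modules_to_not_convert=["lm_head"],
    )
    model = prepare_model_for_kbit_training(model)

    # Wrap the quantized model with LoRA adapters (settings are illustrative).
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    # ...then train with transformers.Trainer as in the snippet above.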

tsantra commented 9 months ago

@jason-dai Thank you for replying. I am trying to train Alpaca QLoRA on CPU; I modified the GPU Alpaca code to run on CPU. It is running without any errors so far, but it is slow. Apart from using bigdl-llm-init for speed-up on CPU, would you suggest any other changes to speed up the Alpaca code without affecting training stability? I am using the default hyperparameters as set in the code.

Using default values in code:

# training hyperparams
batch_size: int = 128,
micro_batch_size: int = 2,
num_epochs: int = 3,
learning_rate: float = 3e-5,   # default to be 3e-5 to avoid divergence
cutoff_len: int = 256,
val_set_size: int = 2000,

gradient_accumulation_steps = batch_size / micro_batch_size = 64

glorysdj commented 9 months ago

Hi @tsantra, apart from using bigdl-llm-init, you may also try tuning the batch size to a smaller value.
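
A minimal sketch of one way to read that suggestion, assuming the hyperparameter names used earlier in this thread (the values here are illustrative): shrink micro_batch_size and raise gradient_accumulation_steps so the effective batch size stays at 128 while the memory footprint of each optimizer step drops. (bigdl-llm-init itself is typically sourced in the shell before launching training to set CPU-related environment variables.)

    import transformers

    # Illustrative values, assuming the hyperparameter names from the Alpaca script above.
    batch_size = 128                 # effective batch size, unchanged
    micro_batch_size = 1             # smaller per-step batch on CPU (was 2)
    gradient_accumulation_steps = batch_size // micro_batch_size  # 128

    training_args = transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        bf16=True,                   # Sapphire Rapids supports bfloat16 natively
        output_dir="./outputs",      # hypothetical path
    )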