unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Fine tune and infer llama3 with cpu #1037

Open SidneyLann opened 2 months ago

SidneyLann commented 2 months ago

import logging
import os
import json
import torch
from datasets import load_from_disk
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000

# Defining the configuration for the base model, LoRA and training
config = {
    "hugging_face_username": "Shekswess",
    "model_config": {
        "base_model": os.path.join(DATA_HOME, "model_root/model_en"),                          # The base model
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),  # The fine-tuned model
        "max_seq_length": MAX_SEQ_LENGTH,  # The maximum sequence length
        "dtype": torch.float16,            # The data type
        "load_in_4bit": True,              # Load the model in 4-bit
    },
    "lora_config": {
        "r": 16,                             # The LoRA rank: 8, 16, 32, 64
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],  # The target modules
        "lora_alpha": 16,                    # The alpha value for LoRA
        "lora_dropout": 0,                   # The dropout value for LoRA
        "bias": "none",                      # The bias for LoRA
        "use_gradient_checkpointing": True,  # Use gradient checkpointing
        "use_rslora": False,                 # Use RSLoRA
        "use_dora": False,                   # Use DoRA
        "loftq_config": None,                # The LoftQ configuration
    },
    "training_dataset": {
        "name": os.path.join(DATA_HOME, "dataset_gen"),  # The dataset name (huggingface/datasets)
        "split": "train",          # The dataset split
        "input_field": "prompt",   # The input field
    },
    "training_config": {
        "per_device_train_batch_size": 1,  # The batch size
        "gradient_accumulation_steps": 1,  # The gradient accumulation steps
        "warmup_steps": 5,                 # The warmup steps
        "max_steps": 0,                    # The maximum steps (0 if the epochs are defined)
        "num_train_epochs": 1,             # The number of training epochs (0 if the maximum steps are defined)
        "learning_rate": 2e-4,             # The learning rate
        "fp16": not torch.cuda.is_bf16_supported(),  # fp16
        "bf16": torch.cuda.is_bf16_supported(),      # bf16
        "logging_steps": 1,                # The logging steps
        "optim": "adamw_8bit",             # The optimizer
        "weight_decay": 0.01,              # The weight decay
        "lr_scheduler_type": "linear",     # The learning rate scheduler
        "seed": 42,                        # The seed
        "output_dir": "outputs",           # The output directory
    },
}

# Loading the model and the tokenizer for the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = config.get("model_config").get("base_model"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dtype = config.get("model_config").get("dtype"),
    load_in_4bit = config.get("model_config").get("load_in_4bit"),
)

# Setup for QLoRA/LoRA PEFT of the base model
model = FastLanguageModel.get_peft_model(
    model,
    r = config.get("lora_config").get("r"),
    target_modules = config.get("lora_config").get("target_modules"),
    lora_alpha = config.get("lora_config").get("lora_alpha"),
    lora_dropout = config.get("lora_config").get("lora_dropout"),
    bias = config.get("lora_config").get("bias"),
    use_gradient_checkpointing = config.get("lora_config").get("use_gradient_checkpointing"),
    random_state = 42,
    use_rslora = config.get("lora_config").get("use_rslora"),
    use_dora = config.get("lora_config").get("use_dora"),
    loftq_config = config.get("lora_config").get("loftq_config"),
)

# Loading the training dataset
dataset_train = load_from_disk(config.get("training_dataset").get("name"))['train']

# Setting up the trainer for the model
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    dataset_text_field = config.get("training_dataset").get("input_field"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dataset_num_proc = 1,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = config.get("training_config").get("per_device_train_batch_size"),
        gradient_accumulation_steps = config.get("training_config").get("gradient_accumulation_steps"),
        warmup_steps = config.get("training_config").get("warmup_steps"),
        max_steps = config.get("training_config").get("max_steps"),
        num_train_epochs = config.get("training_config").get("num_train_epochs"),
        learning_rate = config.get("training_config").get("learning_rate"),
        fp16 = config.get("training_config").get("fp16"),
        bf16 = config.get("training_config").get("bf16"),
        logging_steps = config.get("training_config").get("logging_steps"),
        optim = config.get("training_config").get("optim"),
        weight_decay = config.get("training_config").get("weight_decay"),
        lr_scheduler_type = config.get("training_config").get("lr_scheduler_type"),
        seed = 42,
        output_dir = config.get("training_config").get("output_dir"),
    ),
)

# Training the model
trainer_stats = trainer.train()

# Saving the trainer stats
with open(os.path.join(DATA_HOME, "outputs/trainer_stats_gen.json"), "w") as f:
    json.dump(trainer_stats, f, indent=4)

# Locally saving the model and pushing it to the Hugging Face Hub (only LoRA adapters)
model.save_pretrained(config.get("model_config").get("finetuned_model"))

Can this code be amended to run on the CPU?

danielhanchen commented 2 months ago

You should convert to GGUF for CPU inference - or you can use direct HF inference
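
For the GGUF route, a minimal sketch using unsloth's save_pretrained_gguf helper right after training - the output directory name and the q4_k_m quantization choice here are placeholders, not something from this thread:

    # Export the finetuned model to a llama.cpp-compatible GGUF file for CPU inference.
    # Assumes `model` and `tokenizer` are the objects from the training script above.
    model.save_pretrained_gguf(
        "model_gguf",                    # placeholder output directory
        tokenizer,
        quantization_method = "q4_k_m",  # a common 4-bit llama.cpp quantization
    )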

SidneyLann commented 2 months ago

Hi, any links for reference?

Linguiniotta commented 2 months ago

There are instructions in the wiki for converting to GGUF, but is it possible to fine-tune / train on a TPU or CPU? I get an error when importing unsloth's FastLanguageModel. I maxed out my GPU quota on Kaggle lol.

Installation / Import
# https://github.com/unslothai/unsloth/issues/998
!pip install --quiet pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install --quiet "torch==2.4.0" "xformers==0.0.27.post2" triton torchvision torchaudio
!pip install --quiet "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
%%time
from unsloth import FastLanguageModel
from accelerate import Accelerator
Error
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File :1

File /usr/local/lib/python3.10/site-packages/unsloth/__init__.py:83
     80 pass
     82 # Torch 2.4 has including_emulation
---> 83 major_version, minor_version = torch.cuda.get_device_capability()
     84 SUPPORTS_BFLOAT16 = (major_version >= 8)
     86 old_is_bf16_supported = torch.cuda.is_bf16_supported

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:451, in get_device_capability(device)
    438 def get_device_capability(device: Optional[_device_t] = None) -> Tuple[int, int]:
    439     r"""Get the cuda capability of a device.
    440 
    441     Args:
   (...)
    449         tuple(int, int): the major and minor cuda capability of the device
    450     """
--> 451     prop = get_device_properties(device)
    452     return prop.major, prop.minor

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:465, in get_device_properties(device)
    455 def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
    456     r"""Get the properties of a device.
    457 
    458     Args:
   (...)
    463         _CudaDeviceProperties: the properties of the device
    464     """
--> 465     _lazy_init()  # will define _get_device_properties
    466     device = _get_device_index(device, optional=True)
    467     if device < 0 or device >= device_count():

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:314, in _lazy_init()
    312 if "CUDA_MODULE_LOADING" not in os.environ:
    313     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 314 torch._C._cuda_init()
    315 # Some of the queued calls may reentrantly call _lazy_init();
    316 # we need to just return without initializing in that case.
    317 # However, we must not let any *other* threads in!
    318 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

SidneyLann commented 2 months ago

Where are the instructions for using llama.cpp to load the GGUF and run inference?

Linguiniotta commented 2 months ago

It is in their GitHub README :) https://github.com/ggerganov/llama.cpp#usage
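
If you would rather stay in Python than call the llama.cpp CLI, a minimal sketch using the llama-cpp-python bindings (my assumption - they are not used elsewhere in this thread) could look like this; the GGUF path, context size and thread count are placeholders:

    from llama_cpp import Llama

    # CPU inference on the converted GGUF file.
    llm = Llama(
        model_path = "model_gguf/unsloth.Q4_K_M.gguf",  # placeholder path to your converted file
        n_ctx = 5000,    # match the MAX_SEQ_LENGTH used during finetuning
        n_threads = 8,   # number of CPU threads
    )
    out = llm("does the user input content contain bus?", max_tokens = 256)
    print(out["choices"][0]["text"])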

SidneyLann commented 2 months ago

import os
import sys
import json
import torch
from datasets import load_dataset
from unsloth import FastLanguageModel

INSTRUCTION = "does the user input content contain bus?"
DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000
SEQ_START_IDX = 512

config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),
        "max_seq_length": MAX_SEQ_LENGTH,
        "dtype": torch.float16,
        "load_in_4bit": True,
    }
}

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)

FastLanguageModel.for_inference(model)

dataset_path = sys.argv[1]
dateStr = dataset_path[-8:]
files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path)]
fileCount = 0
genCount = 0
exceptFileName = ''
for fileName in files:
    file_size = os.path.getsize(fileName)
    fileCount = fileCount + 1
    print('fileCount: ', fileCount, genCount, file_size, dateStr, fileName, exceptFileName)
    if file_size < 8192:
        continue
    genCount = genCount + 1

    with open(fileName) as f:
        content = f.read()
        print("content Size is :", len(content))
        if len(content) > MAX_SEQ_LENGTH + SEQ_START_IDX:
            content = content[SEQ_START_IDX:MAX_SEQ_LENGTH + SEQ_START_IDX]
        inputs = tokenizer(
            [
                f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date {dateStr}, {INSTRUCTION}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"""
            ], return_tensors="pt").to("cuda")
        try:
            outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
        except:
            exceptFileName = fileName
            continue
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        print(outputs[0])

How do I amend this code to run inference with llama.cpp using the GGUF file?

Linguiniotta commented 2 months ago

You are still using the unsloth model. Convert it first to gguf THEN infer.

SidneyLann commented 2 months ago

I had converted it, but I don't know how to use llama.cpp the way I use unsloth to do the inference.

danielhanchen commented 1 month ago

@SidneyLann Another option is to use HuggingFace CPU directly after finetuning:

    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

SidneyLann commented 1 month ago

I can't infer with the lora_model that was fine-tuned on the GPU - it still uses the GPU. Must I fine-tune on the CPU to do CPU inference?

danielhanchen commented 1 month ago

@SidneyLann You need to save the LoRA adapter (finetuned by CPU or GPU) then load it on a CPU only machine - it should work!
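
For concreteness, a minimal sketch of the saving half of that workflow, assuming the lora_model directory name from the snippet above (saving the tokenizer next to the adapter is my addition, not something the thread shows):

    # On the training machine: save only the LoRA adapter (plus tokenizer) so the
    # folder can be copied to a CPU-only machine and loaded there with
    # AutoPeftModelForCausalLM.from_pretrained("lora_model", device_map="cpu").
    model.save_pretrained("lora_model")      # LoRA adapter weights + adapter config
    tokenizer.save_pretrained("lora_model")  # tokenizer files alongside the adapter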

SidneyLann commented 1 month ago

My machine has one GPU that is busy with other tasks - can't I do the CPU inference on this machine? Can't I configure it with a flag?

danielhanchen commented 1 month ago

@SidneyLann Yes you can set device_map = "cpu" for example in the loading module to force it to CPU

SidneyLann commented 1 month ago

config = { "model_config": { "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen110"), "max_seq_length": 5000, "dtype": torch.float32, "load_in_4bit": True, "device_map": "cpu", } }

model_name=config.get("model_config").get("finetuned_model") device_map=config.get("model_config").get("device_map") model = AutoPeftModelForCausalLM.from_pretrained( model_name, load_in_4bit=config.get("model_config").get("load_in_4bit"), device_map=device_map, ) tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer( [ f"""<|begin_of_text|>......<|eot_id|>""" ], return_tensors = "pt").to(device_map) outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True) outputs = tokenizer.batch_decode(outputs, skip_special_tokens = True) print(outputs[0])

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
......
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 468, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 566, in matmul_4bit
    assert quant_state is not None
AssertionError

quant_state is None - why doesn't .to(device_map) work?

danielhanchen commented 1 month ago

@SidneyLann Actually you're correct - bitsandbytes only works on GPU :(

Have you considered exporting to GGUF / llama.cpp / Ollama for inference?

Another way is to use load_in_4bit = False
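
Putting that together, a minimal sketch of CPU-only HF inference on the saved adapter - the lora_model path and the prompt are placeholders, and torch_dtype=torch.float32 is my assumption for CPU:

    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Load the LoRA adapter on CPU, without bitsandbytes 4-bit quantization
    # (bitsandbytes 4-bit kernels require a GPU).
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model",                 # placeholder: your saved adapter directory
        load_in_4bit = False,         # no 4-bit on CPU
        device_map = "cpu",           # force all weights onto the CPU
        torch_dtype = torch.float32,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

    inputs = tokenizer(["does the user input content contain bus?"], return_tensors = "pt").to("cpu")
    outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
    print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])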

SidneyLann commented 1 week ago


For the same model, a 1080 Ti GPU took several minutes to infer one example, while an i9 CPU spent 1+ hours waiting on disk I/O (top showed ~50% wa) and was still pending on one example. What's the problem? GPU inference needs 12 GB of VRAM - does CPU inference need 60 GB of RAM?

SidneyLann commented 1 week ago

Llama 3.2 works now.

SidneyLann commented 1 week ago

The i9 CPU takes 30 seconds and the 1080 Ti GPU takes 10 seconds to infer one example - is that normal? I would expect the GPU to be 10+ times faster than the CPU.