unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Fine tune and infer llama3 with cpu #1037

Open SidneyLann opened 2 months ago

SidneyLann commented 2 months ago

import logging
import os
import json
import torch
from datasets import load_from_disk
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000

# Defining the configuration for the base model, LoRA and training
config = {
    "hugging_face_username": "Shekswess",
    "model_config": {
        "base_model": os.path.join(DATA_HOME, "model_root/model_en"),                          # The base model
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),  # The fine-tuned model
        "max_seq_length": MAX_SEQ_LENGTH,  # The maximum sequence length
        "dtype": torch.float16,            # The data type
        "load_in_4bit": True,              # Load the model in 4-bit
    },
    "lora_config": {
        "r": 16,                             # The LoRA rank: 8, 16, 32, 64
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],  # The target modules
        "lora_alpha": 16,                    # The alpha value for LoRA
        "lora_dropout": 0,                   # The dropout value for LoRA
        "bias": "none",                      # The bias for LoRA
        "use_gradient_checkpointing": True,  # Use gradient checkpointing
        "use_rslora": False,                 # Use RSLoRA
        "use_dora": False,                   # Use DoRA
        "loftq_config": None,                # The LoftQ configuration
    },
    "training_dataset": {
        "name": os.path.join(DATA_HOME, "dataset_gen"),  # The dataset name (huggingface/datasets)
        "split": "train",          # The dataset split
        "input_field": "prompt",   # The input field
    },
    "training_config": {
        "per_device_train_batch_size": 1,  # The batch size
        "gradient_accumulation_steps": 1,  # The gradient accumulation steps
        "warmup_steps": 5,                 # The warmup steps
        "max_steps": 0,                    # The maximum steps (0 if the epochs are defined)
        "num_train_epochs": 1,             # The number of training epochs (0 if the maximum steps are defined)
        "learning_rate": 2e-4,             # The learning rate
        "fp16": not torch.cuda.is_bf16_supported(),  # fp16
        "bf16": torch.cuda.is_bf16_supported(),      # bf16
        "logging_steps": 1,                # The logging steps
        "optim": "adamw_8bit",             # The optimizer
        "weight_decay": 0.01,              # The weight decay
        "lr_scheduler_type": "linear",     # The learning rate scheduler
        "seed": 42,                        # The seed
        "output_dir": "outputs",           # The output directory
    },
}

# Loading the model and the tokenizer for the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = config.get("model_config").get("base_model"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dtype = config.get("model_config").get("dtype"),
    load_in_4bit = config.get("model_config").get("load_in_4bit"),
)

# Setup for QLoRA/LoRA PEFT of the base model
model = FastLanguageModel.get_peft_model(
    model,
    r = config.get("lora_config").get("r"),
    target_modules = config.get("lora_config").get("target_modules"),
    lora_alpha = config.get("lora_config").get("lora_alpha"),
    lora_dropout = config.get("lora_config").get("lora_dropout"),
    bias = config.get("lora_config").get("bias"),
    use_gradient_checkpointing = config.get("lora_config").get("use_gradient_checkpointing"),
    random_state = 42,
    use_rslora = config.get("lora_config").get("use_rslora"),
    use_dora = config.get("lora_config").get("use_dora"),
    loftq_config = config.get("lora_config").get("loftq_config"),
)

# Loading the training dataset
dataset_train = load_from_disk(config.get("training_dataset").get("name"))['train']

# Setting up the trainer for the model
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    dataset_text_field = config.get("training_dataset").get("input_field"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dataset_num_proc = 1,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = config.get("training_config").get("per_device_train_batch_size"),
        gradient_accumulation_steps = config.get("training_config").get("gradient_accumulation_steps"),
        warmup_steps = config.get("training_config").get("warmup_steps"),
        max_steps = config.get("training_config").get("max_steps"),
        num_train_epochs = config.get("training_config").get("num_train_epochs"),
        learning_rate = config.get("training_config").get("learning_rate"),
        fp16 = config.get("training_config").get("fp16"),
        bf16 = config.get("training_config").get("bf16"),
        logging_steps = config.get("training_config").get("logging_steps"),
        optim = config.get("training_config").get("optim"),
        weight_decay = config.get("training_config").get("weight_decay"),
        lr_scheduler_type = config.get("training_config").get("lr_scheduler_type"),
        seed = 42,
        output_dir = config.get("training_config").get("output_dir"),
    ),
)

# Training the model
trainer_stats = trainer.train()

# Saving the trainer stats
with open(os.path.join(DATA_HOME, "outputs/trainer_stats_gen.json"), "w") as f:
    json.dump(trainer_stats, f, indent=4)

# Locally saving the model and pushing it to the Hugging Face Hub (only LoRA adapters)
model.save_pretrained(config.get("model_config").get("finetuned_model"))

Can this code be amended to run on the CPU?

danielhanchen commented 2 months ago

You should convert to GGUF for CPU inference - or you can use direct HF inference
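
For the GGUF route, a minimal sketch using unsloth's save_pretrained_gguf helper right after training - the output directory name and the q4_k_m quantization choice here are placeholders, not something from this thread:

    # Export the finetuned model to a llama.cpp-compatible GGUF file for CPU inference.
    # Assumes `model` and `tokenizer` are the objects from the training script above.
    model.save_pretrained_gguf(
        "model_gguf",                    # placeholder output directory
        tokenizer,
        quantization_method = "q4_k_m",  # a common 4-bit llama.cpp quantization
    )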

SidneyLann commented 2 months ago

Hi, any links for reference?

Linguiniotta commented 2 months ago

There are instructions in the wiki for converting to GGUF, but is it possible to fine-tune / train on a TPU or CPU? I get an error when importing unsloth's FastLanguageModel. I maxed out my GPU quota on Kaggle lol.

Installation / Import
# https://github.com/unslothai/unsloth/issues/998
!pip install --quiet pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install --quiet "torch==2.4.0" "xformers==0.0.27.post2" triton torchvision torchaudio
!pip install --quiet "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
%%time
from unsloth import FastLanguageModel
from accelerate import Accelerator
Error
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File :1

File /usr/local/lib/python3.10/site-packages/unsloth/__init__.py:83
     80 pass
     82 # Torch 2.4 has including_emulation
---> 83 major_version, minor_version = torch.cuda.get_device_capability()
     84 SUPPORTS_BFLOAT16 = (major_version >= 8)
     86 old_is_bf16_supported = torch.cuda.is_bf16_supported

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:451, in get_device_capability(device)
    438 def get_device_capability(device: Optional[_device_t] = None) -> Tuple[int, int]:
    439     r"""Get the cuda capability of a device.
    440 
    441     Args:
   (...)
    449         tuple(int, int): the major and minor cuda capability of the device
    450     """
--> 451     prop = get_device_properties(device)
    452     return prop.major, prop.minor

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:465, in get_device_properties(device)
    455 def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
    456     r"""Get the properties of a device.
    457 
    458     Args:
   (...)
    463         _CudaDeviceProperties: the properties of the device
    464     """
--> 465     _lazy_init()  # will define _get_device_properties
    466     device = _get_device_index(device, optional=True)
    467     if device < 0 or device >= device_count():

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:314, in _lazy_init()
    312 if "CUDA_MODULE_LOADING" not in os.environ:
    313     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 314 torch._C._cuda_init()
    315 # Some of the queued calls may reentrantly call _lazy_init();
    316 # we need to just return without initializing in that case.
    317 # However, we must not let any *other* threads in!
    318 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

SidneyLann commented 2 months ago

Where are the instructions for using llama.cpp to load the GGUF and run inference?

Linguiniotta commented 2 months ago

It is in their GitHub README :) https://github.com/ggerganov/llama.cpp#usage
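
If you would rather stay in Python than call the llama.cpp CLI, a minimal sketch using the llama-cpp-python bindings (my assumption - they are not used elsewhere in this thread) could look like this; the GGUF path, context size and thread count are placeholders:

    from llama_cpp import Llama

    # CPU inference on the converted GGUF file.
    llm = Llama(
        model_path = "model_gguf/unsloth.Q4_K_M.gguf",  # placeholder path to your converted file
        n_ctx = 5000,    # match the MAX_SEQ_LENGTH used during finetuning
        n_threads = 8,   # number of CPU threads
    )
    out = llm("does the user input content contain bus?", max_tokens = 256)
    print(out["choices"][0]["text"])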

SidneyLann commented 2 months ago

import os
import sys
import json
import torch
from datasets import load_dataset
from unsloth import FastLanguageModel

INSTRUCTION = "does the user input content contain bus?"
DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000
SEQ_START_IDX = 512

config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),
        "max_seq_length": MAX_SEQ_LENGTH,
        "dtype": torch.float16,
        "load_in_4bit": True,
    }
}

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)

FastLanguageModel.for_inference(model)

dataset_path = sys.argv[1]
dateStr = dataset_path[-8:]
files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path)]
fileCount = 0
genCount = 0
exceptFileName = ''
for fileName in files:
    file_size = os.path.getsize(fileName)
    fileCount = fileCount + 1
    print('fileCount: ', fileCount, genCount, file_size, dateStr, fileName, exceptFileName)
    if file_size < 8192:
        continue
    genCount = genCount + 1

    with open(fileName) as f:
        content = f.read()
        print("content Size is :", len(content))
        if len(content) > MAX_SEQ_LENGTH + SEQ_START_IDX:
            content = content[SEQ_START_IDX:MAX_SEQ_LENGTH + SEQ_START_IDX]
        inputs = tokenizer(
            [
                f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date {dateStr}, {INSTRUCTION}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"""
            ], return_tensors="pt").to("cuda")
        try:
            outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
        except:
            exceptFileName = fileName
            continue
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        print(outputs[0])

How do I amend this code to run inference with llama.cpp using the GGUF file?

Linguiniotta commented 2 months ago

You are still using the unsloth model. Convert it first to gguf THEN infer.

SidneyLann commented 2 months ago

I had converted it, but I don't know how to use llama.cpp the way I use unsloth to do the inference.

danielhanchen commented 1 month ago

@SidneyLann Another option is to use HuggingFace CPU directly after finetuning:

    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

SidneyLann commented 1 month ago

I can't infer with the lora_model that was fine-tuned on the GPU - it still uses the GPU. Must I fine-tune on the CPU to do CPU inference?

danielhanchen commented 1 month ago

@SidneyLann You need to save the LoRA adapter (finetuned by CPU or GPU) then load it on a CPU only machine - it should work!
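
For concreteness, a minimal sketch of the saving half of that workflow, assuming the lora_model directory name from the snippet above (saving the tokenizer next to the adapter is my addition, not something the thread shows):

    # On the training machine: save only the LoRA adapter (plus tokenizer) so the
    # folder can be copied to a CPU-only machine and loaded there with
    # AutoPeftModelForCausalLM.from_pretrained("lora_model", device_map="cpu").
    model.save_pretrained("lora_model")      # LoRA adapter weights + adapter config
    tokenizer.save_pretrained("lora_model")  # tokenizer files alongside the adapter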

SidneyLann commented 1 month ago

My machine has one GPU that is busy with other tasks - can't I do the CPU inference on this machine? Can't I configure it with a flag?

danielhanchen commented 1 month ago

@SidneyLann Yes you can set device_map = "cpu" for example in the loading module to force it to CPU

SidneyLann commented 1 month ago

config = { "model_config": { "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen110"), "max_seq_length": 5000, "dtype": torch.float32, "load_in_4bit": True, "device_map": "cpu", } }

model_name=config.get("model_config").get("finetuned_model") device_map=config.get("model_config").get("device_map") model = AutoPeftModelForCausalLM.from_pretrained( model_name, load_in_4bit=config.get("model_config").get("load_in_4bit"), device_map=device_map, ) tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer( [ f"""<|begin_of_text|>......<|eot_id|>""" ], return_tensors = "pt").to(device_map) outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True) outputs = tokenizer.batch_decode(outputs, skip_special_tokens = True) print(outputs[0])

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
......
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 468, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 566, in matmul_4bit
    assert quant_state is not None
AssertionError

quant_state is None - why doesn't .to(device_map) work?

danielhanchen commented 1 month ago

@SidneyLann Actually you're correct - bitsandbytes only works on GPU :(

Have you considered exporting to GGUF / llama.cpp / Ollama for inference?

Another way is to use load_in_4bit = False
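
Putting that together, a minimal sketch of CPU-only HF inference on the saved adapter - the lora_model path and the prompt are placeholders, and torch_dtype=torch.float32 is my assumption for CPU:

    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Load the LoRA adapter on CPU, without bitsandbytes 4-bit quantization
    # (bitsandbytes 4-bit kernels require a GPU).
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model",                 # placeholder: your saved adapter directory
        load_in_4bit = False,         # no 4-bit on CPU
        device_map = "cpu",           # force all weights onto the CPU
        torch_dtype = torch.float32,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

    inputs = tokenizer(["does the user input content contain bus?"], return_tensors = "pt").to("cpu")
    outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
    print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])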

SidneyLann commented 1 week ago


For the same model, a 1080 Ti GPU took several minutes to infer one example, while an i9 CPU spent 1+ hours waiting on disk I/O (top showed ~50% wa) and was still pending on one example. What's the problem? GPU inference needs 12 GB of VRAM - does CPU inference need 60 GB of RAM?

SidneyLann commented 1 week ago

Llama 3.2 works now.

SidneyLann commented 1 week ago

The i9 CPU takes 30 seconds and the 1080 Ti GPU takes 10 seconds to infer one example - is that normal? I would expect the GPU to be 10+ times faster than the CPU.