SidneyLann opened this issue 2 months ago
You should convert to GGUF for CPU inference - or you can use direct HF inference
Hi, any links for reference?
There are instructions in the wiki for converting to GGUF, but is it possible to fine-tune / train on a TPU or CPU? I get an error when importing unsloth's FastLanguageModel. I maxed out my GPU quota on Kaggle lol.
# https://github.com/unslothai/unsloth/issues/998
!pip install --quiet pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install --quiet "torch==2.4.0" "xformers==0.0.27.post2" triton torchvision torchaudio
!pip install --quiet "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
%%time
from unsloth import FastLanguageModel
from accelerate import Accelerator
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File:1

File /usr/local/lib/python3.10/site-packages/unsloth/__init__.py:83
     80     pass
     82 # Torch 2.4 has including_emulation
---> 83 major_version, minor_version = torch.cuda.get_device_capability()
     84 SUPPORTS_BFLOAT16 = (major_version >= 8)
     86 old_is_bf16_supported = torch.cuda.is_bf16_supported

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:451, in get_device_capability(device)
    438 def get_device_capability(device: Optional[_device_t] = None) -> Tuple[int, int]:
    439     r"""Get the cuda capability of a device.
    440
    441     Args:
   (...)
    449         tuple(int, int): the major and minor cuda capability of the device
    450     """
--> 451 prop = get_device_properties(device)
    452 return prop.major, prop.minor

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:465, in get_device_properties(device)
    455 def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
    456     r"""Get the properties of a device.
    457
    458     Args:
   (...)
    463         _CudaDeviceProperties: the properties of the device
    464     """
--> 465 _lazy_init()  # will define _get_device_properties
    466 device = _get_device_index(device, optional=True)
    467 if device < 0 or device >= device_count():

File /usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:314, in _lazy_init()
    312 if "CUDA_MODULE_LOADING" not in os.environ:
    313     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 314 torch._C._cuda_init()
    315 # Some of the queued calls may reentrantly call _lazy_init();
    316 # we need to just return without initializing in that case.
    317 # However, we must not let any *other* threads in!
    318 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
Where are the instructions for using llama.cpp to load the GGUF and run inference?
It is in their GH :) https://github.com/ggerganov/llama.cpp#usage
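If it helps, here is a minimal sketch of running a GGUF model from Python, assuming the fine-tuned model has already been exported to a file such as model_gen.gguf and that the llama-cpp-python bindings are installed; the file name, prompt and parameters below are illustrative, not from this thread:

# Minimal sketch: GGUF inference via the llama-cpp-python bindings
# (pip install llama-cpp-python). Path, prompt and settings are examples only.
from llama_cpp import Llama

llm = Llama(
    model_path="model_gen.gguf",  # hypothetical path to the exported GGUF file
    n_ctx=5000,                   # context length, matching MAX_SEQ_LENGTH used later
)

out = llm(
    "Does the user input content contain bus?\n\nInput: ...",  # illustrative prompt
    max_tokens=256,
)
print(out["choices"][0]["text"])

The llama.cpp CLI described at the link above works the same way: point it at the .gguf file and pass a prompt.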
import os
import sys
import json

import torch
from datasets import load_dataset
from unsloth import FastLanguageModel
INSTRUCTION = "does the user input content contain bus?"
DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000
SEQ_START_IDX = 512
config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),
        "max_seq_length": MAX_SEQ_LENGTH,
        "dtype": torch.float16,
        "load_in_4bit": True,
    }
}
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)
FastLanguageModel.for_inference(model)

dataset_path = sys.argv[1]
dateStr = dataset_path[-8:]
files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path)]
fileCount = 0
genCount = 0
exceptFileName = ''
for fileName in files:
    file_size = os.path.getsize(fileName)
    fileCount = fileCount + 1
    print('fileCount: ', fileCount, genCount, file_size, dateStr, fileName, exceptFileName)
    if file_size < 8192:
        continue
    genCount = genCount + 1
    with open(fileName) as f:
        content = f.read()
    print("content Size is :", len(content))
    if len(content) > MAX_SEQ_LENGTH + SEQ_START_IDX:
        content = content[SEQ_START_IDX:MAX_SEQ_LENGTH + SEQ_START_IDX]
    inputs = tokenizer(
        [
            f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date {dateStr}, {INSTRUCTION}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"""
        ], return_tensors="pt").to("cuda")
    try:
        outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
    except:
        exceptFileName = fileName
        continue
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(outputs[0])
You are still using the Unsloth model. Convert it to GGUF first, THEN infer.
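A rough sketch of that export step, based on the Unsloth wiki; the output directory name and quantization method here are just example choices:

# Merge the LoRA adapter into the base weights and export to GGUF.
# "model_gguf" and "q4_k_m" are illustrative; see the Unsloth wiki for the options.
model.save_pretrained_gguf(
    "model_gguf",                    # output directory (example name)
    tokenizer,
    quantization_method = "q4_k_m",  # a common 4-bit GGUF quantization
)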
I had converted it, but I don't know how to use llama.cpp to do the inference the way Unsloth does it.
@SidneyLann Another option is to use HuggingFace CPU directly after finetuning:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",          # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = True,   # the same value you used during training
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")
I can't run inference with the lora_model that was fine-tuned on GPU; it still uses the GPU. Do I have to fine-tune on CPU in order to do CPU inference?
@SidneyLann You need to save the LoRA adapter (finetuned by CPU or GPU) then load it on a CPU only machine - it should work!
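A minimal sketch of that flow; the directory name lora_model is illustrative:

# On the training machine (GPU or CPU): save only the LoRA adapter and tokenizer.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# Copy the "lora_model" directory to the CPU-only machine and load it there with
# AutoPeftModelForCausalLM.from_pretrained("lora_model"), as in the snippet above.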
My machine has one GPU, but it is busy with other tasks. Can't I do CPU inference on this same machine? Is there a flag to configure that?
@SidneyLann Yes, you can set device_map = "cpu" in the loading call, for example, to force it onto the CPU.
config = {
    "model_config": {
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen110"),
        "max_seq_length": 5000,
        "dtype": torch.float32,
        "load_in_4bit": True,
        "device_map": "cpu",
    }
}
model_name = config.get("model_config").get("finetuned_model")
device_map = config.get("model_config").get("device_map")
model = AutoPeftModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(
    [
        f"""<|begin_of_text|>......<|eot_id|>"""
    ], return_tensors = "pt").to(device_map)
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens = True)
print(outputs[0])
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
......
  File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 468, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/prg/python/3102/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 566, in matmul_4bit
    assert quant_state is not None
AssertionError
quant_state is None. Why doesn't .to(device_map) work?
@SidneyLann Actually you're correct - bitsandbytes only works on GPU :(
Have you considered exporting to GGUF / llama.cpp / Ollama for inference?
Another way is to use load_in_4bit = False
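Putting those pieces together, a sketch of the CPU-only path; the adapter directory and the prompt are placeholders:

# Sketch of CPU-only inference: no 4-bit loading (bitsandbytes 4-bit kernels
# need a GPU), fp32 weights, and everything pinned to the CPU.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",            # your saved adapter directory (illustrative name)
    load_in_4bit = False,    # avoid bitsandbytes quantization on CPU
    torch_dtype = torch.float32,
    device_map = "cpu",
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

inputs = tokenizer(["<your prompt here>"], return_tensors = "pt").to("cpu")
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])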
For the same model, the GPU (1080 Ti) took several minutes and inferred one example, while the CPU (Core i9) spent over an hour waiting on disk I/O (top shows ~50% wa) and was still pending on one example. What's the problem? GPU inference needs 12 GB of VRAM; does CPU inference really need 60 GB of RAM?
Llama 3.2 works now.
The Core i9 CPU takes 30 seconds and the 1080 Ti GPU takes 10 seconds to infer one example. Is that normal? I thought the GPU should be 10+ times faster than the CPU.
import logging
import os
import json

import torch
from datasets import load_from_disk
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
DATA_HOME = "/home/sidney/app"
MAX_SEQ_LENGTH = 5000
Defining the configuration for the base model, LoRA and training
config = {
    "hugging_face_username": "Shekswess",
    "model_config": {
        "base_model": os.path.join(DATA_HOME, "model_root/model_en"),  # The base model
        "finetuned_model": os.path.join(DATA_HOME, "model_root/model_en/adapters/model_gen"),  # The fine-tuned model
        "max_seq_length": MAX_SEQ_LENGTH,  # The maximum sequence length
        "dtype": torch.float16,  # The data type
        "load_in_4bit": True,  # Load the model in 4-bit
    },
    "lora_config": {
        "r": 16,  # The LoRA rank: 8, 16, 32, 64
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],  # The target modules
        "lora_alpha": 16,  # The alpha value for LoRA
        "lora_dropout": 0,  # The dropout value for LoRA
        "bias": "none",  # The bias for LoRA
        "use_gradient_checkpointing": True,  # Use gradient checkpointing
        "use_rslora": False,  # Use RSLoRA
        "use_dora": False,  # Use DoRA
        "loftq_config": None,  # The LoftQ configuration
    },
    "training_dataset": {
        "name": os.path.join(DATA_HOME, "dataset_gen"),  # The dataset name (huggingface/datasets)
        "split": "train",  # The dataset split
        "input_field": "prompt",  # The input field
    },
    "training_config": {
        "per_device_train_batch_size": 1,  # The batch size
        "gradient_accumulation_steps": 1,  # The gradient accumulation steps
        "warmup_steps": 5,  # The warmup steps
        "max_steps": 0,  # The maximum steps (0 if the epochs are defined)
        "num_train_epochs": 1,  # The number of training epochs (0 if the maximum steps are defined)
        "learning_rate": 2e-4,  # The learning rate
        "fp16": not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 is unsupported
        "bf16": torch.cuda.is_bf16_supported(),  # Use bf16 if supported
        "logging_steps": 1,  # The logging steps
        "optim": "adamw_8bit",  # The optimizer
        "weight_decay": 0.01,  # The weight decay
        "lr_scheduler_type": "linear",  # The learning rate scheduler
        "seed": 42,  # The seed
        "output_dir": "outputs",  # The output directory
    },
}
Loading the model and the tokenizer for the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = config.get("model_config").get("base_model"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dtype = config.get("model_config").get("dtype"),
    load_in_4bit = config.get("model_config").get("load_in_4bit"),
)
Setup for QLoRA/LoRA peft of the base model
model = FastLanguageModel.get_peft_model(
    model,
    r = config.get("lora_config").get("r"),
    target_modules = config.get("lora_config").get("target_modules"),
    lora_alpha = config.get("lora_config").get("lora_alpha"),
    lora_dropout = config.get("lora_config").get("lora_dropout"),
    bias = config.get("lora_config").get("bias"),
    use_gradient_checkpointing = config.get("lora_config").get("use_gradient_checkpointing"),
    random_state = 42,
    use_rslora = config.get("lora_config").get("use_rslora"),
    use_dora = config.get("lora_config").get("use_dora"),
    loftq_config = config.get("lora_config").get("loftq_config"),
)
Loading the training dataset
dataset_train = load_from_disk(config.get("training_dataset").get("name"))['train']
Setting up the trainer for the model
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    dataset_text_field = config.get("training_dataset").get("input_field"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dataset_num_proc = 1,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = config.get("training_config").get("per_device_train_batch_size"),
        gradient_accumulation_steps = config.get("training_config").get("gradient_accumulation_steps"),
        warmup_steps = config.get("training_config").get("warmup_steps"),
        max_steps = config.get("training_config").get("max_steps"),
        num_train_epochs = config.get("training_config").get("num_train_epochs"),
        learning_rate = config.get("training_config").get("learning_rate"),
        fp16 = config.get("training_config").get("fp16"),
        bf16 = config.get("training_config").get("bf16"),
        logging_steps = config.get("training_config").get("logging_steps"),
        optim = config.get("training_config").get("optim"),
        weight_decay = config.get("training_config").get("weight_decay"),
        lr_scheduler_type = config.get("training_config").get("lr_scheduler_type"),
        seed = 42,
        output_dir = config.get("training_config").get("output_dir"),
    ),
)
Training the model
trainer_stats = trainer.train()
Saving the trainer stats
with open(os.path.join(DATA_HOME, "outputs/trainer_stats_gen.json"), "w") as f:
    json.dump(trainer_stats, f, indent=4)
Locally saving the model and pushing it to the Hugging Face Hub (only LoRA adapters)
model.save_pretrained(config.get("model_config").get("finetuned_model"))
Can this code be amended to train on the CPU?