unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Unable to use fine-tuned Llama 3 model on CPU #477

Open code-ksu opened 4 months ago

code-ksu commented 4 months ago

Hello,

I have fine-tuned a Llama 3 model and now I would love to use it on a CPU. I tried to use device_map = 'cpu' when loading the model. However, I am still encountering CUDA issues such as

RuntimeError: CUDA error: an illegal memory access was encountered or my kernel crashing.

After taking a deeper look into the code, I've noticed that many parts are hardwired to use CUDA: https://github.com/search?q=repo%3Aunslothai%2Funsloth+cuda&type=code

Could you provide any tips on how to use my fine-tuned model on the CPU, or let me know if it's not possible?

Thank you!

danielhanchen commented 4 months ago

Oh for inference on CPU only, please use transformers directly - sadly we don't support CPU
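One way to do that, sketched under the assumption that the LoRA adapter is first merged into the base weights with Unsloth's save_pretrained_merged helper on the training machine (the directory names below are placeholders, not anything Unsloth produces by default):

# Sketch only: merge the adapter into 16-bit weights with Unsloth (on the GPU machine),
# then load the result with plain transformers on a CPU-only machine.
# "merged_model" is a placeholder output directory.

# --- on the training (GPU) machine ---
# model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")

# --- on the CPU-only machine ---
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    torch_dtype = torch.float32,  # CPU inference typically runs in fp32
    device_map = "cpu",
)
tokenizer = AutoTokenizer.from_pretrained("merged_model")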

code-ksu commented 4 months ago

Thank you for your answer. I already feared that would be the case. I was wondering whether it is possible to convert the model I already trained with Unsloth into a format that plain transformers can load? Or is there a way to import the checkpoints into a compatible transformers model?

erwe324 commented 4 months ago

@code-ksu I believe the model can be loaded directly into Transformers. Moreover, I don't know your use case, but converting to GGUF (llama.cpp) may also help for CPU inference.
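If you take the GGUF route, Unsloth can export directly after training. A minimal sketch, assuming the save_pretrained_gguf helper and using q4_k_m purely as an example quantization:

# Sketch: export the fine-tuned model to GGUF for llama.cpp (run on the training machine).
# "gguf_model" is a placeholder directory; q4_k_m is only one of several quantization options.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")

The resulting .gguf file can then be loaded by llama.cpp on a plain CPU.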

dstarcev commented 2 months ago

@code-ksu have you been able to run your model on CPU?

danielhanchen commented 2 months ago

Ye use llama.cpp / GGUF for CPU inference

ApurvPujari commented 3 weeks ago

> Ye use llama.cpp / GGUF for CPU inference

Hi, could you please provide some code snippets for using llama.cpp? I have trained on a GPU using Unsloth and downloaded the LoRA model weights.

Now I want to run inference on a CPU. How can I do that? (I am new to this.)

erwe324 commented 3 weeks ago

Please refer to the llama.cpp repo. They have excellent documentation with loads of examples.
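As a concrete starting point, one way to drive a GGUF file from Python on CPU is the llama-cpp-python package (a separate, pip-installable wrapper around llama.cpp). A rough sketch with placeholder paths and settings:

# Sketch using llama-cpp-python (pip install llama-cpp-python), which wraps llama.cpp.
# The model path, prompt, and generation settings below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path = "gguf_model/model-q4_k_m.gguf",  # your exported GGUF file
    n_ctx = 2048,    # context window
    n_threads = 8,   # CPU threads to use
)

out = llm("### Instruction:\nSay hello.\n\n### Response:\n", max_tokens = 64)
print(out["choices"][0]["text"])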

danielhanchen commented 2 weeks ago

Another option, after a finetune, is to run inference on the CPU with native transformers:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

load_in_4bit = False  # bitsandbytes 4-bit loading generally needs a CUDA GPU, so keep this False on CPU

model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",        # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = load_in_4bit,
    device_map = "cpu",  # force CPU placement
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

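Once the model and tokenizer are loaded this way, generation on CPU is the standard transformers call. A small sketch with an arbitrary prompt:

# Sketch: plain transformers generation on CPU with the model loaded above.
inputs = tokenizer("Write a short poem about sloths.", return_tensors = "pt")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))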