code-ksu opened 4 months ago
Oh for inference on CPU only, please use transformers directly - sadly we don't support CPU
Thank you for your answer. I already feared that would be the case. Is it possible to convert the model I already trained with unsloth to transformers? Or is there a way to import the checkpoints into a compatible transformers model?
@code-ksu I believe the model can be loaded directly into Transformers. Moreover, I don't know your use case, but converting to GGUF (llama.cpp) may also help for CPU inference.
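For the GGUF route, unsloth can export the finetuned model directly. A minimal sketch, assuming model and tokenizer are the objects from your unsloth training run (the output directory and quantization method are just examples):

# Export the merged model as GGUF so llama.cpp can run it on the CPU.
# "q4_k_m" is a common 4-bit quantization; the unsloth docs list other methods.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")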
@code-ksu have you been able to run your model on CPU?
Ye, use llama.cpp / GGUF for CPU inference
Hi, could you please provide some code snippets for using llama.cpp? I trained on GPU using unsloth and downloaded the LoRA model weights.
Now I want to run inference on CPU. How can I do it? (I am new to this.)
Please refer to the llama.cpp repo. They have excellent documentation with loads of examples.
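If you'd rather stay in Python than use the llama.cpp CLI, the llama-cpp-python bindings wrap llama.cpp. A minimal sketch, assuming you have already exported a GGUF file as above (the path below is illustrative):

from llama_cpp import Llama

# llama.cpp runs the GGUF model entirely on the CPU.
llm = Llama(model_path = "gguf_model/model-Q4_K_M.gguf", n_ctx = 2048)

output = llm("Why is the sky blue?", max_tokens = 128)
print(output["choices"][0]["text"])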
Another option, after a finetune, is to run inference on the CPU with native transformers:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model plus the LoRA adapter saved by your finetune.
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",          # the directory you saved after training
    load_in_4bit = False,  # bitsandbytes 4-bit needs a GPU; use full precision on CPU
    device_map = "cpu",
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")
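Once it's loaded, generation is plain transformers; a minimal sketch (the prompt is just an illustration):

inputs = tokenizer("Hello, my name is", return_tensors = "pt")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))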
Hello,
I have fine-tuned a Llama 3 model and now I would love to use it on a CPU. I tried to use
device_map = 'cpu'
when loading the model. However, I am still encountering CUDA issues. After taking a deeper look into the code, I've noticed that many parts are hardwired to use CUDA: https://github.com/search?q=repo%3Aunslothai%2Funsloth+cuda&type=code
Could you provide any tips on how to use my fine-tuned model on the CPU, or let me know if it's not possible?
Thank you!