
How to run inference with the converted GGUF using llama-cpp? #484

Open mk0223 opened 5 months ago

mk0223 commented 5 months ago

I would appreciate it if anyone could help with the following problem when using the converted GGUF for inference.

I found that inference with llama-cpp generates different results from inference with the saved LoRA adapters, even though both runs use a Q4-quantized model.

For inference with the LoRA adapters, I kept the alpaca_prompt format:

if True: # flip this flag to load the saved LoRA adapters for inference
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        instruct,     # instruction
        description,  # input
        "",           # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 8000, use_cache = True, temperature = 0)
tokenizer.batch_decode(outputs)

For inference with llama-cpp-python, I used its chat completion API, since I didn't find a way to retain the alpaca_prompt format:

from llama_cpp import Llama

llm = Llama(
      model_path=SAVED_PATH,
      n_gpu_layers=-1, # offload all layers to the GPU
      seed=1,          # fix the seed for reproducibility
      n_ctx=2048,      # context window size
      # tokenizer=LlamaHFTokenizer.from_pretrained(SAVED_PATH) # is this necessary???
)
...
output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": instruct},
          {"role": "user","content": description}
      ],
      temperature=0,
      max_tokens=8000
)

Is it necessary to retain the alpaca_prompt format, or to convert the tokenizer from Unsloth to llama-cpp?
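
One thing I considered (rough, untested sketch) is skipping the chat template entirely and passing the same alpaca_prompt string to llama-cpp-python's plain completion API, so the prompt matches training exactly. Here alpaca_prompt, instruct, description and SAVED_PATH are the same variables as above, and the stop string is only a guess:

from llama_cpp import Llama

llm = Llama(model_path=SAVED_PATH, n_gpu_layers=-1, seed=1, n_ctx=2048)

# Reuse the exact alpaca-formatted prompt from training instead of the chat template
prompt = alpaca_prompt.format(instruct, description, "")
output = llm(
    prompt,
    max_tokens=2048,            # generation is bounded by what fits in n_ctx anyway
    temperature=0,
    stop=["### Instruction:"],  # guessed stop string so the model doesn't start a new example
)
print(output["choices"][0]["text"])

That would at least rule out the chat template as the source of the difference, but I am not sure if it is the recommended approach.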

In the llama-cpp-python README (https://github.com/abetlen/llama-cpp-python), it is mentioned that: "Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files." I don't quite understand whether such a discrepancy exists here, since the Unsloth demo notebook doesn't seem to mention it.
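
If that tokenizer discrepancy does matter here, I assume the override would look roughly like this (sketch only; "lora_model" is the directory from the training snippet above that also contains the HF tokenizer files, not the GGUF path):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

llm = Llama(
    model_path=SAVED_PATH,
    n_gpu_layers=-1,
    n_ctx=2048,
    # Override llama.cpp's built-in tokenizer with the HF tokenizer saved next to the adapters
    tokenizer=LlamaHFTokenizer.from_pretrained("lora_model"),
)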

Thanks!

danielhanchen commented 5 months ago

Good idea to use llama-cpp's Python module - I'll make an example