hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Output difference between LLaMA-Factory and llama.cpp #3563

Open anidh opened 4 months ago

anidh commented 4 months ago

Reminder

Reproduction

Hi there, I am observing a difference in output between LLaMA-Factory inference and llama.cpp.

I am trying to convert a fine-tuned microsoft/Phi-3-mini-128k-instruct model which was trained using LoRA. Briefly, these are the steps I followed:

  1. Fine-tune the pre-trained model and obtain the fine-tuned weight files.
  2. Use the merge script in examples/merge_lora with the command: CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py --model_name_or_path microsoft/Phi-3-mini-128k-instruct --adapter_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-instruct-Prompt-Dataset-Equal-Sampled-v1-Sharegpt-700Epochs-Fashion/lora/sft/checkpoint-600/ --template default --finetuning_type lora --export_dir /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --export_size 8 --export_device cuda --export_legacy_format False
  3. This produces a merged model (base model + LoRA adapters). I then run inference on the merged safetensors weights with: CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py --model_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --template default
  4. The output of the above command for a certain prompt is as follows (a deterministic cross-check sketch follows this list):
    tier: None
    gender: None
    location: NY
    generation: genz
    category: None
    product_type: None
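
As a cross-check that does not depend on LLaMA-Factory's generation defaults, the merged export can also be loaded with plain transformers and decoded greedily. This is only a rough sketch: MERGED_DIR and PROMPT are placeholders, and the raw prompt does not go through LLaMA-Factory's default chat template, which is itself a possible source of divergence.

    # Rough sketch (not the LLaMA-Factory code path): load the merged export with
    # plain transformers and decode greedily, so sampling randomness is ruled out.
    # MERGED_DIR and PROMPT are placeholders for the path/prompt used above.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MERGED_DIR = "/path/to/Phi-3-mini-128k-Fashion-Full-New"  # placeholder
    PROMPT = "..."  # the same prompt as above, formatted by hand

    tokenizer = AutoTokenizer.from_pretrained(MERGED_DIR, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MERGED_DIR, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
    )

    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))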

Now, I want to run the above model in the Ollama framework. To do that, I need to convert the merged model to GGUF format. I follow the steps below:

  1. I use the convert-hf-to-gguf.py script in llama.cpp to convert the merged weights into a GGUF file. The command is: python convert-hf-to-gguf.py /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New --outfile /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf
  2. Then, using the generated GGUF file, I perform inference with llama.cpp using the command: ./main -m /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf -ins --interactive-first --temp 0.01 -c 6000 --top-k 50 --top-p 0.7 -n 1024
  3. When I pass the same prompt as in the case above, I get the following output:
    {
    "tier": "Nano",
    "gender": "None",
    "location": "NY",
    "category": "party wear",
    "generation": "genz"
    }

    Comparing this to the output above, there is a clear difference between the two models. I have tried this multiple times, but the two outputs always differ (a llama.cpp-side cross-check sketch follows this list).
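
To compare the GGUF path under the same conditions, the converted file can also be run through llama-cpp-python (assuming it is installed) with greedy-style settings, so its output can be set against the transformers reference above. Again only a rough sketch: GGUF_PATH and PROMPT are placeholders, and the raw prompt bypasses any chat template.

    # Rough sketch: run the converted GGUF file via llama-cpp-python with
    # greedy-style settings, so its output can be compared with the
    # transformers reference above. GGUF_PATH and PROMPT are placeholders.
    from llama_cpp import Llama

    GGUF_PATH = "/path/to/Phi-3-mini-128k-Fashion-Full-New.gguf"  # placeholder
    PROMPT = "..."  # the same prompt as above

    llm = Llama(model_path=GGUF_PATH, n_ctx=6000)
    out = llm(
        PROMPT,
        max_tokens=256,
        temperature=0.0,  # effectively removes sampling randomness
        top_k=1,
        top_p=1.0,
    )
    print(out["choices"][0]["text"])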

I have also tried the mistral.rs inference framework, which uses the safetensors files directly (no GGUF conversion needed, so it is directly comparable to LLaMA-Factory). Its output is also similar to the llama.cpp output, which makes me believe that either I am missing something when running inference in the other two frameworks, or LLaMA-Factory is using some file or parameters that the other frameworks are not.

My initial suspicion is that this is caused by a difference in inference parameters. I ran python ../../src/cli_demo.py --help to see all the parameters and values that are available, but there are so many of them and no indication of which ones are actually used during inference.
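
One concrete thing to check is whether the exported directory ships a generation_config.json, since transformers-based inference reads its defaults unless they are overridden. Below is a small sketch (assuming that file exists, with MERGED_DIR as a placeholder) to print those values and compare them with the --temp/--top-k/--top-p flags passed to llama.cpp.

    # Sketch: print the generation defaults exported alongside the merged model,
    # assuming a generation_config.json exists in the export directory.
    # MERGED_DIR is a placeholder for the export path used above.
    from transformers import GenerationConfig

    MERGED_DIR = "/path/to/Phi-3-mini-128k-Fashion-Full-New"  # placeholder

    gen_config = GenerationConfig.from_pretrained(MERGED_DIR)
    print(gen_config)  # e.g. do_sample, temperature, top_p, top_k, max_new_tokens, ...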

Can someone please help me figure out how to resolve this issue?

Expected behavior

The expected behaviour is that the output from LLaMA-Factory inference matches the output from the llama.cpp and mistral.rs frameworks.

System Info

The following is from the llama_factory environment:

The following is from the llama.cpp environment:

Others

No response

anidh commented 4 months ago

Hi @hiyouga, if there is a way to see only the inference-time parameters, that would also be a good starting point for me.

richard28039 commented 2 months ago

Hello, I am doing the same thing and ran into the same issue. Is there any solution? Thanks.