hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Output difference between LLaMA-Factory and llama.cpp #3563

Open anidh opened 4 months ago

anidh commented 4 months ago

Reminder

Reproduction

Hi there, I am observing a difference in output between LLaMA-Factory inference and llama.cpp.

I am trying to convert a fine-tuned microsoft/Phi-3-mini-128k-instruct model which was trained using LoRA. Briefly, these are the steps I followed:

  1. Fine-tune the pre-trained model and obtain the fine-tuned weight files.
  2. Use the merge script in examples/merge_lora with the command: CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py --model_name_or_path microsoft/Phi-3-mini-128k-instruct --adapter_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-instruct-Prompt-Dataset-Equal-Sampled-v1-Sharegpt-700Epochs-Fashion/lora/sft/checkpoint-600/ --template default --finetuning_type lora --export_dir /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --export_size 8 --export_device cuda --export_legacy_format False
  3. This produces a merged model (base model + LoRA adapters). I then run inference on the merged safetensors weights with: CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py --model_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --template default
  4. The output of the above command for a certain prompt is as follows (a deterministic cross-check sketch follows this list):
    tier: None
    gender: None
    location: NY
    generation: genz
    category: None
    product_type: None
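
As a cross-check that does not depend on LLaMA-Factory's generation defaults, the merged export can also be loaded with plain transformers and decoded greedily. This is only a rough sketch: MERGED_DIR and PROMPT are placeholders, and the raw prompt does not go through LLaMA-Factory's default chat template, which is itself a possible source of divergence.

    # Rough sketch (not the LLaMA-Factory code path): load the merged export with
    # plain transformers and decode greedily, so sampling randomness is ruled out.
    # MERGED_DIR and PROMPT are placeholders for the path/prompt used above.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MERGED_DIR = "/path/to/Phi-3-mini-128k-Fashion-Full-New"  # placeholder
    PROMPT = "..."  # the same prompt as above, formatted by hand

    tokenizer = AutoTokenizer.from_pretrained(MERGED_DIR, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MERGED_DIR, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
    )

    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))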

Now, I want to run the above model in the Ollama framework. To do that, I need to convert the merged model to GGUF format. I follow the steps below:

  1. I use the convert-hf-to-gguf.py script in llama.cpp to convert the merged weights into a GGUF file. The command is: python convert-hf-to-gguf.py /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New --outfile /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf
  2. Then, using the generated GGUF file, I perform inference with llama.cpp using the command: ./main -m /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf -ins --interactive-first --temp 0.01 -c 6000 --top-k 50 --top-p 0.7 -n 1024
  3. When I pass the same prompt as in the case above, I get the following output:
    {
    "tier": "Nano",
    "gender": "None",
    "location": "NY",
    "category": "party wear",
    "generation": "genz"
    }

    Comparing this to the output above, there is a clear difference between the two models. I have tried this multiple times, but the two outputs always differ (a llama.cpp-side cross-check sketch follows this list).
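
To compare the GGUF path under the same conditions, the converted file can also be run through llama-cpp-python (assuming it is installed) with greedy-style settings, so its output can be set against the transformers reference above. Again only a rough sketch: GGUF_PATH and PROMPT are placeholders, and the raw prompt bypasses any chat template.

    # Rough sketch: run the converted GGUF file via llama-cpp-python with
    # greedy-style settings, so its output can be compared with the
    # transformers reference above. GGUF_PATH and PROMPT are placeholders.
    from llama_cpp import Llama

    GGUF_PATH = "/path/to/Phi-3-mini-128k-Fashion-Full-New.gguf"  # placeholder
    PROMPT = "..."  # the same prompt as above

    llm = Llama(model_path=GGUF_PATH, n_ctx=6000)
    out = llm(
        PROMPT,
        max_tokens=256,
        temperature=0.0,  # effectively removes sampling randomness
        top_k=1,
        top_p=1.0,
    )
    print(out["choices"][0]["text"])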

I have also tried the mistral.rs inference framework, which uses the safetensors files directly (no GGUF conversion needed, so it is directly comparable to LLaMA-Factory). Its output is also similar to the llama.cpp output, which makes me believe that either I am missing something when running inference in the other two frameworks, or LLaMA-Factory is using some file or parameters that the other frameworks are not.

My initial suspicion is that this is caused by a difference in inference parameters. I ran python ../../src/cli_demo.py --help to see all the parameters and values that are available, but there are so many of them and no indication of which ones are actually used during inference.
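
One concrete thing to check is whether the exported directory ships a generation_config.json, since transformers-based inference reads its defaults unless they are overridden. Below is a small sketch (assuming that file exists, with MERGED_DIR as a placeholder) to print those values and compare them with the --temp/--top-k/--top-p flags passed to llama.cpp.

    # Sketch: print the generation defaults exported alongside the merged model,
    # assuming a generation_config.json exists in the export directory.
    # MERGED_DIR is a placeholder for the export path used above.
    from transformers import GenerationConfig

    MERGED_DIR = "/path/to/Phi-3-mini-128k-Fashion-Full-New"  # placeholder

    gen_config = GenerationConfig.from_pretrained(MERGED_DIR)
    print(gen_config)  # e.g. do_sample, temperature, top_p, top_k, max_new_tokens, ...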

Can someone please help me figure out how to resolve this issue?

Expected behavior

The expected behaviour is that the output from LLaMA-Factory inference matches the output from the llama.cpp and mistral.rs frameworks.

System Info

The following is from the llama_factory environment:

The following is from the llama.cpp environment:

Others

No response

anidh commented 4 months ago

Hi @hiyouga, if there is a way to see only the inference-time parameters, that would also be a good starting point for me.

richard28039 commented 2 months ago

Hello, I am doing the same thing and ran into the same issue. Is there any solution? Thanks.