intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

phi3 medium - garbage output in webui or generated by ollama #11177

Closed: js333031 closed this issue 5 months ago

js333031 commented 5 months ago

Attached, please find the output of the webui and of the ollama server console. At line 1 of the webui output, I ask a question using llama3:latest (line 3); the result is shown in lines 4-42.

At line 45, I ask the same question but using phi3:medium (line 47). The output that follows is garbage.

Attachments: ipex-llm-ollama-server.txt, webui_output.txt

sgwhat commented 5 months ago

Hi @js333031, we are working on reproducing your issue. In the meantime, could you also please try the following solutions?

  1. Pull the phi-3 model by running the command ollama pull phi3.
  2. Add a prompt template to your ollama Modelfile and create the ollama phi-3 model as below (see the example commands after this list):
    FROM ./Phi-3-medium-4k-instruct.gguf
    TEMPLATE """<|user|>
    {{.Prompt}}<|end|>
    <|assistant|>"""
    PARAMETER stop <|end|>
    PARAMETER num_ctx 4096
    PARAMETER num_gpu 33
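
For example, a minimal command sequence to build and test a model from a Modelfile like the one above would look roughly as follows (the model name phi3-medium-local is only an illustrative choice, and the template above is assumed to be saved as a file named Modelfile next to the gguf file):

    ollama create phi3-medium-local -f Modelfile   # build the local model from the gguf plus template
    ollama run phi3-medium-local                   # open an interactive prompt to check the output
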
js333031 commented 5 months ago

I tried that via the webui GUI. I get an error when I click on the Save & Create button (screenshot attached).

sgwhat commented 5 months ago

Hi @js333031, could you please try the following methods:

  1. In your terminal, run ollama pull phi3 to download the model.
  2. If the first method does not work, please use the ollama Modelfile from https://github.com/intel-analytics/ipex-llm/issues/11177#issuecomment-2141116088 to create the ollama phi3 model from the gguf file (see the linked comment for more details).

js333031 commented 5 months ago

Please provide complete steps.

On Sun, Jun 2, 2024 at 10:01 PM SONG Ge @.***> wrote:

hi @js333031, could you please try the second method to create the ollama phi3 model from the gguf file?

sgwhat commented 5 months ago

You may see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#using-ollama-run-gguf-models for the detailed steps.

Hi @js333031, I apologize for my mistake: ollama pull phi3 only pulls the phi-3-mini model. There is indeed an abnormal output issue with phi-3-medium. As a workaround, you may refer to the Modelfile below when creating your ollama model from the gguf file, to avoid the abnormal output.

FROM Phi-3-medium-4k-instruct-Q4_K_S.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33

Once we resolve the abnormal output issue, we will inform you immediately.
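
To rule out the webui layer while testing this workaround, you can also query the ollama server directly through its generate endpoint (the model name phi3-medium below is only an example for whatever name you give ollama create, and the server is assumed to be on the default port 11434):

    curl http://localhost:11434/api/generate -d '{
      "model": "phi3-medium",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'
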

flekol commented 5 months ago

Hi @sgwhat, I have the same problem.

I followed your suggestions, and the results are the same.

I used the model Phi-3-medium-4k-instruct-Q5_K_M.gguf with your template:

FROM /llm/models/Phi-3-medium-4k-instruct-Q5_K_M.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33

And I also get garbage output (screenshot attached).

I also saw that you added another auto-tokenizer in the recent commit (15a6205790038a6efa0f964e6aee39d42e3a10cd), and the tokenizer model used here with phi is: tokenizer.ggml.model str = llama

Could this be the issue?

This is my console log (I run it in the docker container):

root@ai:/llm/scripts# bash start-ollama.sh
root@ai:/llm/scripts# 2024/06/03 21:08:59 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1794338369/runners
time=2024-06-03T21:08:59.842+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-06-03T21:08:59.844+02:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.6 GiB" available="2.6 GiB"

root@ai:/llm/scripts# bash start-open-webui.sh
Cannot determine model snapshot path: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
Traceback (most recent call last):
  File "/llm/open-webui/backend/apps/rag/utils.py", line 396, in get_model_path
    model_repo_path = snapshot_download(**snapshot_kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/_snapshot_download.py", line 220, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
modules.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 1.39MB/s]
config_sentence_transformers.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 555kB/s]
README.md: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.7k/10.7k [00:00<00:00, 24.9MB/s]
sentence_bert_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 181kB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 3.19MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:02<00:00, 44.9MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 1.66MB/s]
vocab.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 1.09MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 1.53MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 565kB/s]
1_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 876kB/s]
INFO:     Started server process [51]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
time=2024-06-03T21:15:04.659+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=33 layers.real=33 memory.available="1.4 GiB" memory.required.full="9.6 GiB" memory.required.partial="7.8 GiB" memory.required.kv="50.0 MiB" memory.weights.total="9.3 GiB" memory.weights.repeating="9.2 GiB" memory.weights.nonrepeating="128.4 MiB" memory.graph.full="33.3 MiB" memory.graph.partial="33.3 MiB"
time=2024-06-03T21:15:04.660+02:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama1794338369/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-5e9d850d6c899e7fdf39a19cdf6fecae225e0c5bb3d13d6f277cbda508a15f0c --ctx-size 256 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 34173"
time=2024-06-03T21:15:04.660+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-03T21:15:04.660+02:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-03T21:15:04.663+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
llama_model_loader: loaded meta data with 30 key-value pairs and 243 tensors from /root/.ollama/models/blobs/sha256-5e9d850d6c899e7fdf39a19cdf6fecae225e0c5bb3d13d6f277cbda508a15f0c (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv   6:                           phi3.block_count u32              = 40
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 17
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models/Phi-3-medium-4k-instruct-GGUF...
llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_data/calibration_data.txt
llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 160
llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 234
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_K:  101 tensors
llama_model_loader: - type q6_K:   61 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 10
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1280
llm_load_print_meta: n_embd_v_gqa     = 1280
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 17920
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 14B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 13.96 B
llm_load_print_meta: model size       = 9.38 GiB (5.77 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
time=2024-06-03T21:15:04.914+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 1|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.35.27191.42|
| 2|     [opencl:cpu:0]|          13th Gen Intel Core i5-13600K|    3.0|     20|    8192|   64| 67175M|2023.16.12.0.12_195853.xmain-hotfix|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     20|67108864|   64| 67175M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  9499.15 MiB
llm_load_tensors:        CPU buffer size =   107.64 MiB
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =    50.00 MiB
llama_new_context_with_model: KV self size  =   50.00 MiB, K (f16):   25.00 MiB, V (f16):   25.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    85.25 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     5.25 MiB
llama_new_context_with_model: graph nodes  = 1646
llama_new_context_with_model: graph splits = 2
time=2024-06-03T21:15:16.223+02:00 level=INFO source=server.go:571 msg="llama runner started in 11.56 seconds"
sgwhat commented 5 months ago

Hi @flekol, you may try setting export OLLAMA_NUM_GPU=33 before starting the ollama server; this is a feasible workaround. By the way, may I take a look at your start-ollama.sh?
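
For reference, the workaround amounts to exporting the variable before the server is launched; an illustrative minimal version is below (the binary path is an assumption, and the actual start-ollama.sh in the container may differ):

    export OLLAMA_NUM_GPU=33   # ask ollama to offload all 33 layers to the Intel GPU
    ./ollama serve             # then start the server as usual
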

js333031 commented 5 months ago

I downloaded a different model file and used your template. Still getting garbage:

huggingface-cli download bartowski/Phi-3-medium-4k-instruct-GGUF --include "Phi-3-medium-4k-instruct-Q4_K_M.gguf" --local-dir ./

ollama create example -f ModelFile

Modelfile:

FROM Phi-3-medium-4k-instruct-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33

js333031 commented 5 months ago

I also tried Phi-3-medium-4k-instruct-Q4_K_S.gguf and the output is garbage as well.

sgwhat commented 5 months ago

Hi @js333031, we have fixed the abnormal output issue and the fix will be released tonight. Tomorrow you may run the command below to install the latest version of ipex-llm[cpp] (version 2.1.0b20240605) and then initialize ollama:

pip install --pre --upgrade ipex-llm[cpp]

init-ollama
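
Once the new build is available, a quick way to confirm the upgrade before re-creating the model would be something like the following (pip show is standard pip; the model name example simply matches the Modelfile you used earlier):

    pip show ipex-llm                    # the Version field should report 2.1.0b20240605 or later
    ollama create example -f Modelfile   # then re-create the model from the same Modelfile and re-test
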
flekol commented 5 months ago

@sgwhat thanks a lot, it works for me now in the docker container!

By the way, which commit fixed this?

js333031 commented 5 months ago

Hi @js333031, we have fixed the abnormal output issue and the fix will be released tonight. Tomorrow you may run the command below to install the latest version of ipex-llm[cpp] (version 2.1.0b20240605) and then initialize ollama:

pip install --pre --upgrade ipex-llm[cpp]

init-ollama

I'm able to use the model now. Thanks for the fix.