hi @js333031, we are working on reproducing your issue. Could you also please try the following solutions?
1. ollama pull phi3
2. Create the model from the gguf file with a Modelfile like the one below:
FROM ./Phi-3-medium-4k-instruct.gguf
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER stop <|end|>
PARAMETER num_ctx 4096
PARAMETER num_gpu 33
I tried that via the webui GUI. I get an error when I click on the Save & Create button:
hi @js333031 , could you please try the following methods:
ollama pull phi3
to download the model.

Please provide complete steps.
On Sun, Jun 2, 2024 at 10:01 PM SONG Ge wrote:
hi @js333031, could you please try the second method to create the ollama phi3 model from the gguf file?
You may see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#using-ollama-run-gguf-models for the detailed steps.
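For example, a minimal sketch of the end-to-end steps from that quickstart, assuming the gguf file and the Modelfile above are in the current directory (the model name example is just a placeholder):
# start the ipex-llm ollama server in one terminal
ollama serve
# in another terminal, create the model from the Modelfile and run it
ollama create example -f Modelfile
ollama run example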
Hi @js333031 , I apologize for my mistake: ollama pull phi3 will only pull the phi-3-mini model. There is indeed an abnormal output issue with phi-3-medium. You may refer to the modelfile below as a workaround to avoid the abnormal output when using a gguf file to create your ollama model.
FROM Phi-3-medium-4k-instruct-Q4_K_S.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33
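For example, assuming the modelfile above is saved as Modelfile next to the gguf file, creating and testing the model could look like this (phi3-medium is just a placeholder name):
# create the ollama model from the workaround Modelfile and try a prompt
ollama create phi3-medium -f Modelfile
ollama run phi3-medium "Why is the sky blue?"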
Once we resolve the abnormal output issue, we will inform you immediately.
Hi @sgwhat , I got the same problem.
I followed your suggestions and the results are the same.
I took the model Phi-3-medium-4k-instruct-Q5_K_M.gguf and used your template:
FROM /llm/models/Phi-3-medium-4k-instruct-Q5_K_M.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33
And I also get garbage:
So I saw that you added another auto-tokenizer in the recent commit (15a6205790038a6efa0f964e6aee39d42e3a10cd), and the one used here with phi is: tokenizer.ggml.model str = llama
Could this be the issue?
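As a side check, the tokenizer metadata baked into the gguf file can also be dumped outside of ollama; a sketch assuming the gguf Python package and its gguf-dump script are installed (not part of the setup described in this thread):
# dump the gguf key-value metadata and filter for the tokenizer fields
pip install gguf
gguf-dump /llm/models/Phi-3-medium-4k-instruct-Q5_K_M.gguf | grep tokenizer.ggml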
This is my console log (I run it in the Docker container):
root@ai:/llm/scripts# bash start-ollama.sh
root@ai:/llm/scripts# 2024/06/03 21:08:59 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1794338369/runners
time=2024-06-03T21:08:59.842+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-06-03T21:08:59.844+02:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.6 GiB" available="2.6 GiB"
root@ai:/llm/scripts# bash start-open-webui.sh
Cannot determine model snapshot path: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
Traceback (most recent call last):
File "/llm/open-webui/backend/apps/rag/utils.py", line 396, in get_model_path
model_repo_path = snapshot_download(**snapshot_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/_snapshot_download.py", line 220, in snapshot_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
modules.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 1.39MB/s]
config_sentence_transformers.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 555kB/s]
README.md: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.7k/10.7k [00:00<00:00, 24.9MB/s]
sentence_bert_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 181kB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 3.19MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:02<00:00, 44.9MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 1.66MB/s]
vocab.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 1.09MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 1.53MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 565kB/s]
1_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 876kB/s]
INFO: Started server process [51]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
time=2024-06-03T21:15:04.659+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=33 layers.real=33 memory.available="1.4 GiB" memory.required.full="9.6 GiB" memory.required.partial="7.8 GiB" memory.required.kv="50.0 MiB" memory.weights.total="9.3 GiB" memory.weights.repeating="9.2 GiB" memory.weights.nonrepeating="128.4 MiB" memory.graph.full="33.3 MiB" memory.graph.partial="33.3 MiB"
time=2024-06-03T21:15:04.660+02:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama1794338369/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-5e9d850d6c899e7fdf39a19cdf6fecae225e0c5bb3d13d6f277cbda508a15f0c --ctx-size 256 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 34173"
time=2024-06-03T21:15:04.660+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-03T21:15:04.660+02:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-03T21:15:04.663+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
llama_model_loader: loaded meta data with 30 key-value pairs and 243 tensors from /root/.ollama/models/blobs/sha256-5e9d850d6c899e7fdf39a19cdf6fecae225e0c5bb3d13d6f277cbda508a15f0c (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.name str = Phi3
llama_model_loader: - kv 2: phi3.context_length u32 = 4096
llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 4: phi3.embedding_length u32 = 5120
llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 17920
llama_model_loader: - kv 6: phi3.block_count u32 = 40
llama_model_loader: - kv 7: phi3.attention.head_count u32 = 40
llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 10
llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 128
llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 17
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% for message in messages %}{% if (m...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: quantize.imatrix.file str = /models/Phi-3-medium-4k-instruct-GGUF...
llama_model_loader: - kv 27: quantize.imatrix.dataset str = /training_data/calibration_data.txt
llama_model_loader: - kv 28: quantize.imatrix.entries_count i32 = 160
llama_model_loader: - kv 29: quantize.imatrix.chunks_count i32 = 234
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q5_K: 101 tensors
llama_model_loader: - type q6_K: 61 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 10
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1280
llm_load_print_meta: n_embd_v_gqa = 1280
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 17920
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 14B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 13.96 B
llm_load_print_meta: model size = 9.38 GiB (5.77 BPW)
llm_load_print_meta: general.name = Phi3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
time=2024-06-03T21:15:04.914+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
found 4 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.26241|
| 1| [opencl:gpu:0]| Intel Arc A770 Graphics| 3.0| 512| 1024| 32| 16225M| 23.35.27191.42|
| 2| [opencl:cpu:0]| 13th Gen Intel Core i5-13600K| 3.0| 20| 8192| 64| 67175M|2023.16.12.0.12_195853.xmain-hotfix|
| 3| [opencl:acc:0]| Intel FPGA Emulation Device| 1.2| 20|67108864| 64| 67175M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: SYCL0 buffer size = 9499.15 MiB
llm_load_tensors: CPU buffer size = 107.64 MiB
llama_new_context_with_model: n_ctx = 256
llama_new_context_with_model: n_batch = 256
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 50.00 MiB
llama_new_context_with_model: KV self size = 50.00 MiB, K (f16): 25.00 MiB, V (f16): 25.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.14 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 85.25 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 5.25 MiB
llama_new_context_with_model: graph nodes = 1646
llama_new_context_with_model: graph splits = 2
time=2024-06-03T21:15:16.223+02:00 level=INFO source=server.go:571 msg="llama runner started in 11.56 seconds"
Hi @flekol , you may try to set export OLLAMA_NUM_GPU=33 before starting the ollama server; this is a feasible workaround. BTW, may I take a look at your start-ollama.sh?
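For reference, a minimal sketch of what that could look like at the top of a start script; the script contents here are an assumption, not the actual start-ollama.sh shipped in the image:
# hypothetical start-ollama.sh sketch: set the GPU layer count before launching the server
export OLLAMA_NUM_GPU=33
./ollama serve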
I downloaded a different model file and used your template. Still getting garbage:
huggingface-cli download bartowski/Phi-3-medium-4k-instruct-GGUF --include "Phi-3-medium-4k-instruct-Q4_K_M.gguf" --local-dir ./
ollama create example -f ModelFile
Modelfile:
FROM Phi-3-medium-4k-instruct-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33
I also tried Phi-3-medium-4k-instruct-Q4_K_S.gguf and the output is garbage as well.
Hi @js333031 , we have fixed the abnormal output issue and it will be released tonight. You may run the commands below tomorrow to install the latest version of ipex-llm[cpp] (version 2.1.0b20240605) and initialize ollama:
pip install --pre --upgrade ipex-llm[cpp]
init-ollama
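Put together, a rough sketch of the upgrade-and-retry flow once the new build is available (example is a placeholder model name):
# upgrade to the nightly build containing the fix and re-initialize the ollama binary
pip install --pre --upgrade ipex-llm[cpp]
init-ollama
# restart the ollama server, then recreate and rerun the model from the Modelfile
ollama create example -f Modelfile
ollama run example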
@sgwhat thanks a lot, it works for me now in the docker container!
BTW, which commit fixed this?
Hi @js333031 , we have fixed the abnormal output issue and it will be released tonight. You may run the commands below tomorrow to install the latest version of ipex-llm[cpp] (version 2.1.0b20240605) and initialize ollama: pip install --pre --upgrade ipex-llm[cpp]; init-ollama
I'm able to use the model now. Thanks for the fix.
Attached, please find the output of the webui and the ollama server console. At line 1 of the webui output, I ask a question using llama3:latest (line 3). The result is shown in lines 4-42.
At line 45, I ask the same question but using phi3:medium (line 47). The output follows and it is garbage.
ipex-llm-ollama-server.txt webui_output.txt