intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Looking for a workaround: installed IPEX-LLM on Windows with an Intel GPU, but it runs on the CPU and not on the GPU #11401

Closed: ChordNT closed this issue 1 month ago

ChordNT commented 3 months ago

All I need is to run Llama 3 with Ollama on an Intel GPU (Arc™ A750). I followed the steps described in the IPEX-LLM documentation, but it runs on the CPU. Search engines haven't turned up a solution to this problem. Could someone take a look at where the problem is? Thank you.

Here are the steps I followed from the quickstart in the official IPEX-LLM documentation:

1. Install IPEX-LLM on Windows with Intel GPU

1.1 Setup Python Environment

  1. conda create -n llm python=3.11 libuv
  2. conda activate llm
  3. pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
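
For reference, a quick sanity check along these lines (not part of the quickstart; it assumes the XPU-enabled torch and intel_extension_for_pytorch build that ipex-llm[xpu] installs) can confirm whether the GPU is visible to PyTorch at all, independent of Ollama:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 - registers the 'xpu' device with torch

# If this prints False or lists no devices, the GPU driver / oneAPI setup is the
# problem, before Ollama or llama.cpp are even involved.
print("XPU available:", torch.xpu.is_available())
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))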

2. Run llama.cpp with IPEX-LLM on Intel GPU

Instead of executing conda create -n llm-cpp python=3.11 and conda activate llm-cpp as stated in the documentation, I directly reuse the llm virtual environment from step 1.

  1. pip install --pre --upgrade ipex-llm[cpp]

  2. mkdir llama-cpp

  3. cd llama-cpp

  4. init-llama-cpp.bat

3. Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM

3.1 Run Llama3 using Ollama

3.1.1 Run Ollama Serve

set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1

ollama serve

TriDefender commented 3 months ago

Try adding this after loading the model: model = model.to('xpu'). Otherwise the model will remain on the CPU and will not move to the GPU.
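
For context, that suggestion applies to the ipex-llm Python (transformers-style) API rather than to the Ollama server itself. A minimal sketch of what it refers to, assuming the ipex_llm.transformers wrapper and using an example model id, might look like this:

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # ipex-llm drop-in replacement for transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example id; use whichever model you have locally
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, trust_remote_code=True)
model = model.to('xpu')  # without this line the model stays on the CPU
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is an Intel Arc A750?", return_tensors="pt").to('xpu')
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))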

sgwhat commented 3 months ago

Hi @Fucalors ,

  1. Could you please provide the complete runtime log of the Ollama server side during model inference?
  2. Could you please run ls-sycl-device.exe and reply with the output?
ChordNT commented 3 months ago

Hi,

  1. Could you please provide the complete runtime log of the Ollama server side during model inference?
  2. Could you please run ls-sycl-device.exe and reply with the output?

Are you talking about these two logs?

Screenshot 2024-06-24 151005  Screenshot 2024-06-24 152258
sgwhat commented 3 months ago

Hi @Fucalors, I don't think you are running ipex-llm ollama. Please double-check your environment and installation method. You may refer to our documentation at https://ipex-llm-latest.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html for installing Ollama.
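
One quick way to tell which binary is actually being launched: the logs later in this thread show the stock installer reporting a plain version ("0.3.7") while the ipex-llm build reports an "-ipexllm-" suffix ("0.3.6-ipexllm-20240827"). A throwaway helper like the following (purely illustrative, not part of ipex-llm) prints which ollama executable is on PATH and its reported version:

import shutil
import subprocess

# Resolve the ollama executable the shell would run and ask it for its version.
# An ipex-llm build reports something like "0.3.6-ipexllm-20240827".
path = shutil.which("ollama")
print("ollama resolves to:", path)
if path:
    subprocess.run([path, "-v"], check=False)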

ayttop commented 1 month ago

OLLAMA_INTEL_GPU:false?!

(1) C:\Users\ArabTech\Desktop\1>ollama serve
2024/08/27 17:15:31 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\ArabTech\.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\ArabTech\AppData\Local\Programs\Ollama\lib\ollama\runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-27T17:15:31.504-07:00 level=INFO source=images.go:753 msg="total blobs: 17"
time=2024-08-27T17:15:31.505-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-08-27T17:15:31.506-07:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.7)"
time=2024-08-27T17:15:31.506-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 rocm_v6.1 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-08-27T17:15:31.506-07:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-08-27T17:15:31.514-07:00 level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered"
time=2024-08-27T17:15:31.514-07:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant=avx2 compute="" driver=0.0 name="" total="63.8 GiB" available="53.0 G

ayttop commented 1 month ago

(base) C:\Windows\System32>conda activate 1

(1) C:\Windows\System32>cd "C:\Users\ArabTech\Desktop\1"

(1) C:\Users\ArabTech\Desktop\1>ollama
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve    Start ollama
  create   Create a model from a Modelfile
  show     Show information for a model
  run      Run a model
  pull     Pull a model from a registry
  push     Push a model to a registry
  list     List models
  ps       List running models
  cp       Copy a model
  rm       Remove a model
  help     Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

(1) C:\Users\ArabTech\Desktop\1>set OLLAMA_NUM_GPU=999

(1) C:\Users\ArabTech\Desktop\1>set no_proxy=localhost,127.0.0.1

(1) C:\Users\ArabTech\Desktop\1>set ZES_ENABLE_SYSMAN=1

(1) C:\Users\ArabTech\Desktop\1>set SYCL_CACHE_PERSISTENT=1

(1) C:\Users\ArabTech\Desktop\1> (1) C:\Users\ArabTech\Desktop\1>ollama serve 2024/08/27 17:15:31 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\ArabTech\.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\ArabTech\AppData\Local\Programs\Ollama\lib\ollama\runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-08-27T17:15:31.504-07:00 level=INFO source=images.go:753 msg="total blobs: 17" time=2024-08-27T17:15:31.505-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" time=2024-08-27T17:15:31.506-07:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.7)" time=2024-08-27T17:15:31.506-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 rocm_v6.1 cpu cpu_avx cpu_avx2 cuda_v11]" time=2024-08-27T17:15:31.506-07:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs" time=2024-08-27T17:15:31.514-07:00 level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered" time=2024-08-27T17:15:31.514-07:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant=avx2 compute="" driver=0.0 name="" total="63.8 GiB" available="53.0 GiB" [GIN] 2024/08/27 - 17:17:58 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/08/27 - 17:17:58 | 200 | 507.7µs | 127.0.0.1 | GET "/api/tags" [GIN] 2024/08/27 - 17:18:12 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/08/27 - 17:18:12 | 200 | 7.5954ms | 127.0.0.1 | POST "/api/show" time=2024-08-27T17:18:12.496-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[53.0 GiB]" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="2.5 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.8 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="102.8 MiB" memory.graph.full="548.0 MiB" memory.graph.partial="543.0 MiB" time=2024-08-27T17:18:12.499-07:00 level=INFO source=server.go:391 msg="starting llama server" cmd="C:\Users\ArabTech\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\ArabTech\.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 56775" time=2024-08-27T17:18:12.501-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2024-08-27T17:18:12.501-07:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" time=2024-08-27T17:18:12.501-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3535 commit="1e6f6554" tid="2600" timestamp=1724804292 INFO [wmain] system info | n_threads=14 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI 
= 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="2600" timestamp=1724804292 total_threads=28 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="56775" tid="2600" timestamp=1724804292 llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from C:\Users\ArabTech.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = phi2 llama_model_loader: - kv 1: general.name str = Phi2 llama_model_loader: - kv 2: phi2.context_length u32 = 2048 llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560 llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240 llama_model_loader: - kv 5: phi2.block_count u32 = 32 llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32 llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32 llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010 llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",... 
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256 llama_model_loader: - kv 19: general.quantization_version u32 = 2 llama_model_loader: - type f32: 195 tensors llama_model_loader: - type q4_0: 129 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default' llm_load_vocab: special tokens cache size = 944 llm_load_vocab: token to piece cache size = 0.3151 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = phi2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 51200 llm_load_print_meta: n_merges = 50000 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 2560 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 80 llm_load_print_meta: n_embd_head_v = 80 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 2560 llm_load_print_meta: n_embd_v_gqa = 2560 llm_load_print_meta: f_norm_eps = 1.0e-05 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 10240 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 2.78 B llm_load_print_meta: model size = 1.49 GiB (4.61 BPW) llm_load_print_meta: general.name = Phi2 llm_load_print_meta: BOS token = 50256 '<|endoftext|>' llm_load_print_meta: EOS token = 50256 '<|endoftext|>' llm_load_print_meta: UNK token = 50256 '<|endoftext|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 50256 '<|endoftext|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.15 MiB llm_load_tensors: CPU buffer size = 1526.50 MiB time=2024-08-27T17:18:12.756-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 2560.00 MiB llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB llama_new_context_with_model: CPU output buffer size = 0.82 MiB llama_new_context_with_model: CPU compute buffer size = 563.01 MiB llama_new_context_with_model: graph nodes = 1225 llama_new_context_with_model: graph splits = 1 INFO [wmain] model loaded | tid="2600" timestamp=1724804294 
time=2024-08-27T17:18:14.547-07:00 level=INFO source=server.go:630 msg="llama runner started in 2.05 seconds" [GIN] 2024/08/27 - 17:18:14 | 200 | 2.0620629s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/27 - 17:18:25 | 200 | 1.573856s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/27 - 17:18:51 | 200 | 5.172217s | 127.0.0.1 | POST "/api/chat" time=2024-08-27T17:19:22.690-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=32 layers.model=33 layers.offload=0 layers.split="" memory.available="[53.0 GiB]" memory.required.full="4.5 GiB" memory.required.partial="0 B" memory.required.kv="2.5 GiB" memory.required.allocations="[4.5 GiB]" memory.weights.total="3.8 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="102.8 MiB" memory.graph.full="548.0 MiB" memory.graph.partial="543.0 MiB" time=2024-08-27T17:19:22.692-07:00 level=INFO source=server.go:391 msg="starting llama server" cmd="C:\Users\ArabTech\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\ArabTech\.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 32 --no-mmap --parallel 4 --port 56792" time=2024-08-27T17:19:22.693-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2024-08-27T17:19:22.693-07:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" time=2024-08-27T17:19:22.693-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="13448" timestamp=1724804362 INFO [wmain] build info | build=3535 commit="1e6f6554" tid="13448" timestamp=1724804362 INFO [wmain] system info | n_threads=14 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="13448" timestamp=1724804362 total_threads=28 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="56792" tid="13448" timestamp=1724804362 llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from C:\Users\ArabTech.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = phi2 llama_model_loader: - kv 1: general.name str = Phi2 llama_model_loader: - kv 2: phi2.context_length u32 = 2048 llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560 llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240 llama_model_loader: - kv 5: phi2.block_count u32 = 32 llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32 llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32 llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010 llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256 llama_model_loader: - kv 19: general.quantization_version u32 = 2 llama_model_loader: - type f32: 195 tensors llama_model_loader: - type q4_0: 129 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default' llm_load_vocab: special tokens cache size = 944 llm_load_vocab: token to piece cache size = 0.3151 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = phi2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 51200 llm_load_print_meta: n_merges = 50000 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 2560 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 80 llm_load_print_meta: n_embd_head_v = 80 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 2560 llm_load_print_meta: n_embd_v_gqa = 2560 llm_load_print_meta: f_norm_eps = 1.0e-05 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 10240 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 2.78 B llm_load_print_meta: model size = 1.49 GiB (4.61 BPW) llm_load_print_meta: general.name = Phi2 llm_load_print_meta: BOS token = 50256 '<|endoftext|>' llm_load_print_meta: EOS token = 50256 '<|endoftext|>' llm_load_print_meta: UNK token = 
50256 '<|endoftext|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 50256 '<|endoftext|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.15 MiB llm_load_tensors: CPU buffer size = 1526.50 MiB time=2024-08-27T17:19:22.946-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 2560.00 MiB llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB llama_new_context_with_model: CPU output buffer size = 0.82 MiB llama_new_context_with_model: CPU compute buffer size = 563.01 MiB llama_new_context_with_model: graph nodes = 1225 llama_new_context_with_model: graph splits = 1 INFO [wmain] model loaded | tid="13448" timestamp=1724804363 time=2024-08-27T17:19:23.815-07:00 level=INFO source=server.go:630 msg="llama runner started in 1.12 seconds" [GIN] 2024/08/27 - 17:19:31 | 200 | 9.1209986s | 127.0.0.1 | POST "/api/chat"

ayttop commented 1 month ago

It runs very well!

Use "ollama [command] --help" for more information about a command.

(1) C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2>ollama list
NAME                                              ID            SIZE    MODIFIED
phi:latest                                        e2fd6321a5fe  1.6 GB  7 hours ago
mxbai-embed-large:latest                          468836162de7  669 MB  26 hours ago
nomic-embed-text:latest                           0a109f422b47  274 MB  26 hours ago
llama-3.1-8b-lexi-uncensored-v2-q8_0.gguf:latest  0bfa6ffcece4  8.5 GB  27 hours ago

(1) C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2>ollama serve
Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

(1) C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2>cd C:\Users\ArabTech\Desktop\1\ipex-llm\

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set OLLAMA_NUM_GPU=999

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set no_proxy=localhost,127.0.0.1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set ZES_ENABLE_SYSMAN=1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set SYCL_CACHE_PERSISTENT=1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm> (1) C:\Users\ArabTech\Desktop\1\ipex-llm>ollama serve 2024/08/27 18:13:12 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\ArabTech\.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-08-27T18:13:12.381-07:00 level=INFO source=images.go:753 msg="total blobs: 17" time=2024-08-27T18:13:12.383-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(Server).EmbedHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(Server).ChatHandler-fm (6 handlers) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(Server).GenerateHandler-fm (6 handlers) [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(Server).EmbedHandler-fm (6 handlers) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(Server).ListModelsHandler-fm (6 handlers) [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(Server).ShowModelHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(Server).GenerateRoutes.func2 (5 handlers) time=2024-08-27T18:13:12.389-07:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6-ipexllm-20240827)" time=2024-08-27T18:13:12.389-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu]" [GIN] 2024/08/27 - 18:15:08 200 0s 127.0.0.1 HEAD "/" [GIN] 2024/08/27 - 18:15:08 200 1.1003ms 127.0.0.1 GET "/api/tags" [GIN] 2024/08/27 - 18:15:23 200 0s 127.0.0.1 HEAD "/" [GIN] 2024/08/27 - 18:15:23 200 3.5413ms 127.0.0.1 POST "/api/show" time=2024-08-27T18:15:23.811-07:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs" time=2024-08-27T18:15:23.816-07:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered" time=2024-08-27T18:15:23.822-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[52.9 GiB]" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="2.5 GiB" 
memory.required.allocations="[4.6 GiB]" memory.weights.total="3.8 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="102.8 MiB" memory.graph.full="548.0 MiB" memory.graph.partial="543.0 MiB" time=2024-08-27T18:15:23.826-07:00 level=INFO source=server.go:395 msg="starting llama server" cmd="C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\ArabTech\.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 58132" time=2024-08-27T18:15:23.827-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2024-08-27T18:15:23.827-07:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding" time=2024-08-27T18:15:23.828-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info build=1 commit="6f4ec98" tid="10164" timestamp=1724807723 INFO [wmain] system info n_threads=20 n_threads_batch=-1 system_info="AVX = 0 AVX_VNNI = 0 AVX2 = 0 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 AVX512_BF16 = 0 FMA = 0 NEON = 0 SVE = 0 ARM_FMA = 0 F16C = 0 FP16_VA = 0 WASM_SIMD = 0 BLAS = 1 SSE3 = 0 SSSE3 = 0 VSX = 0 MATMUL_INT8 = 0 LLAMAFILE = 1 " tid="10164" timestamp=1724807723 total_threads=28 INFO [wmain] HTTP server listening hostname="127.0.0.1" n_threads_http="27" port="58132" tid="10164" timestamp=1724807723 llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from C:\Users\ArabTech.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = phi2 llama_model_loader: - kv 1: general.name str = Phi2 llama_model_loader: - kv 2: phi2.context_length u32 = 2048 llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560 llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240 llama_model_loader: - kv 5: phi2.block_count u32 = 32 llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32 llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32 llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010 llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",... 
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256 llama_model_loader: - kv 19: general.quantization_version u32 = 2 llama_model_loader: - type f32: 195 tensors llama_model_loader: - type q4_0: 129 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: missing pre-tokenizer type, using: 'default' llm_load_vocab: llm_load_vocab: **** llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED! llm_load_vocab: CONSIDER REGENERATING THE MODEL llm_load_vocab: **** llm_load_vocab: llm_load_vocab: special tokens cache size = 944 llm_load_vocab: token to piece cache size = 0.3151 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = phi2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 51200 llm_load_print_meta: n_merges = 50000 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 2560 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 80 llm_load_print_meta: n_embd_head_v = 80 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 2560 llm_load_print_meta: n_embd_v_gqa = 2560 llm_load_print_meta: f_norm_eps = 1.0e-05 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 10240 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 2.78 B llm_load_print_meta: model size = 1.49 GiB (4.61 BPW) llm_load_print_meta: general.name = Phi2 llm_load_print_meta: BOS token = 50256 '< endoftext >' llm_load_print_meta: EOS token = 50256 '< endoftext >' llm_load_print_meta: UNK token = 50256 '< endoftext >' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 50256 '< endoftext >' llm_load_print_meta: max token length = 256 time=2024-08-27T18:15:24.083-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model" ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 1 SYCL devices: llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: SYCL0 buffer size = 1456.19 MiB llm_load_tensors: SYCL_Host buffer size = 70.31 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: 
freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 [SYCL] call ggml_check_sycl ggml_check_sycl: GGML_SYCL_DEBUG: 0 ggml_check_sycl: GGML_SYCL_F16: no found 1 SYCL devices:
ID  Device Type         Name                    Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
0   [level_zero:gpu:0]  Intel UHD Graphics 770  1.5      32                 512             32             31709M           1.3.30398
llama_kv_cache_init: SYCL0 KV buffer size = 2560.00 MiB llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB llama_new_context_with_model: SYCL_Host output buffer size = 0.82 MiB llama_new_context_with_model: SYCL0 compute buffer size = 603.00 MiB llama_new_context_with_model: SYCL_Host compute buffer size = 21.01 MiB llama_new_context_with_model: graph nodes = 1257 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded tid="10164" timestamp=1724807728 time=2024-08-27T18:15:28.364-07:00 level=INFO source=server.go:634 msg="llama runner started in 4.54 seconds" [GIN] 2024/08/27 - 18:15:28 200 4.5565764s 127.0.0.1 POST "/api/chat" INFO [print_timings] prompt eval time = 752.82 ms / 39 tokens ( 19.30 ms per token, 51.81 tokens per second) n_prompt_tokens_processed=39 n_tokens_second=51.80555647801916 slot_id=0 t_prompt_processing=752.815 t_token=19.30294871794872 task_id=4 tid="10164" timestamp=1724807876 INFO [print_timings] generation eval time = 8637.51 ms / 77 runs ( 112.18 ms per token, 8.91 tokens per second) n_decoded=77 n_tokens_second=8.914607209092344 slot_id=0 t_token=112.17544155844156 t_token_generation=8637.509 task_id=4 tid="10164" timestamp=1724807876 INFO [print_timings] total time = 9390.32 ms slot_id=0 t_prompt_processing=752.815 t_token_generation=8637.509 t_total=9390.324 task_id=4 tid="10164" timestamp=1724807876 [GIN] 2024/08/27 - 18:17:56 200 9.3988916s 127.0.0.1 POST "/api/chat" [GIN] 2024/08/27 - 18:18:34 200 0s 127.0.0.1 HEAD "/" [GIN] 2024/08/27 - 18:18:34 200 9.3004ms 127.0.0.1 POST "/api/show" time=2024-08-27T18:18:34.995-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=32 layers.model=33 layers.offload=0 layers.split="" memory.available="[47.9 GiB]" memory.required.full="7.5 GiB" memory.required.partial="0 B" memory.required.kv="256.0 MiB" memory.required.allocations="[7.5 GiB]" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" time=2024-08-27T18:18:35.001-07:00 level=INFO source=server.go:395 msg="starting llama server" cmd="C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\ArabTech\.ollama\models\blobs\sha256-20ee18469ac48c875af10c8f970b0a5371c73c7109bfdd3835615777f75bf26b --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 58198" time=2024-08-27T18:18:35.002-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=2 time=2024-08-27T18:18:35.002-07:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding" time=2024-08-27T18:18:35.002-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info build=1 commit="6f4ec98" tid="16608" timestamp=1724807915 INFO [wmain] system info n_threads=20 n_threads_batch=-1 system_info="AVX = 0 AVX_VNNI = 0 AVX2 = 0 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 AVX512_BF16 = 0 FMA = 0 NEON = 0 SVE = 0 ARM_FMA = 0 F16C = 0 FP16_VA = 0 WASM_SIMD = 0 BLAS = 1 SSE3 = 0 SSSE3 = 0 VSX = 0 MATMUL_INT8 = 0 LLAMAFILE = 1 " tid="16608" timestamp=1724807915 total_threads=28 INFO [wmain] HTTP server listening hostname="127.0.0.1" n_threads_http="27" port="58198" tid="16608" timestamp=1724807915 llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from 
C:\Users\ArabTech.ollama\models\blobs\sha256-20ee18469ac48c875af10c8f970b0a5371c73c7109bfdd3835615777f75bf26b (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8b Instruct llama_model_loader: - kv 3: general.version str = V2 llama_model_loader: - kv 4: general.organization str = Unsloth llama_model_loader: - kv 5: general.finetune str = instruct llama_model_loader: - kv 6: general.basename str = meta-llama-3.1 llama_model_loader: - kv 7: general.size_label str = 8B llama_model_loader: - kv 8: general.license str = llama3.1 llama_model_loader: - kv 9: llama.block_count u32 = 32 llama_model_loader: - kv 10: llama.context_length u32 = 131072 llama_model_loader: - kv 11: llama.embedding_length u32 = 4096 llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 13: llama.attention.head_count u32 = 32 llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 17: general.file_type u32 = 7 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 66 tensors llama_model_loader: - type q8_0: 226 tensors llm_load_vocab: special tokens cache size = 256 time=2024-08-27T18:18:35.253-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q8_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 7.95 GiB (8.50 BPW) llm_load_print_meta: general.name = Meta Llama 3.1 8b Instruct llm_load_print_meta: BOS token = 128000 '< begin_of_text >' llm_load_print_meta: EOS token = 128009 '< eot_id >' llm_load_print_meta: PAD token = 128004 '< finetune_right_pad_id >' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '< eot_id >' llm_load_print_meta: max token length = 256 ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 1 SYCL devices: llm_load_tensors: ggml ctx size = 0.27 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: SYCL0 buffer size = 7605.34 MiB llm_load_tensors: SYCL_Host buffer size = 532.31 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 [SYCL] call ggml_check_sycl ggml_check_sycl: GGML_SYCL_DEBUG: 0 ggml_check_sycl: GGML_SYCL_F16: no found 1 SYCL devices: Max Max Global compute Max work sub mem ID Device Type Name Version units group group size Driver version
0 [level_zero:gpu:0] Intel UHD Graphics 770 1.5 32 512 32 31709M 1.3.30398

llama_kv_cache_init: SYCL0 KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: SYCL_Host output buffer size = 2.02 MiB llama_new_context_with_model: SYCL0 compute buffer size = 258.50 MiB llama_new_context_with_model: SYCL_Host compute buffer size = 12.01 MiB llama_new_context_with_model: graph nodes = 1062 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="16608" timestamp=1724807924 time=2024-08-27T18:18:44.565-07:00 level=INFO source=server.go:634 msg="llama runner started in 9.56 seconds" [GIN] 2024/08/27 - 18:18:44 | 200 | 9.5933075s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/27 - 18:21:10 | 200 | 2m2s | 127.0.0.1 | POST "/api/chat"

ChordNT commented 1 month ago

运行得非常好

使用 “ollama [command] --help” 了解有关命令的更多信息。

(1) C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2>ollama list 名称 ID 大小已修改 phi:最新 e2fd6321a5fe 1.6 GB 7 小时前 mxbai-embed-large:最新 468836162de7 669 MB 26 小时前 nomic-embed-text:最新 0a109f422b47 274 MB 26 小时前 llama-3.1-8b-lexi-uncensored-v2-q8_0.gguf:最新 0bfa6ffcece4 8.5 GB 27 小时前

(1) C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2>ollama serve 错误:listen tcp 127.0.0.1:11434:bind:通常只允许每个套接字地址(协议/网络地址/端口)使用一次。

(1) C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2>cd C:\Users\ArabTech\Desktop\1\ipex-llm\

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set OLLAMA_NUM_GPU=999

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set no_proxy=localhost,127.0.0.1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set ZES_ENABLE_SYSMAN=1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set SYCL_CACHE_PERSISTENT=1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm>set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

(1) C:\Users\ArabTech\Desktop\1\ipex-llm> (1) C:\Users\ArabTech\Desktop\1\ipex-llm>ollama serve 2024/08/27 18:13:12 routes.go:1125: INFO 服务器配置 env=“map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\ArabTech.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost https://localhost http://127.0.0.1 https://127.0.0.1 http://127.0.0.1 https://127.0.0.1 http://0.0.0.0 https://0.0.0.0 http://0.0.0.0 https://0.0.0.0 app:// file:// tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]“ time=2024-08-27T18:13:12.381-07:00 level=INFO source=images.go:753 msg=”总 blobs: 17“ time=2024-08-27T18:13:12.383-07:00 level=INFO source=images.go:760 msg=”已删除的未使用 blob 总数: 0“ [GIN-debug] [警告] 创建已附加记录器和恢复中间件的引擎实例。

[GIN 调试][警告]在 “debug” 模式下运行。在生产环境中切换到 “release” 模式。

  • 使用环境:export GIN_MODE=release
  • 使用代码:gin。SetMode(gin.ReleaseMode 的 ReleaseMode

[GIN 调试]POST /api/pull --> github.com/ollama/ollama/server 的(服务器)。PullModelHandler-fm(5 个处理程序)[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(服务器)。GenerateHandler-fm (5 个处理程序) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(服务器)。ChatHandler-fm (5 个处理程序) [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(服务器)。EmbedHandler-fm(5 个处理程序) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(服务器)。EmbeddingsHandler-fm(5 个处理程序) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(服务器)。CreateModelHandler-fm (5 个处理程序) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(服务器)。PushModelHandler-fm(5 个处理程序) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(服务器)。CopyModelHandler-fm (5 个处理程序) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(服务器)。DeleteModelHandler-fm(5 个处理程序) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(服务器)。ShowModelHandler-fm (5 个处理程序) [GIN-debug] POST /api/blobs/:d igest --> github.com/ollama/ollama/server.(服务器)。CreateBlobHandler-fm(5 个处理程序) [GIN-debug] HEAD /api/blobs/:d igest --> github.com/ollama/ollama/server.(服务器)。HeadBlobHandler-fm(5 个处理程序)[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(服务器)。ProcessHandler-fm (5 个处理程序) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(服务器)。ChatHandler-fm (6 个处理程序) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(服务器)。GenerateHandler-fm (6 个处理程序) [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(服务器)。EmbedHandler-fm (6 个处理程序) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(服务器)。ListModelsHandler-fm (6 个处理程序) [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(服务器)。ShowModelHandler-fm (6 个处理程序) [GIN-debug] GET / --> github.com/ollama/ollama/server.(服务器)。GenerateRoutes.func1(5 个处理程序) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server。(服务器)。ListModelsHandler-fm (5 个处理程序) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(服务器)。GenerateRoutes.func2(5 个处理程序) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(服务器)。GenerateRoutes.func1(5 个处理程序) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(服务器)。ListModelsHandler-fm (5 个处理程序) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(服务器)。GenerateRoutes.func2 (5 个处理程序) time=2024-08-27T18:13:12.389-07:00 level=INFO source=routes.go:1172 msg=“正在侦听 127.0.0.1:11434(版本 0.3.6-ipexllm-20240827)” time=2024-08-27T18:13:12.389-07:00 level=INFO source=payload.go:44 msg=“动态 LLM 库 [cpu_avx cpu_avx2 cpu]” [GIN] 2024/08/27 - 18:15:08 |200 元 |0 秒 |127.0.0.1 版本 |头部 “/” [杜松子酒] 2024/08/27 - 18:15:08 |200 元 |1.1003 毫秒 |127.0.0.1 版本 |获取 “/api/tags” [GIN] 2024/08/27 - 18:15:23 |200 元 |0 秒 |127.0.0.1 版本 |头部 “/” [杜松子酒] 2024/08/27 - 18:15:23 |200 元 |3.5413 毫秒 |127.0.0.1 版本 |POST “/api/show” time=2024-08-27T18:15:23.811-07:00 level=INFO source=gpu.go:168 msg=“寻找兼容的 GPU” time=2024-08-27T18:15:23.816-07:00 level=INFO source=gpu.go:280 msg=“未发现兼容的 GPU” time=2024-08-27T18:15:23.822-07:00 level=INFO source=memory.go:309 msg=“卸载到 cpu” layers.requested=-1 layers.model=33 layers.offload=0 layers.split=“” memory.available=“[52.9 GiB]” memory.required.full=“4.6 GiB”memory.required.partial=“0 B” memory.required.kv=“2.5 GiB” memory.required.allocations=“[4.6 GiB]” memory.weights.total=“3.8 GiB” memory.weights.repeating=“3.7 GiB” memory.weights.nonrepeating=“102.8 MiB” memory.graph.full=“548.0 MiB” memory.graph.partial=“543.0 MiB” 
time=2024-08-27T18:15:23.826-07:00 level=INFO source=server.go:395 msg="starting llama server" cmd="C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\ArabTech.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 58132"
time=2024-08-27T18:15:23.827-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-08-27T18:15:23.827-07:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-08-27T18:15:23.828-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info build=1 commit="6f4ec98" tid="10164" timestamp=1724807723
INFO [wmain] system info n_threads=20 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="10164" timestamp=1724807723 total_threads=28
INFO [wmain] HTTP server listening hostname="127.0.0.1" n_threads_http="27" port="58132" tid="10164" timestamp=1724807723
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from C:\Users\ArabTech.ollama\models\blobs\sha256-04778965089b91318ad61d0995b7e44fad4b9a9f4e049d7be90932bf8812e828 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = Phi2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240
llama_model_loader: - kv 5: phi2.block_count u32 = 32
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 195 tensors
llama_model_loader: - type q4_0: 129 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 1.49 GiB (4.61 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
time=2024-08-27T18:15:24.083-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 1456.19 MiB
llm_load_tensors: SYCL_Host buffer size = 70.31 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
ID  Device Type         Name                    Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
 0  [level_zero:gpu:0]  Intel UHD Graphics 770  1.5      32                 512             32             31709M           1.3.30398
llama_kv_cache_init: SYCL0 KV buffer size = 2560.00 MiB
llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.82 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 603.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 21.01 MiB
llama_new_context_with_model: graph nodes = 1257
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded tid="10164" timestamp=1724807728
time=2024-08-27T18:15:28.364-07:00 level=INFO source=server.go:634 msg="llama runner started in 4.54 seconds"
[GIN] 2024/08/27 - 18:15:28 200 4.5565764s 127.0.0.1 POST "/api/chat"
INFO [print_timings] prompt eval time = 752.82 ms / 39 tokens ( 19.30 ms per token, 51.81 tokens per second) n_prompt_tokens_processed=39 n_tokens_second=51.80555647801916 slot_id=0 t_prompt_processing=752.815 t_token=19.30294871794872 task_id=4 tid="10164" timestamp=1724807876
INFO [print_timings] generation eval time = 8637.51 ms / 77 runs ( 112.18 ms per token, 8.91 tokens per second) n_decoded=77 n_tokens_second=8.914607209092344 slot_id=0 t_token=112.17544155844156 t_token_generation=8637.509 task_id=4 tid="10164" timestamp=1724807876
INFO [print_timings] total time = 9390.32 ms slot_id=0 t_prompt_processing=752.815 t_token_generation=8637.509 t_total=9390.324 task_id=4 tid="10164" timestamp=1724807876
[GIN] 2024/08/27 - 18:17:56 200 9.3988916s 127.0.0.1 POST "/api/chat"
[GIN] 2024/08/27 - 18:18:34 200 0s 127.0.0.1 HEAD "/"
[GIN] 2024/08/27 - 18:18:34 200 9.3004ms 127.0.0.1 POST "/api/show"
time=2024-08-27T18:18:34.995-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=32 layers.model=33 layers.offload=0 layers.split="" memory.available="[47.9 GiB]" memory.required.full="7.5 GiB" memory.required.partial="0 B" memory.required.kv="256.0 MiB" memory.required.allocations="[7.5 GiB]" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-08-27T18:18:35.001-07:00 level=INFO source=server.go:395 msg="starting llama server" cmd="C:\Users\ArabTech\Desktop\1\ipex-llm\dist\windows-amd64\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\ArabTech.ollama\models\blobs\sha256-20ee18469ac48c875af10c8f970b0a5371c73c7109bfdd3835615777f75bf26b --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 58198"
time=2024-08-27T18:18:35.002-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
time=2024-08-27T18:18:35.002-07:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-08-27T18:18:35.002-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info build=1 commit="6f4ec98" tid="16608" timestamp=1724807915
INFO [wmain] system info n_threads=20 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0
INFO [wmain] HTTP server listening hostname="127.0.0.1" n_threads_http="27" port="58198" tid="16608" timestamp=1724807915
llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from C:\Users\ArabTech.ollama\models\blobs\sha256-20ee18469ac48c875af10c8f970b0a5371c73c7109bfdd3835615777f75bf26b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8b Instruct
llama_model_loader: - kv 3: general.version str = V2
llama_model_loader: - kv 4: general.organization str = Unsloth
llama_model_loader: - kv 5: general.finetune str = instruct
llama_model_loader: - kv 6: general.basename str = meta-llama-3.1
llama_model_loader: - kv 7: general.size_label str = 8B
llama_model_loader: - kv 8: general.license str = llama3.1
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 7
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens cache size = 256
time=2024-08-27T18:18:35.253-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8b Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 7605.34 MiB
llm_load_tensors: SYCL_Host buffer size = 532.31 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
ID  Device Type         Name                    Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
 0  [level_zero:gpu:0]  Intel UHD Graphics 770  1.5      32                 512             32             31709M           1.3.30398
llama_kv_cache_init: SYCL0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 2.02 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 258.50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1062
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded tid="16608" timestamp=1724807924
time=2024-08-27T18:18:44.565-07:00 level=INFO source=server.go:634 msg="llama runner started in 9.56 seconds"
[GIN] 2024/08/27 - 18:18:44 200 9.5933075s 127.0.0.1 POST "/api/chat"
[GIN] 2024/08/27 - 18:21:10 200 2m2s 127.0.0.1 POST "/api/chat"

Thanks for your reply. I have since followed other deployment methods and can now run the models correctly on the GPU.
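
For anyone hitting the same symptom (the SYCL device table listing only the UHD Graphics iGPU rather than the discrete Arc card), here is a minimal sketch for checking which XPU devices PyTorch/IPEX-LLM can actually see. It assumes the `llm` conda environment from the quickstart with `ipex-llm[xpu]` installed; it is only a diagnostic illustration, not part of the official setup steps.

```python
# Minimal XPU visibility check (assumes ipex-llm[xpu] / intel_extension_for_pytorch is installed)
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- importing this registers the 'xpu' device with PyTorch

print("XPU available:", torch.xpu.is_available())
for i in range(torch.xpu.device_count()):
    # If only the UHD iGPU is printed here, the driver/oneAPI runtime is not
    # exposing the discrete GPU, and Ollama/llama.cpp will not be able to use it either.
    print(i, torch.xpu.get_device_name(i))
```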