labring / FastGPT

FastGPT is a knowledge-based platform built on LLMs that offers a comprehensive suite of out-of-the-box capabilities such as data processing, RAG retrieval, and visual AI workflow orchestration, letting you easily develop and deploy complex question-answering systems without extensive setup or configuration.
https://tryfastgpt.ai

The document parses correctly, but none of 10 AI models will summarize it (the document content is fairly long) #2759

Open SDAIer opened 1 week ago

SDAIer commented 1 week ago

4.8.10-fix (to avoid character escaping, < and > are written as 《 and 》 in the content below)

In one AI application I use 10 different models (deployed locally via ollama). A PDF is parsed by the document-parsing module and the AI is then asked to [summarize the document content]. The document-parsing module extracts the content correctly, but every model replies that it has not received any document content. The debug information below shows that each AI already has the document content wrapped inside 《quote></quote》. (If the document content is reduced to roughly 1,000 characters, it works.) (screenshot)

If only one or two models misbehaved it might be a model problem, but all 10 different models behave exactly the same way.

The replies from 5 of the AIs are as follows:

Reference A (glm4:9b, qwen2.5:32b): Please provide the document content or specific information you would like me to summarize, along with your question, so that I can help you more accurately.

Reference B (llama3.1:8b, qwen2.5:14b): Hello! What would you like me to summarize? Please provide the specific information or document content so I can help you better. If there is a specific 《Quote></Quote》 reference that was provided earlier, please include it as well.

Reference C (llava:latest, gemma2:27b): Understood! Like a document-analysis expert, I will carefully read the text you provide and focus on the content inside the 《Quote></Quote》 tags. Then, based on your question, I will extract the relevant information from the quoted section and give a clear, concise answer. Feel free to ask me questions and I will do my best to answer!

Reference D (llama-3-8b-gpt-4o-ru1.0, phi3.5): Based on the instructions provided, I will identify the important information in the text and answer the user's question in a focused and clear way. For example: Text: 《Quote>"Global warming is causing sea levels to rise, which is a major risk for cities located in low-lying areas or near the coast." - environmental expert. If the user asks "Where will global warming have the most significant impact?", I would answer: given that global warming is causing sea levels to rise, cities in low-lying...

Reference E (gemma2:9b, mistral-small): OK, please provide the document you need analyzed or the specific user question, and I will do my best to help.

If only the document is supplied, without any question, the AI does return information about the document.

(screenshot) The DEBUG information is as follows: System: 1. You are a document-analysis expert who can extract information relevant to the user's question from documents. 2. Answer the user's question based on the content referenced in 《Quote></Quote》.

Use the content inside 《Quote></Quote》 as your knowledge: 《Quote> File: 600007财报.pdf 《Content> China World Trade Center Co., Ltd. 2024 Semi-Annual Report. Stock code: 600007; stock short name: 中国国贸 (China World Trade Center). Important notice: 1. The company's board of directors, supervisory board, directors, supervisors and senior management guarantee that the contents of this semi-annual report are true, accurate and complete, contain no false records, misleading statements or material omissions, and assume individual and joint legal liability. 2. All of the company's directors attended the board meeting. 3. The company's semi-annual financial report has been reviewed by PricewaterhouseCoopers Zhong Tian LLP (special general partnership) and is unaudited. ...... (to keep this from getting too long, several thousand characters are omitted here) ...... 8. Weighted-average return on net assets and earnings per share: for net profit attributable to ordinary shareholders, return on net assets was 7.16% (Jan-Jun 2024) vs 7.08% (Jan-Jun 2023), basic EPS 0.68 vs 0.65, diluted EPS 0.68 vs 0.65; after deducting non-recurring gains and losses, return on net assets was 7.02% vs 7.01%, basic EPS 0.67 vs 0.64, diluted EPS 0.67 vs 0.64. 《/Content> 《/Quote>
Human: Summarize the content
AI: Please provide the document content you would like me to summarize, or specify a concrete question, so that I can help you more accurately. Without specific document content or a question I cannot perform an effective analysis or summary. You can place the relevant content inside the tags.

lijiajun1997 commented 1 week ago

This is an issue with the local model deployment: the System prompt is hard-coded and this field is not passed through.

SDAIer commented 1 week ago

Could you explain in more detail? I don't follow. What do you mean by the local model's System prompt being hard-coded? Doesn't the debug information above already include the parsed document content?

After further testing:
1. If the document content is kept within about 1,800 characters, every model recognizes it.
2. If it exceeds 1,800 characters, none of them do.
3. If only the document is uploaded, with no question, the AI recognizes the content on its own.

I tried the model's maxContext, maxResponse, and quoteMaxToken anywhere from 4000 to 30000 with no effect:

{
  "model": "qwen2.5:32b",
  "name": "localNet-V-qwen2.5:32b",
  "avatar": "/imgs/model/openai.svg",
  "maxContext": 4000,
  "maxResponse": 4000,
  "quoteMaxToken": 4000,
  "maxTemperature": 1.2,
  "charsPointsPrice": 0,
  "censor": false,
  "vision": false,
  "datasetProcess": true,
  "usedInClassify": true,
  "usedInExtractFields": true,
  "usedInToolCall": false,
  "usedInQueryExtension": true,
  "toolChoice": false,
  "functionCall": false,
  "customCQPrompt": "",
  "customExtractPrompt": "",
  "defaultSystemChatPrompt": "",
  "defaultConfig": {}
},

SDAIer commented 1 week ago

@c121914yu

SDAIer commented 1 week ago

Further test result: I removed the AI node's links to the document quote and the user question, and instead concatenated the parsed document text and the user question into a single piece of text. The models can then recognize the content. But this approach does not work well: for example, it cannot reliably identify the names of multiple attachments, nor compare the contents of several attachments against each other.

Could someone please help analyze the cause? (screenshot)

SDAIer commented 1 week ago

Public-cloud models have no problem at all; the locally deployed ollama models simply don't work. I compared the debug information for the two and it is identical, so why does the local model claim that no document was found once the content exceeds about 1,000 characters? (screenshot)


c121914yu commented 1 week ago

If it works in the cloud but not locally, the deployment tool may not support the system role, or the system message may be getting filtered out. You can directly debug the data the deployment tool actually receives.
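One rough way to do that check with an ollama backend (assuming a default systemd install listening on port 11434): turn on ollama's debug logging, and separately send a request with a long system message straight to its OpenAI-compatible endpoint, bypassing FastGPT, to see whether the system content survives. This is only a sketch; the exact prompt text would be whatever FastGPT's debug view shows.

# Enable verbose logging for the ollama service (OLLAMA_DEBUG appears in the env map
# in the logs below); it should then log more detail about each incoming request.
sudo systemctl edit ollama        # add: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f

# Send a long system message directly to ollama's OpenAI-compatible endpoint.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma2:27b",
        "messages": [
          {"role": "system", "content": "<paste the full system prompt with the 《Quote》 content here>"},
          {"role": "user", "content": "总结内容"}
        ]
      }'

If the direct request also loses the document, the problem is on the ollama side rather than in FastGPT's workflow.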

c121914yu commented 1 week ago

> Is the debug information posted above not enough? What other information do you need — from xf? All of it is passed into fastgpt, and the local and cloud debug information is identical.


That is only the information fastgpt sends out; xf does not necessarily receive all of it.

SDAIer commented 6 days ago

Take gemma-2-27b as an example (the other models show the same symptom): a document of more than 3,000 characters was uploaded, and the debug output shows the AI does reference the document-parsing module's content, yet the AI still replies that no document has been uploaded.

"model": "gemma2:27b", "name": "localNet-ollama-gemma2:27b", "avatar": "/imgs/model/openai.svg", "maxContext": 8000, "maxResponse": 8000, "quoteMaxToken":5000,

Or changed to the following; the result is the same either way:

"model": "gemma2:27b", "name": "localNet-ollama-gemma2:27b", "avatar": "/imgs/model/openai.svg", "maxContext": 120000, "maxResponse": 10000, "quoteMaxToken":5000,

ollama debug log:

Key line: 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200"

The --ctx-size 8192 above shows that the values in config.json did not take effect; this may be why the input ends up exceeding the model's default 8192 limit.
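For reference, in case it helps the analysis: ollama's per-request context length is governed by its own num_ctx parameter (default 2048 tokens), and the --ctx-size in the runner command above looks like num_ctx multiplied by --parallel (2048 × 4 = 8192 here), so FastGPT's maxContext/quoteMaxToken settings cannot raise it. If the rendered prompt is longer than num_ctx, ollama truncates it on its side, which would be consistent with the document disappearing past roughly 1,800 characters. A minimal sketch of raising num_ctx through a Modelfile (the tag name and value below are illustrative, not from this thread):

# Modelfile: build a gemma2 variant with a larger per-request context window.
# gemma2's native training context is 8192, so going above that is unlikely to help.
FROM gemma2:27b
PARAMETER num_ctx 8192

# Create the new tag, then point the "model" field in FastGPT's config.json at it:
ollama create gemma2-27b-8k -f Modelfile

If the parallel-multiplication reading is right, the runner log for the new tag should then show a correspondingly larger --ctx-size.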

9月 22 13:35:58 gpu systemd[1]: Stopping Ollama Service... 9月 22 13:36:01 gpu systemd[1]: Stopped Ollama Service. 9月 22 13:36:01 gpu systemd[1]: Started Ollama Service. 9月 22 13:36:01 gpu ollama[50713]: 2024/09/22 13:36:01 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" 9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.944+08:00 level=INFO source=images.go:753 msg="total blobs: 34" 9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.946+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" 9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.947+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" 9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.950+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2397062001/runners 9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]" 9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" 9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB" 9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB" 9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB" 9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB" 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB" 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO 
source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB" 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.573+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB" 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200" 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.593+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" 9月 22 13:36:33 gpu ollama[50713]: INFO [main] build info | build=10 commit="9225b05" tid="140300405710848" timestamp=1726983393 9月 22 13:36:33 gpu ollama[50713]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140300405710848" timestamp=1726983393 total_threads=64 9月 22 13:36:33 gpu ollama[50713]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="38200" tid="140300405710848" timestamp=1726983393 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest)) 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 0: general.architecture str = gemma2 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 11: general.file_type u32 = 2 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.846+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ... 9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... 
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type f32: 185 tensors 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q4_0: 322 tensors 9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q6_K: 1 tensors 9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: special tokens cache size = 108 9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: token to piece cache size = 1.6014 MB 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: format = GGUF V3 (latest) 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: arch = gemma2 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab type = SPM 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_vocab = 256000 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_merges = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab_only = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_train = 8192 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd = 4608 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_layer = 46 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head = 32 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head_kv = 16 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_rot = 128 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_swa = 4096 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_k = 128 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_v = 128 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_gqa = 2 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_k_gqa = 2048 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_v_gqa = 2048 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_eps = 0.0e+00 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_logit_scale = 0.0e+00 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ff = 36864 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert_used = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: causal attn = 1 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: pooling type = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope type = 2 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope scaling = linear 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_base_train = 10000.0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_scale_train = 1 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_orig_yarn = 8192 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope_finetuned = unknown 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_conv = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_inner = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_state = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_rank = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_b_c_rms = 0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model type = 27B 9月 22 13:36:34 gpu 
ollama[50713]: llm_load_print_meta: model ftype = Q4_0 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model params = 27.23 B 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW) 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: general.name = gemma-2-27b-it 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: BOS token = 2 '' 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOS token = 1 '' 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: UNK token = 3 '' 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: PAD token = 0 '' 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: LF token = 227 '<0x0A>' 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOT token = 107 '' 9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: max token length = 93 9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: found 1 CUDA devices: 9月 22 13:36:34 gpu ollama[50713]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes 9月 22 13:36:34 gpu ollama[50713]: llm_load_tensors: ggml ctx size = 0.45 MiB 9月 22 13:36:35 gpu ollama[50713]: time=2024-09-22T13:36:35.304+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding" 9月 22 13:36:36 gpu ollama[50713]: time=2024-09-22T13:36:36.592+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading 46 repeating layers to GPU 9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading non-repeating layers to GPU 9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloaded 47/47 layers to GPU 9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CPU buffer size = 922.85 MiB 9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB 9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.309+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding" 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ctx = 8192 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_batch = 512 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ubatch = 512 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: flash_attn = 0 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_base = 10000.0 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_scale = 1 9月 22 13:36:39 gpu ollama[50713]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph nodes = 1850 9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph splits = 2 9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.562+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 9月 22 
13:36:39 gpu ollama[50713]: INFO [main] model loaded | tid="140300405710848" timestamp=1726983399 9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.814+08:00 level=INFO source=server.go:626 msg="llama runner started in 6.22 seconds" 9月 22 13:36:42 gpu ollama[50713]: [GIN] 2024/09/22 - 13:36:42 | 200 | 10.06400965s | 172.16.1.219 | POST "/v1/chat/completions" 9月 22 13:36:46 gpu systemd[1]: Stopping Ollama Service... 9月 22 13:36:47 gpu systemd[1]: Stopped Ollama Service. 9月 22 13:36:47 gpu systemd[1]: Started Ollama Service. 9月 22 13:36:47 gpu ollama[50857]: 2024/09/22 13:36:47 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost: http://127.0.0.1 https://127.0.0.1 http://127.0.0.1: https://127.0.0.1: http://0.0.0.0 https://0.0.0.0 http://0.0.0.0: https://0.0.0.0: app:// file:// tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" 9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.610+08:00 level=INFO source=images.go:753 msg="total blobs: 34" 9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.613+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" 9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.614+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" 9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.616+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2548666145/runners 9月 22 13:37:02 gpu ollama[50857]: time=2024-09-22T13:37:02.902+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]" 9月 22 13:37:02 gpu ollama[50857]: time=2024-09-22T13:37:02.902+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" 9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB" 9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB" 9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB" 9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB" 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=sched.go:714 
msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB" 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB" 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.160+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB" 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.176+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2548666145/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 42032" 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.177+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.178+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.178+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" 9月 22 13:37:53 gpu ollama[50857]: INFO [main] build info | build=10 commit="9225b05" tid="140713765588992" timestamp=1726983473 9月 22 13:37:53 gpu ollama[50857]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140713765588992" timestamp=1726983473 total_threads=64 9月 22 13:37:53 gpu ollama[50857]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="42032" tid="140713765588992" timestamp=1726983473 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest)) 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 0: general.architecture str = gemma2 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 11: general.file_type u32 = 2 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default 9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.431+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ... 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... 
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type f32: 185 tensors 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type q4_0: 322 tensors 9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type q6_K: 1 tensors 9月 22 13:37:53 gpu ollama[50857]: llm_load_vocab: special tokens cache size = 108 9月 22 13:37:53 gpu ollama[50857]: llm_load_vocab: token to piece cache size = 1.6014 MB 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: format = GGUF V3 (latest) 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: arch = gemma2 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: vocab type = SPM 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_vocab = 256000 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_merges = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: vocab_only = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ctx_train = 8192 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd = 4608 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_layer = 46 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_head = 32 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_head_kv = 16 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_rot = 128 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_swa = 4096 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_head_k = 128 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_head_v = 128 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_gqa = 2 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_k_gqa = 2048 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_v_gqa = 2048 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_norm_eps = 0.0e+00 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_logit_scale = 0.0e+00 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ff = 36864 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_expert = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_expert_used = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: causal attn = 1 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: pooling type = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope type = 2 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope scaling = linear 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: freq_base_train = 10000.0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: freq_scale_train = 1 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ctx_orig_yarn = 8192 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope_finetuned = unknown 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_conv = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_inner = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_state = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_dt_rank = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_dt_b_c_rms = 0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model type = 27B 9月 22 13:37:53 gpu 
ollama[50857]: llm_load_print_meta: model ftype = Q4_0 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model params = 27.23 B 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW) 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: general.name = gemma-2-27b-it 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: BOS token = 2 '' 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: EOS token = 1 '' 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: UNK token = 3 '' 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: PAD token = 0 '' 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: LF token = 227 '<0x0A>' 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: EOT token = 107 '' 9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: max token length = 93 9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: found 1 CUDA devices: 9月 22 13:37:53 gpu ollama[50857]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes 9月 22 13:37:54 gpu ollama[50857]: llm_load_tensors: ggml ctx size = 0.45 MiB 9月 22 13:37:54 gpu ollama[50857]: time=2024-09-22T13:37:54.888+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding" 9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloading 46 repeating layers to GPU 9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloading non-repeating layers to GPU 9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloaded 47/47 layers to GPU 9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: CPU buffer size = 922.85 MiB 9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB 9月 22 13:37:56 gpu ollama[50857]: time=2024-09-22T13:37:56.042+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_ctx = 8192 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_batch = 512 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_ubatch = 512 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: flash_attn = 0 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: freq_base = 10000.0 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: freq_scale = 1 9月 22 13:37:58 gpu ollama[50857]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: graph nodes = 1850 9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: graph splits = 2 9月 22 13:37:58 gpu ollama[50857]: INFO [main] model loaded | tid="140713765588992" timestamp=1726983478 9月 22 13:37:58 gpu ollama[50857]: time=2024-09-22T13:37:58.556+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.38 seconds" 9月 22 13:38:00 gpu ollama[50857]: [GIN] 2024/09/22 - 13:38:00 | 200 | 9.160086456s | 172.16.1.219 | POST 
"/v1/chat/completions" (base) [root@gpu ~]#