SDAIer commented 3 weeks ago

fastgpt 4.8.10-fix oneapi0.6.7

模型："glm4:9b 相关信息

fastgpt config.json

ollama debug

根据ollama debug信息，发现fastgpt config.json中设置的maxContext and maxResponse没有生效，所以导致内容比较多的文档，AI虽然可以已通过文档解析模块获取文档的内容，但是最终AI回复找不到文档。

9月 22 15:46:03 gpu ollama[57349]: time=2024-09-22T15:46:03.116+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2118317042/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 38148"

ollama bebug 信息中：正在运行4个服务线程（--parallel 4），并且总上下文大小为8k（--ctx-size 8192），因此每个请求都使用默认的上下文窗口2048个令牌。config.json中maxContext和maxResponse所做的配置好像与ollama没有匹配上。

fastgpt config.json中设置的maxContext and maxResponse这两个参数，对应的ollama配置元素应该是是num_ctx和num_predict。 ollama 参数URL https://github.com/ollama/ollama/blob/ad935f45ac19a8ba090db32580f3a6469e9858bb/docs/modelfile.md#valid-parameters-and-values

num_ctx：Sets the size of the context window used to generate the next token. (Default: 2048) num_predict:Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)

并且，AI debug信息中看到已经获取到引用的信息，但是AI提示没有内容引用。是不是与模型的stop参数有关?

如果只上传附件，没有输入问题，模型可以输出总结内容

测试2：

如果用一个小一点的文件（例如1000个字符），fastgpt通过oneapi调用olalma模型可以正常按照指令回复文档中的内容。

如果将文档内容增加，包含9576个字符。效果就完全不一样：

场景1：fastgpt通过oneapi调用ollama模型进行回复（完全没有按照问题的指令回复）

config.json

ollama debug：（注意ctx-size 8192，--parallel 4 。即默认的num_ctx2048*parallel4《参考：num_ctx Sets the size of the context window used to generate the next token. (Default: 2048)》）

9月 23 12:20:58 gpu ollama[38429]: time=2024-09-23T12:20:58.741+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1515539425/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 44108"

场景2：fastgpt不用onepia调用，而是用http直接调用ollama模型回复，并制定了num_ctx:10000（虽然回复的答案不见得正确，但只是按照问题的指令执行）

ollama debug（--ctx-size 40000， --parallel 4 ，即根据设置的num_ctx10000*4个并发）

9月 23 12:17:12 gpu ollama[38429]: time=2024-09-23T12:17:12.966+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1515539425/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 40000 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 39632"

c121914yu commented 3 weeks ago

gpt 接口里没有 maxContext，只有 max_tokens（maxResponse)，而且不取决于你的配置，取决于你在页面上选择。这两个参数只是UI和过滤上用到，max_tokens 与 maxResponse和messages 的 token 有关。想测试，最简单的直接自己写个 completions 接口，打印 body 就知道了。

SDAIer commented 3 weeks ago

多谢回复。既然接口里没有maxContent，那么配置里和ui上这个参数有什么作用？

另外，根据我上面的测试情况，既然模型可以接收比较大的上下文，为什么文档内容比较多时模型处理不了（而且没有超过模型的最大上下文限制），这个该怎么处理

---原始邮件--- 发件人: @.> 发送时间: 2024年9月26日(周四) 中午11:24 收件人: @.>; 抄送: @.**@.>; 主题: Re: [labring/FastGPT] fastgpt---ollama参数没生效 (Issue #2770)

gpt 接口里没有 maxContext，只有 max_tokens（maxResponse)，而且不取决于你的配置，取决于你在页面上选择。 image.png (view on web)

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

c121914yu commented 3 weeks ago

多谢回复。既然接口里没有maxContent，那么配置里和ui上这个参数有什么作用？另外，根据我上面的测试情况，既然模型可以接收比较大的上下文，为什么文档内容比较多时模型处理不了（而且没有超过模型的最大上下文限制），这个该怎么处理 … ---原始邮件--- 发件人: @.> 发送时间: 2024年9月26日(周四) 中午11:24 收件人: @.>; 抄送: @.**@.>; 主题: Re: [labring/FastGPT] fastgpt---ollama参数没生效 (Issue #2770) gpt 接口里没有 maxContext，只有 max_tokens（maxResponse)，而且不取决于你的配置，取决于你在页面上选择。 image.png (view on web) — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

不是很懂，我用的在线模型都是可以的。你可以看看 role=system有没有生效

SDAIer commented 3 weeks ago

请问大神咋看role=system是否生效。

在线模型的确没有问题。

但是本地ollama和xf模型的确是存在这个问题，我如果直接用http手动调用ollama模型并指定参数num_ctx:10000就可以。

现在这个问题导致本地类似合同审核或者财报解读等场景，由于文档内容比较大没法使用本地模型来处理

上面的各种测试情况麻烦大神再看看。ui里面对应着congfig.json设置了。现在问题就是如何能将最大上下文知指定的数值传递给ollama

SDAIer commented 3 weeks ago

2780

ollama和xf一样的问题，都是上下文长度问题，如何将gpt设置的上下文数量传递给本地模型，而不是使用他们默认的一个娇小的数量

labring / FastGPT

fastgpt---ollama参数没生效 #2770

模型："glm4:9b 相关信息

fastgpt config.json

ollama debug

测试2：

场景1：fastgpt通过oneapi调用ollama模型进行回复（完全没有按照问题的指令回复）

config.json

ollama debug：（注意ctx-size 8192，--parallel 4 。即默认的num_ctx2048*parallel4《参考：num_ctx Sets the size of the context window used to generate the next token. (Default: 2048)》）

场景2：fastgpt不用onepia调用，而是用http直接调用ollama模型回复，并制定了num_ctx:10000（虽然回复的答案不见得正确，但只是按照问题的指令执行）

ollama debug（--ctx-size 40000， --parallel 4 ，即根据设置的num_ctx10000*4个并发）

2780

labring / FastGPT

fastgpt---ollama参数没生效 #2770

模型："glm4:9b 相关信息

fastgpt config.json

ollama debug

测试2：

场景1：fastgpt通过oneapi调用ollama模型进行回复（完全没有按照问题的指令回复）

config.json

ollama debug：（注意ctx-size 8192，--parallel 4 。即默认的num_ctx2048*parallel4《参考：num_ctx Sets the size of the context window used to generate the next token. (Default: 2048)》 ）

场景2：fastgpt不用onepia调用，而是用http直接调用ollama模型回复，并制定了num_ctx:10000（虽然回复的答案不见得正确，但只是按照问题的指令执行）

ollama debug（--ctx-size 40000， --parallel 4 ，即根据设置的num_ctx10000*4个并发）

2780

ollama debug：（注意ctx-size 8192，--parallel 4 。即默认的num_ctx2048*parallel4《参考：num_ctx Sets the size of the context window used to generate the next token. (Default: 2048)》）