labring / FastGPT

FastGPT is a knowledge-based platform built on LLMs that offers a comprehensive suite of out-of-the-box capabilities, such as data processing, RAG retrieval, and visual AI workflow orchestration, letting you easily develop and deploy complex question-answering systems without extensive setup or configuration.
https://tryfastgpt.ai

xinference / qwen2 tool calls: the model's thought process is output in the conversation #1811

Closed JinCheng666 closed 3 months ago

JinCheng666 commented 4 months ago

Routine checks

Your version

Problem description, log screenshots

Qwen2-72B is deployed with xinference and connected to FastGPT through OneAPI. When using Qwen's tool-call feature, the LLM's thought process is printed in the conversation. I would like the thought process not to be output.

(screenshots of the conversation output)

Related issue: https://github.com/labring/FastGPT/issues/1668
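
For reference, the behaviour can be reproduced with a plain OpenAI-compatible request against the OneAPI gateway (a minimal sketch only; the base URL, API key, and the get_current_time tool below are placeholders, not values taken from this report):

from openai import OpenAI

# Placeholder endpoint/key: OneAPI proxies the request on to xinference.
client = OpenAI(base_url="http://localhost:3001/v1", api_key="sk-placeholder")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "获取用户当前时区的时间。",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="qwen:72b",
    messages=[{"role": "user", "content": "现在的时间"}],
    tools=tools,
    tool_choice="auto",
)

msg = resp.choices[0].message
# On the affected versions, msg.content carries the ReAct "Thought: ..." text
# in addition to msg.tool_calls, and that text is what FastGPT shows in the chat.
print(msg.content)
print(msg.tool_calls)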

Version info:

xinference:0.12.1 fastgpt:4.8.4-fix oneapi:0.5.10 qwen2:Qwen2-72B-Instruct-GPTQ-Int4

config.json

{
  "model": "qwen:72b",
  "name": "qwen:72b",
  "maxContext": 32000,
  "avatar": "/imgs/model/qwen.svg",
  "maxResponse": 6000,
  "quoteMaxToken": 13000,
  "maxTemperature": 1.2,
  "charsPointsPrice": 0,
  "censor": false,
  "vision": false,
  "datasetProcess": false,
  "usedInClassify": true,
  "usedInExtractFields": true,
  "usedInToolCall": true,
  "usedInQueryExtension": true,
  "toolChoice": true,
  "functionCall": true,
  "customCQPrompt": "",
  "customExtractPrompt": "",
  "defaultSystemChatPrompt": "",
  "defaultConfig": {}
}

xinference log:

2024-06-21 10:26:12,219 xinference.core.supervisor 6285 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {}
2024-06-21 10:26:12,219 xinference.core.worker 6285 DEBUG    Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {'model_uid': 'qwen:72b-1-0'}
2024-06-21 10:26:12,219 xinference.core.worker 6285 DEBUG    Leave get_model, elapsed time: 0 s
2024-06-21 10:26:12,219 xinference.core.supervisor 6285 DEBUG    Leave get_model, elapsed time: 0 s
2024-06-21 10:26:12,220 xinference.core.supervisor 6285 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {}
2024-06-21 10:26:12,220 xinference.core.worker 6285 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {'model_uid': 'qwen:72b-1-0'}
2024-06-21 10:26:12,220 xinference.core.worker 6285 DEBUG    Leave describe_model, elapsed time: 0 s
2024-06-21 10:26:12,220 xinference.core.supervisor 6285 DEBUG    Leave describe_model, elapsed time: 0 s
2024-06-21 10:26:12,222 xinference.core.model 145584 DEBUG    Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7f4fdbf5f240>, '现在的时间', '', [], {'temperature': 0.0, 'tool_choice': 'auto', 'tools': <list_iterator object at 0x7f44cc195840>, 'stream': True}), kwargs: {}
2024-06-21 10:26:12,222 xinference.core.model 145584 DEBUG    Request chat, current serve request count: 0, request limit: None for the model qwen:72b
2024-06-21 10:26:12,222 xinference.model.llm.vllm.core 145584 DEBUG    Enter generate, prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Answer the following questions as best you can. You have access to the following APIs:

oWmk9Z: Call this tool to interact with the oWmk9Z API. What is the oWmk9Z API useful for? 获取用户当前时区的时间。 Parameters: [] Format the arguments as a JSON object.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [oWmk9Z]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: 现在的时间<|im_end|>
<|im_start|>assistant
, generate config: {'temperature': 0.0, 'tool_choice': 'auto', 'stream': True, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
2024-06-21 10:26:12,223 xinference.core.model 145584 DEBUG    After request chat, current serve request count: 0 for the model qwen:72b
2024-06-21 10:26:12,223 xinference.core.model 145584 DEBUG    Leave wrapped_func, elapsed time: 0 s
2024-06-21 10:26:23,234 xinference.model.llm.utils 145584 DEBUG    Tool call content: 需要获取当前时间,但是这需要调用API或者使用时间工具,但是提供的工具只有oWmk9Z可以获取用户当前时区的时间,可以尝试调用。, func: oWmk9Z, args: {}
2024-06-21 10:26:23,326 xinference.core.supervisor 6285 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {}
2024-06-21 10:26:23,326 xinference.core.worker 6285 DEBUG    Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {'model_uid': 'qwen:72b-1-0'}
2024-06-21 10:26:23,327 xinference.core.worker 6285 DEBUG    Leave get_model, elapsed time: 0 s
2024-06-21 10:26:23,327 xinference.core.supervisor 6285 DEBUG    Leave get_model, elapsed time: 0 s
2024-06-21 10:26:23,327 xinference.core.supervisor 6285 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {}
2024-06-21 10:26:23,327 xinference.core.worker 6285 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {'model_uid': 'qwen:72b-1-0'}
2024-06-21 10:26:23,327 xinference.core.worker 6285 DEBUG    Leave describe_model, elapsed time: 0 s
2024-06-21 10:26:23,328 xinference.core.supervisor 6285 DEBUG    Leave describe_model, elapsed time: 0 s
2024-06-21 10:26:23,329 xinference.core.model 145584 DEBUG    Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7f4fdbf5f240>, '<TOOL>', '', [{'role': 'user', 'content': '现在的时间'}, {'role': 'assistant', 'tool_calls': [{'id': 'eebb3267-7a09-49d9-9289-48d79752c061', 'type': 'function', 'function': {'name': 'oWmk9Z', 'arguments': '{}'}}]}, {'tool_call_id': 'eebb3267-7a09-49d9-9289-48d79752c061', 'role': 'tool', 'name': 'oWmk9Z', 'content': '{\n  "time": "2024-06-21 10:26:18 Friday"\n}'}], {'temperature': 0.0, 'tool_choice': 'auto', 'tools': <list_iterator object at 0x7f44cc12bbe0>, 'stream': True}), kwargs: {}
2024-06-21 10:26:23,329 xinference.core.model 145584 DEBUG    Request chat, current serve request count: 0, request limit: None for the model qwen:72b
2024-06-21 10:26:23,329 xinference.model.llm.vllm.core 145584 DEBUG    Enter generate, prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Answer the following questions as best you can. You have access to the following APIs:

oWmk9Z: Call this tool to interact with the oWmk9Z API. What is the oWmk9Z API useful for? 获取用户当前时区的时间。 Parameters: [] Format the arguments as a JSON object.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [oWmk9Z]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: 现在的时间<|im_end|>
<|im_start|>assistant
Thought: I can use oWmk9Z.
Action: oWmk9Z
Action Input: {}<|im_end|>
<|im_start|>function
Observation: {
  "time": "2024-06-21 10:26:18 Friday"
}<|im_end|>
<|im_start|>assistant
, generate config: {'temperature': 0.0, 'tool_choice': 'auto', 'stream': True, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
2024-06-21 10:26:23,329 xinference.core.model 145584 DEBUG    After request chat, current serve request count: 0 for the model qwen:72b
2024-06-21 10:26:23,330 xinference.core.model 145584 DEBUG    Leave wrapped_func, elapsed time: 0 s
2024-06-21 10:26:26,251 xinference.model.llm.utils 145584 DEBUG    Tool call content: 现在的时间是2024年6月21日星期五的10:26:18。, func: None, args: None
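
The log above shows where the extra text comes from: xinference renders the tool schema into a ReAct-style prompt (Thought / Action / Action Input / Observation), and the parsed "Thought" text is returned as ordinary message content alongside the extracted function call, so FastGPT displays it. Until the backend strips it, one client-side workaround is to filter the reasoning out of the reply before rendering it (a heuristic sketch, not xinference's actual parser; the regular expressions are assumptions):

import re

def strip_react_thought(content: str) -> str:
    """Drop ReAct bookkeeping (Thought/Action/Observation lines) from a reply
    and keep only the user-facing answer. Heuristic sketch, not xinference code."""
    if not content:
        return content
    # If a "Final Answer:" section exists, keep only what follows it.
    m = re.search(r"Final Answer:\s*(.*)", content, flags=re.S | re.I)
    if m:
        return m.group(1).strip()
    # Otherwise drop any Thought/Action/Action Input/Observation lines.
    kept = [
        line for line in content.splitlines()
        if not re.match(r"\s*(Thought|Action|Action Input|Observation)\s*:", line)
    ]
    return "\n".join(kept).strip()

# Example with the kind of content seen in the log above.
print(strip_react_thought(
    "Thought: I can use oWmk9Z.\n现在的时间是2024年6月21日星期五的10:26:18。"
))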

After changing the config file to toolChoice: false and functionCall: true, tools cannot be called at all.

(screenshot)
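
For context on these two switches: as I understand the FastGPT model config, toolChoice enables the newer OpenAI-style tools/tool_choice request fields, while functionCall falls back to the legacy functions/function_call fields, which only work if the backend still accepts them (not confirmed for xinference 0.12.x here). The two request shapes differ only in parameter names; the get_current_time definition below is a placeholder:

# Newer convention (toolChoice): "tools" + "tool_choice".
tools_request = {
    "model": "qwen:72b",
    "messages": [{"role": "user", "content": "获取当前时间"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "获取用户当前时区的时间。",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
    "tool_choice": "auto",
}

# Legacy convention (functionCall): "functions" + "function_call".
functions_request = {
    "model": "qwen:72b",
    "messages": [{"role": "user", "content": "获取当前时间"}],
    "functions": [{
        "name": "get_current_time",
        "description": "获取用户当前时区的时间。",
        "parameters": {"type": "object", "properties": {}},
    }],
    "function_call": "auto",
}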

zhanghx0905 commented 4 months ago

This issue should be raised against xinference; the next xinference release will support this. You can try qwen1.5 for now: https://github.com/xorbitsai/inference/pull/1642

JinCheng666 commented 4 months ago

This issue should be raised against xinference; the next xinference release will support this. You can try qwen1.5 for now: xorbitsai/inference#1642

Thanks. I tried qwen1.5-14b and it does work correctly.

I just upgraded to inference 0.12.2. For qwen2 it no longer outputs the thought process, but the reply still contains some redundant intermediate "thought" values @zhanghx0905

(screenshot)

inference log

Question: 你是谁<|im_end|>
<|im_start|>assistant
Thought: I now know the final answer.
Final answer:  我是通义千问,由阿里云开发的AI助手。我被设计用来回答各种问题,提供信息和与用户进行对话。有什么我可以帮助你的吗?<|im_end|>
<|im_start|>user
Question: 获取当前时间<|im_end|>
<|im_start|>assistant
Thought: I can use m3VVjR.
Action: m3VVjR
Action Input: {}<|im_end|>
<|im_start|>function
Observation: {
  "time": "2024-06-21 19:08:26 Friday"
}<|im_end|>
<|im_start|>assistant
, generate config: {'temperature': 0.0, 'tool_choice': 'auto', 'stream': True, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
2024-06-21 19:08:33,402 xinference.core.model 38993 DEBUG    After request chat, current serve request count: 0 for the model qwen:72b
2024-06-21 19:08:33,402 xinference.core.model 38993 DEBUG    Leave wrapped_func, elapsed time: 0 s
INFO 06-21 19:08:33 async_llm_engine.py:564] Received request 994ce71c-2fbe-11ef-b1f7-0cda411d272a: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nAnswer the following questions as best you can. You have access to the following APIs:\n\nm3VVjR: Call this tool to interact with the m3VVjR API. What is the m3VVjR API useful for? 获取用户当前时区的时间。 Parameters: [] Format the arguments as a JSON object.\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [m3VVjR]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can be repeated zero or more times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: 你是谁<|im_end|>\n<|im_start|>assistant\nThought: I now know the final answer.\nFinal answer:  我是通义千问,由阿里云开发的AI助手。我被设计用来回答各种问题,提供信息和与用户进行对话。有什么我可以帮助你的吗?<|im_end|>\n<|im_start|>user\nQuestion: 获取当前时间<|im_end|>\n<|im_start|>assistant\nThought: I can use m3VVjR.\nAction: m3VVjR\nAction Input: {}<|im_end|>\n<|im_start|>function\nObservation: {\n  "time": "2024-06-21 19:08:26 Friday"\n}<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|endoftext|>', '<|im_start|>', '<|im_end|>'], stop_token_ids=[151643, 151644, 151645], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 06-21 19:08:36 async_llm_engine.py:133] Finished request 994ce71c-2fbe-11ef-b1f7-0cda411d272a.
2024-06-21 19:08:36,229 xinference.model.llm.utils 38993 DEBUG    Tool call content: 当前时间是2024年6月21日星期五19:08:26。, func: None, args: None
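
One more thing visible in this prompt: the earlier assistant turn is replayed into the context with its "Thought: I now know the final answer. / Final answer:" framing, so leftover intermediate text can keep re-entering later turns (whether that framing comes from the stored history or from xinference's own prompt template is not clear from the log). If the history itself is the carrier, it can be sanitized on the client before being resent, reusing the strip_react_thought sketch from above:

def sanitize_history(messages: list[dict]) -> list[dict]:
    """Strip ReAct bookkeeping from assistant turns before resending them
    as chat history. Relies on the strip_react_thought() sketch above."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            msg = {**msg, "content": strip_react_thought(msg["content"])}
        cleaned.append(msg)
    return cleaned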
JinCheng666 commented 3 months ago

After upgrading xinference to 0.12.3, the problem is resolved. Thanks, everyone.