labring / FastGPT

FastGPT is a knowledge-based platform built on LLMs. It offers a comprehensive suite of out-of-the-box capabilities such as data processing, RAG retrieval, and visual AI workflow orchestration, letting you easily develop and deploy complex question-answering systems without extensive setup or configuration.
https://tryfastgpt.ai

Connecting the Yi model: testing with one-api works fine, but an error occurs during LLM chat #1057

Closed xiaoToby closed 5 months ago

xiaoToby commented 5 months ago

Routine checks

Your version

Problem description: testing with one-api works fine, but the call fails when FastGPT invokes the model. Steps to reproduce / error log screenshot: the 200 response is from a test call made with the one-api tool. (screenshot)

Here is the code:

import gc
import traceback
import torch
import uvicorn
import time
import uuid
import anyio
import json
import os  # needed for os.getenv("ACCESS_TOKEN") below
from anyio.streams.memory import MemoryObjectSendStream
from functools import lru_cache
from abc import ABC
from threading import Lock, Thread
from types import MethodType
from argparse import ArgumentParser
from contextlib import asynccontextmanager
from functools import partial
from typing import Dict, List, Any, Optional, Union, Tuple, Iterator, Iterable, AsyncIterator
from loguru import logger
from starlette.concurrency import run_in_threadpool, iterate_in_threadpool
from sse_starlette import EventSourceResponse
from pydantic import BaseModel

from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from starlette.status import HTTP_401_UNAUTHORIZED

from openai.types.model import Model
from openai.types.chat.chat_completion_message import FunctionCall
from openai.types.chat.chat_completion_message_tool_call import ChatCompletionMessageToolCall
from openai.types.completion_usage import CompletionUsage
from openai.types.chat.chat_completion import Choice
from openai.types.chat.chat_completion_chunk import Choice as ChunkChoice
from openai.types.chat.chat_completion_chunk import (
    ChoiceDelta,
    ChoiceDeltaFunctionCall,
    ChoiceDeltaToolCall,
)
from openai.types.chat import (
    ChatCompletionMessage,
    ChatCompletion,
    ChatCompletionChunk,
    ChatCompletionMessageParam,
)

from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, PreTrainedModel
from transformers.generation import GenerationConfig

from utils import (
    Role, 
    ModelList, 
    ChatCompletionCreateParams,
    CompletionCreateParams,
    ErrorCode,
    ErrorResponse,
    model_dump,
    model_parse,
    model_json,
    get_context_length,
    apply_stopping_strings,
    load_model_on_gpus)

llama_outer_lock = Lock()

@asynccontextmanager
async def lifespan(app: FastAPI):  # collects GPU memory
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()

app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/v1/models")
async def list_models():
    return ModelList(
        data=[
            Model(
                id="yi",
                object="model",
                created=int(time.time()),
                owned_by="open"
            )
        ]
    )

TOKEN = os.getenv('ACCESS_TOKEN')
async def verify_token(request: Request):
    auth_header = request.headers.get('Authorization')
    if auth_header:
        token_type, _, token = auth_header.partition(' ')
        if (
            token_type.lower() == "bearer"
            and token == TOKEN
        ):  # configure your token here
            return True
    raise HTTPException(
        status_code=HTTP_401_UNAUTHORIZED,
        detail="Invalid authorization credentials",
    )

@app.post("/v1/chat/completions")
async def create_chat_completion(
    request: ChatCompletionCreateParams,
    raw_request: Request,
    token: bool = Depends(verify_token)
):
    global model, tokenizer

    if len(request.messages) < 1 or request.messages[-1]["role"] == Role.ASSISTANT:
        raise HTTPException(status_code=400, detail="Invalid request")

    request = await handle_request(request, template.stop)
    request.max_tokens = request.max_tokens or 1024

    params = model_dump(request)
    params.update(dict(echo=False))
    logger.debug(f"==== request ====\n{params}")

    iterator_or_completion = await run_in_threadpool(_create_chat_completion, params)

    if isinstance(iterator_or_completion, Iterator):
        # It's easier to ask for forgiveness than permission
        first_response = await run_in_threadpool(next, iterator_or_completion)

        # If no exception was raised from first_response, we can assume that
        # the iterator is valid, and we can use it to stream the response.
        def iterator() -> Iterator:
            yield first_response
            yield from iterator_or_completion

        send_chan, recv_chan = anyio.create_memory_object_stream(10)
        return EventSourceResponse(
            recv_chan,
            data_sender_callable=partial(
                get_event_publisher,
                request=raw_request,
                inner_send_chan=send_chan,
                iterator=iterator(),
            ),
        )
    else:
        return iterator_or_completion

def _create_chat_completion(
    params: Optional[Dict[str, Any]] = None,
    **kwargs,
) -> Union[Iterator, ChatCompletion]:
    params = params or {}
    params.update(kwargs)
    return (
        _create_chat_completion_stream(params)
        if params.get("stream", False)
        else _create_chat_completion_non_stream(params)
    )

def _create_chat_completion_stream(params: Dict[str, Any]) -> Iterator:
    """
    Creates a chat completion stream.

    Args:
        params (Dict[str, Any]): The parameters for generating the chat completion.

    Yields:
        Dict[str, Any]: The output of the chat completion stream.
    """
    _id, _created, _model = None, None, None
    has_function_call = False
    for i, output in enumerate(_generate(params)):
        if output["error_code"] != 0:
            yield output
            return

        _id, _created, _model = output["id"], output["created"], output["model"]
        if i == 0:
            choice = ChunkChoice(
                index=0,
                delta=ChoiceDelta(role="assistant", content=""),
                finish_reason=None,
            )
            yield ChatCompletionChunk(
                id=f"chat{_id}",
                choices=[choice],
                created=_created,
                model=_model,
                object="chat.completion.chunk",
            )

        finish_reason = output["finish_reason"]
        if len(output["delta"]) == 0 and finish_reason != "function_call":
            continue

        function_call = None
        if finish_reason == "function_call":
            try:
                _, function_call = template.parse_assistant_response(
                    output["text"], params.get("functions"), params.get("tools"),
                )
            except Exception as e:
                traceback.print_exc()
                logger.warning("Failed to parse tool call")

        if isinstance(function_call, dict) and "arguments" in function_call:
            has_function_call = True
            function_call = ChoiceDeltaFunctionCall(**function_call)
            delta = ChoiceDelta(
                content=output["delta"],
                function_call=function_call
            )
        elif isinstance(function_call, dict) and "function" in function_call:
            has_function_call = True
            finish_reason = "tool_calls"
            function_call["index"] = 0
            tool_calls = [model_parse(ChoiceDeltaToolCall, function_call)]
            delta = ChoiceDelta(
                content=output["delta"],
                tool_calls=tool_calls,
            )
        else:
            delta = ChoiceDelta(content=output["delta"])

        choice = ChunkChoice(
            index=0,
            delta=delta,
            finish_reason=finish_reason
        )
        yield ChatCompletionChunk(
            id=f"chat{_id}",
            choices=[choice],
            created=_created,
            model=_model,
            object="chat.completion.chunk",
        )

    if not has_function_call:
        choice = ChunkChoice(
            index=0,
            delta=ChoiceDelta(),
            finish_reason="stop"
        )
        yield ChatCompletionChunk(
            id=f"chat{_id}",
            choices=[choice],
            created=_created,
            model=_model,
            object="chat.completion.chunk",
        )

def _create_chat_completion_non_stream(params: Dict[str, Any]) -> Union[ChatCompletion, JSONResponse]:
    """
    Creates a chat completion based on the given parameters.

    Args:
        params (Dict[str, Any]): The parameters for generating the chat completion.

    Returns:
        ChatCompletion: The generated chat completion.
    """
    last_output = None
    for output in _generate(params):
        last_output = output

    if last_output["error_code"] != 0:
        return create_error_response(last_output["error_code"], last_output["text"])

    function_call, finish_reason = None, "stop"
    if params.get("functions") or params.get("tools"):
        try:
            res, function_call = template.parse_assistant_response(
                last_output["text"], params.get("functions"), params.get("tools"),
            )
            last_output["text"] = res
        except Exception as e:
            traceback.print_exc()
            logger.warning("Failed to parse tool call")

    if isinstance(function_call, dict) and "arguments" in function_call:
        finish_reason = "function_call"
        function_call = FunctionCall(**function_call)
        message = ChatCompletionMessage(
            role="assistant",
            content=last_output["text"],
            function_call=function_call,
        )
    elif isinstance(function_call, dict) and "function" in function_call:
        finish_reason = "tool_calls"
        tool_calls = [model_parse(ChatCompletionMessageToolCall, function_call)]
        message = ChatCompletionMessage(
            role="assistant",
            content=last_output["text"],
            tool_calls=tool_calls,
        )
    else:
        message = ChatCompletionMessage(
            role="assistant",
            content=last_output["text"].strip(),
        )

    choice = Choice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    usage = model_parse(CompletionUsage, last_output["usage"])
    return ChatCompletion(
        id=f"chat{last_output['id']}",
        choices=[choice],
        created=last_output["created"],
        model=last_output["model"],
        object="chat.completion",
        usage=usage,
    )

def _generate(params: Dict[str, Any]) -> Iterator:
    """
    Generates text based on the given parameters.

    Args:
        params (Dict[str, Any]): A dictionary containing the parameters for text generation.

    Yields:
        Iterator: A dictionary containing the generated text and error code.
    """
    messages = params.get("messages")
    inputs, prompt = _apply_chat_template(
        messages,
        max_new_tokens=params.get("max_tokens", 256),
        functions=params.get("functions"),
        tools=params.get("tools"),
    )

    params.update(dict(inputs=inputs, prompt=prompt))

    try:
        for output in _generate_stream_func(params):
            output["error_code"] = 0
            yield output

    except (ValueError, RuntimeError) as e:
        traceback.print_exc()
        yield {
            "text": f"{e}",
            "error_code": ErrorCode.INTERNAL_ERROR,
        }

def _apply_chat_template(
    messages: List[ChatCompletionMessageParam],
    max_new_tokens: Optional[int] = 256,
    functions: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> Tuple[Union[List[int], Dict[str, Any]], Optional[str]]:
    """
    Apply chat template to generate model inputs and prompt.

    Args:
        messages (List[ChatCompletionMessageParam]): List of chat completion message parameters.
        max_new_tokens (Optional[int], optional): Maximum number of new tokens to generate. Defaults to 256.
        functions (Optional[Union[Dict[str, Any], List[Dict[str, Any]]]], optional): Functions to apply to the messages. Defaults to None.
        tools (Optional[List[Dict[str, Any]]], optional): Tools to apply to the messages. Defaults to None.
        **kwargs: Additional keyword arguments.

    Returns:
        Tuple[Union[List[int], Dict[str, Any]], Union[str, None]]: Tuple containing the generated inputs and prompt.
    """
    if template.function_call_available:
        messages = template.postprocess_messages(
            messages, functions, tools=tools,
        )
        if functions or tools:
            logger.debug(f"==== Messages with tools ====\n{messages}")

    prompt = template.apply_chat_template(messages)
    inputs = tokenizer(prompt).input_ids
    if isinstance(inputs, list):
        max_src_len = context_len - max_new_tokens - 1
        inputs = inputs[-max_src_len:]

    return inputs, prompt

@torch.inference_mode()
def _generate_stream_func(
    params: Dict[str, Any],
):
    input_ids = params.get("inputs")
    functions = params.get("functions")
    model_name = params.get("model", "llm")
    temperature = float(params.get("temperature", 1.0))
    repetition_penalty = float(params.get("repetition_penalty", 1.0))
    top_p = float(params.get("top_p", 1.0))
    top_k = int(params.get("top_k", 40))
    max_new_tokens = int(params.get("max_tokens", 256))

    stop_token_ids = params.get("stop_token_ids") or []
    if tokenizer.eos_token_id not in stop_token_ids:
        stop_token_ids.append(tokenizer.eos_token_id)
    stop_strings = params.get("stop", [])

    input_echo_len = len(input_ids)
    device = model.device
    generation_kwargs = dict(
        input_ids=torch.tensor([input_ids], device=device),
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        pad_token_id=tokenizer.pad_token_id,
    )
    if temperature <= 1e-5:
        generation_kwargs["do_sample"] = False
        generation_kwargs.pop("top_k")

    streamer = TextIteratorStreamer(
        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs["streamer"] = streamer

    if "GenerationMixin" not in str(model.generate.__func__):
        model.generate = MethodType(PreTrainedModel.generate, model)

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    generated_text, func_call_found = "", False
    completion_id: str = f"cmpl-{str(uuid.uuid4())}"
    created: int = int(time.time())
    previous_text = ""
    for i, new_text in enumerate(streamer):
        generated_text += new_text
        if functions:
            _, func_call_found = apply_stopping_strings(generated_text, ["Observation:"])
        generated_text, stop_found = apply_stopping_strings(generated_text, stop_strings)

        if generated_text and generated_text[-1] != "�":
            delta_text = generated_text[len(previous_text):]
            previous_text = generated_text

            yield {
                "id": completion_id,
                "object": "text_completion",
                "created": created,
                "model": model_name,
                "delta": delta_text,
                "text": generated_text,
                "logprobs": None,
                "finish_reason": "function_call" if func_call_found else None,
                "usage": {
                    "prompt_tokens": input_echo_len,
                    "completion_tokens": i,
                    "total_tokens": input_echo_len + i,
                },
            }

        if stop_found:
            break

    yield {
        "id": completion_id,
        "object": "text_completion",
        "created": created,
        "model": model_name,
        "delta": "",
        "text": generated_text,
        "logprobs": None,
        "finish_reason": "stop",
        "usage": {
            "prompt_tokens": input_echo_len,
            "completion_tokens": i,
            "total_tokens": input_echo_len + i,
        },
    }

class YiAITemplate(ABC):
    """ https://huggingface.co/01-ai/Yi-34B-Chat/blob/main/tokenizer_config.json """

    name = "yi"
    system_prompt: Optional[str] = ""
    allow_models = ["yi"]
    stop = {
        "strings": ["<|endoftext|>", "<|im_end|>"],
        "token_ids": [2, 6, 7, 8],  # "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"
    }
    function_call_available: Optional[bool] = False

    def apply_chat_template(
        self,
        conversation: List[ChatCompletionMessageParam],
        add_generation_prompt: bool = True,
    ) -> str:
        """
        Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys to a prompt.

        Args:
            conversation (List[ChatCompletionMessageParam]): A Conversation object or list of dicts
                with "role" and "content" keys, representing the chat history so far.
            add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
                the start of an assistant message. This is useful when you want to generate a response from the model.
                Note that this argument will be passed to the chat template, and so it must be supported in the
                template for this argument to have any effect.

        Returns:
            `str`: A prompt, which is ready to pass to the tokenizer.
        """
        # Compilation function uses a cache to avoid recompiling the same template
        compiled_template = _compile_jinja_template(self.template)
        return compiled_template.render(
            messages=conversation,
            add_generation_prompt=add_generation_prompt,
            system_prompt=self.system_prompt,
        )

    @property
    def template(self) -> str:
        return (
            "{% for message in messages %}"
            "{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ '<|im_start|>assistant\\n' }}"
            "{% endif %}"
        )

    def postprocess_messages(
        self,
        messages: List[ChatCompletionMessageParam],
        functions: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
    ) -> List[Dict[str, Any]]:
        return messages

    def parse_assistant_response(
        self,
        output: str,
        functions: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
    ) -> Tuple[str, Optional[Union[str, Dict[str, Any]]]]:
        return output, None

@lru_cache
def _compile_jinja_template(chat_template: str):
    """
    Compile a Jinja template from a string.

    Args:
        chat_template (str): The string representation of the Jinja template.

    Returns:
        jinja2.Template: The compiled Jinja template.

    Examples:
        >>> template_string = "Hello, {{ name }}!"
        >>> template = _compile_jinja_template(template_string)
    """
    try:
        from jinja2.exceptions import TemplateError
        from jinja2.sandbox import ImmutableSandboxedEnvironment
    except ImportError:
        raise ImportError("apply_chat_template requires jinja2 to be installed.")

    def raise_exception(message):
        raise TemplateError(message)

    jinja_env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
    jinja_env.globals["raise_exception"] = raise_exception
    return jinja_env.from_string(chat_template)

async def handle_request(
        request: Union[CompletionCreateParams, ChatCompletionCreateParams],
        stop: Dict[str, Any] = None
) -> Union[Union[CompletionCreateParams, ChatCompletionCreateParams], JSONResponse]:
    error_check_ret = check_requests(request)
    if error_check_ret is not None:
        raise error_check_ret

    # stop settings
    _stop, _stop_token_ids = [], []
    if stop is not None:
        _stop_token_ids = stop.get("token_ids", [])
        _stop = stop.get("strings", [])

    request.stop = request.stop or []
    if isinstance(request.stop, str):
        request.stop = [request.stop]

    if request.functions:
        request.stop.append("Observation:")

    request.stop = list(set(_stop + request.stop))
    request.stop_token_ids = request.stop_token_ids or []
    request.stop_token_ids = list(set(_stop_token_ids + request.stop_token_ids))

    return request

def check_requests(request: Union[CompletionCreateParams, ChatCompletionCreateParams]) -> Optional[JSONResponse]:
    # Check all params
    if request.max_tokens is not None and request.max_tokens <= 0:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.max_tokens} is less than the minimum of 1 - 'max_tokens'",
        )
    if request.n is not None and request.n <= 0:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.n} is less than the minimum of 1 - 'n'",
        )
    if request.temperature is not None and request.temperature < 0:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.temperature} is less than the minimum of 0 - 'temperature'",
        )
    if request.temperature is not None and request.temperature > 2:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.temperature} is greater than the maximum of 2 - 'temperature'",
        )
    if request.top_p is not None and request.top_p < 0:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.top_p} is less than the minimum of 0 - 'top_p'",
        )
    if request.top_p is not None and request.top_p > 1:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.top_p} is greater than the maximum of 1 - 'temperature'",
        )
    if request.stop is None or isinstance(request.stop, (str, list)):
        return None
    else:
        return create_error_response(
            ErrorCode.PARAM_OUT_OF_RANGE,
            f"{request.stop} is not valid under any of the given schemas - 'stop'",
        )

def create_error_response(code: int, message: str) -> JSONResponse:
    return JSONResponse(model_dump(ErrorResponse(message=message, code=code)), status_code=500)

async def get_event_publisher(
    request: Request,
    inner_send_chan: MemoryObjectSendStream,
    iterator: Union[Iterator, AsyncIterator],
):
    async with inner_send_chan:
        try:
            async for chunk in iterate_in_threadpool(iterator):
                if isinstance(chunk, BaseModel):
                    chunk = model_json(chunk)
                elif isinstance(chunk, dict):
                    chunk = json.dumps(chunk, ensure_ascii=False)

                await inner_send_chan.send(dict(data=chunk))

                if await request.is_disconnected():
                    raise anyio.get_cancelled_exc_class()()

                if llama_outer_lock.locked():
                    await inner_send_chan.send(dict(data="[DONE]"))
                    raise anyio.get_cancelled_exc_class()()
        except anyio.get_cancelled_exc_class() as e:
            logger.info("disconnected")
            with anyio.move_on_after(1, shield=True):
                logger.info(f"Disconnected from client (via refresh/close) {request.client}")
                raise e

def _get_args():
    parser = ArgumentParser()
    parser.add_argument(
        "-c",
        "--checkpoint-path",
        type=str,
        default="model/Yi-34B-Chat-8bits/",
        help="Checkpoint name or path, default to %(default)r",
    )
    parser.add_argument(
        "--cpu-only", action="store_true", help="Run demo with CPU only"
    )
    parser.add_argument(
        "--server-port", type=int, default=8000, help="Demo server port."
    )
    parser.add_argument(
        "--server-name",
        type=str,
        default="127.0.0.1",
        help="Demo server name. Default: 127.0.0.1, which is only visible from the local computer."
        " If you want other computers to access your server, use 0.0.0.0 instead.",
    )
    parser.add_argument(
        "--context_len", type=int, default=None, help="Context length for generating completions."
    )
    parser.add_argument("--disable-gc", action="store_true",
                        help="Disable GC after each response generated.")

    args = parser.parse_args()
    return args

if __name__ == "__main__":
    args = _get_args()

    tokenizer = AutoTokenizer.from_pretrained(
        args.checkpoint_path
    )

    if args.cpu_only:
        device = "cpu"
    else:
        device = "cuda"

    model = AutoModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        torch_dtype='auto'
    ).to(device).eval()

    model.generation_config = GenerationConfig.from_pretrained(
        args.checkpoint_path
    )

    context_len = get_context_length(model.config) if args.context_len is None else args.context_len
    template = YiAITemplate()

    # uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1)
    uvicorn.run(app, host="0.0.0.0", port=7002, workers=1)  # log_config was not defined in this snippet, so it is omitted

Expected result

Related screenshots

c121914yu commented 5 months ago

I haven't run into this; it works fine with the official Yi key.

xiaoToby commented 5 months ago

I haven't run into this; it works fine with the official Yi key.

(screenshot) This is the error log.

xiaoToby commented 5 months ago

I haven't run into this; it works fine with the official Yi key.

Testing with one-api works fine; the problem only appears when calling it from FastGPT.

c121914yu commented 5 months ago

I haven't run into this; it works fine with the official Yi key.

Testing with one-api works fine; the problem only appears when calling it from FastGPT.

Did you actually test streaming? You probably only tested non-streaming.

xiaoToby commented 5 months ago

What are streaming and non-streaming?

c121914yu commented 5 months ago

What are streaming and non-streaming?

It's the stream=true/false mode. The one-api test defaults to false, so you have to issue a curl request manually, or use a tool like Apifox, to test stream mode.
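
For example, a minimal streaming test in Python against an OpenAI-compatible endpoint; the base_url, api_key, and model name below are placeholders, so point them at your one-api or local Yi deployment:

from openai import OpenAI

# Placeholders: adjust base_url / api_key to your one-api or local Yi server.
client = OpenAI(base_url="http://localhost:7002/v1", api_key="sk-xxx")

stream = client.chat.completions.create(
    model="yi",
    messages=[{"role": "user", "content": "你好"}],
    stream=True,  # this is the code path FastGPT uses for chat
)
for chunk in stream:
    # Some chunks (e.g. the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)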

nongmo677 commented 5 months ago

From the error it looks like your pad_token_id handling is the problem. Check which Yi model you are using and how many max_new_tokens it supports at most; the reply limit you selected in the UI may be set higher than that.
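
One way to check this is to inspect the checkpoint's config and tokenizer directly; a minimal sketch, assuming a transformers checkpoint path like the one in the script above (swap in your own):

from transformers import AutoConfig, AutoTokenizer

checkpoint = "model/Yi-34B-Chat-8bits/"  # placeholder path, use your own checkpoint
config = AutoConfig.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

print("pad_token_id:", tokenizer.pad_token_id)   # None here would break generate(pad_token_id=...)
print("eos_token_id:", tokenizer.eos_token_id)
print("max_position_embeddings:", getattr(config, "max_position_embeddings", None))
# The reply limit chosen in the FastGPT UI (max_tokens) plus the prompt length
# must stay within this context window, or generation can fail or truncate.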

xiaoToby commented 5 months ago

From the error it looks like your pad_token_id handling is the problem. Check which Yi model you are using and how many max_new_tokens it supports at most; the reply limit you selected in the UI may be set higher than that.

I'm using 01-ai/Yi-6B-Chat-8bits. My OpenAI-compatible API file sets max_new_tokens=256; after changing the reply limit to < 256 in the FastGPT app (Advanced Orchestration > AI Config), it works. Does that mean I should set max_new_tokens to a larger value in my API? @nongmo677

nongmo677 commented 5 months ago

From the error it looks like your pad_token_id handling is the problem. Check which Yi model you are using and how many max_new_tokens it supports at most; the reply limit you selected in the UI may be set higher than that.

I'm using 01-ai/Yi-6B-Chat-8bits. My OpenAI-compatible API file sets max_new_tokens=256; after changing the reply limit to < 256 in the FastGPT app (Advanced Orchestration > AI Config), it works. Does that mean I should set max_new_tokens to a larger value in my API? @nongmo677

@xiaoToby To be precise, your parameters are off. FastGPT talks to the OpenAI API (or to open-source projects that mimic the OpenAI API), and there is no parameter called max_new_tokens; the correct parameter is max_tokens, which is why adjusting the reply limit in FastGPT made it work. Also, judging from https://huggingface.co/01-ai/Yi-6B-Chat-8bits/blob/main/tokenizer_config.json, the model should support up to 4096 tokens. Conclusion: simply raising max_new_tokens would also work, but it is better to use a mapping and dynamically accept the OpenAI API parameters that FastGPT passes in (a sketch of such a mapping is shown below); I would also suggest deploying with one of the other existing serving projects.
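
A hedged sketch of that mapping, assuming a model limit of 4096 tokens per the tokenizer_config.json linked above: take the OpenAI-style max_tokens sent by FastGPT and clamp it to what the checkpoint can actually generate, instead of hard-coding max_new_tokens=256 on the server side.

MODEL_MAX_LEN = 4096  # assumed from the checkpoint's config; adjust for your model

def resolve_max_new_tokens(request_max_tokens, prompt_len, default=256):
    """Map the client's max_tokens onto max_new_tokens, leaving room for the prompt."""
    requested = request_max_tokens or default
    budget = MODEL_MAX_LEN - prompt_len - 1  # tokens still available in the context window
    return max(1, min(requested, budget))

# e.g. inside _generate_stream_func in the script above:
# max_new_tokens = resolve_max_new_tokens(params.get("max_tokens"), len(input_ids))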

mingmars commented 1 month ago

2024-08-05 00:42:54 [Warn] 2024-08-04 16:42:54 LLM response error {"requestBody":{"model":"llamazk:latest","temperature":0.36,"max_tokens":2000,"stream":true,"messages":[{"role":"system","content":"回答要用中文,要有语义符号分断!"},{"role":"user","content":"你是谁"}]}}
2024-08-05 00:42:54 [Error] 2024-08-04 16:42:54 sse error: LLM model response empty
2024-08-05 00:41:55 {
  message: 'LLM model response empty',
  stack: 'Error: LLM model response empty\n' +
    ' at /app/projects/app/.next/server/chunks/96960.js:318:790\n' +
    ' at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n' +
    ' at async Object.q [as chatNode] (/app/projects/app/.next/server/chunks/96960.js:318:645)\n' +
    ' at async k (/app/projects/app/.next/server/chunks/96960.js:319:2662)\n' +
    ' at async Promise.all (index 0)\n' +
    ' at async C (/app/projects/app/.next/server/chunks/96960.js:319:3248)\n' +
    ' at async v (/app/projects/app/.next/server/pages/api/core/chat/chatTest.js:1:7066)\n' +
    ' at async /app/projects/app/.next/server/pages/api/core/app/detail.js:1:5511\n' +
    ' at async K (/app/nodemodules/.pnpm/next@14.2.5@babel+core@7.24.9_react-dom@18.3.1_react@18.3.1react@18.3.1_sass@1.77.8/node_modules/next/dist/compiled/next-server/pages-api.runtime.prod.js:20:16853)\n' +
    ' at async U.render (/app/nodemodules/.pnpm/next@14.2.5@babel+core@7.24.9_react-dom@18.3.1_react@18.3.1react@18.3.1_sass@1.77.8/node_modules/next/dist/compiled/next-server/pages-api.runtime.prod.js:20:17492)'
}
(The same 'LLM model response empty' stack trace repeats at 00:42:54, 00:43:52, 00:45:00, 00:45:36, 00:50:30, and 00:50:34; the later entries go through /app/projects/app/.next/server/pages/api/v1/chat/completions.js instead of chatTest.js.)

Calling it through one-api from a Python script returns normally, but it fails when used from FastGPT.

xiaoToby commented 1 month ago

Please post the .py file you use to serve the Yi model API.