npuichigo / openai_trtllm

OpenAI compatible API for TensorRT LLM triton backend
MIT License

ERROR: expected number of inputs between 1 and 3 but got 9 inputs for model #38

Open samzong opened 6 months ago

samzong commented 6 months ago

{"timestamp":"2024-04-15T05:20:55.796456Z","level":"ERROR","error":"AppError(error message received from triton: [request id: ] expected number of inputs between 1 and 3 but got 9 inputs for model 'myserving')","target":"openai_trtllm::routes::completions","span":{"headers":"{\"host\": \"localhost:3030\", \"user-agent\": \"OpenAI/Python 1.17.1\", \"content-length\": \"55\", \"accept\": \"application/json\", \"accept-encoding\": \"gzip, deflate\", \"authorization\": \"Bearer test\", \"content-type\": \"application/json\", \"x-stainless-arch\": \"arm64\", \"x-stainless-async\": \"false\", \"x-stainless-lang\": \"python\", \"x-stainless-os\": \"MacOS\", \"x-stainless-package-version\": \"1.17.1\", \"x-stainless-runtime\": \"CPython\", \"x-stainless-runtime-version\": \"3.10.5\"}","name":"non-streaming completions"},"spans":[{"http.request.method":"POST","http.route":"/v1/completions","network.protocol.version":"1.1","otel.kind":"Server","otel.name":"POST /v1/completions","server.address":"localhost:3030","span.type":"web","url.path":"/v1/completions","url.scheme":"","user_agent.original":"OpenAI/Python 1.17.1","name":"HTTP request"},{"headers":"{\"host\": \"localhost:3030\", \"user-agent\": \"OpenAI/Python 1.17.1\", \"content-length\": \"55\", \"accept\": \"application/json\", \"accept-encoding\": \"gzip, deflate\", \"authorization\": \"Bearer test\", \"content-type\": \"application/json\", \"x-stainless-arch\": \"arm64\", \"x-stainless-async\": \"false\", \"x-stainless-lang\": \"python\", \"x-stainless-os\": \"MacOS\", \"x-stainless-package-version\": \"1.17.1\", \"x-stainless-runtime\": \"CPython\", \"x-stainless-runtime-version\": \"3.10.5\"}","name":"completions"},{"headers":"{\"host\": \"localhost:3030\", \"user-agent\": \"OpenAI/Python 1.17.1\", \"content-length\": \"55\", \"accept\": \"application/json\", \"accept-encoding\": \"gzip, deflate\", \"authorization\": \"Bearer test\", \"content-type\": \"application/json\", \"x-stainless-arch\": \"arm64\", \"x-stainless-async\": \"false\", \"x-stainless-lang\": \"python\", \"x-stainless-os\": \"MacOS\", \"x-stainless-package-version\": \"1.17.1\", \"x-stainless-runtime\": \"CPython\", \"x-stainless-runtime-version\": \"3.10.5\"}","name":"non-streaming completions"}]}

Reproduced using client/openai_completion.py.
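
For context, the failing request in the log can be reproduced with an OpenAI client call along these lines (a minimal sketch: the base URL, API key, and model name come from the log above, while the prompt and max_tokens are placeholders):

import openai  # OpenAI Python client 1.x, matching the user-agent in the log

# Minimal sketch of the failing /v1/completions request.
client = openai.OpenAI(
    base_url="http://localhost:3030/v1",  # openai_trtllm address from the log
    api_key="test",                       # matches "Bearer test" in the headers
)

completion = client.completions.create(
    model="myserving",           # the vLLM-backed Triton model named in the error
    prompt="Hello, my name is",  # placeholder
    max_tokens=16,               # placeholder
)
print(completion.choices[0].text)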

samzong commented 6 months ago

I think I know the problem: my Triton backend uses Triton with vLLM rather than TensorRT-LLM.

Is there a plan to support it?

npuichigo commented 6 months ago

It's not planned yet, but I think it's trivial to adapt the code for your use case.

samzong commented 6 months ago

> It's not planned yet, but I think it's trivial to adapt the code for your use case.

Do you have any suggestions? I can try to implement it, and if it works out, I can contribute that part of the code.

npuichigo commented 6 months ago

Can you show how the vLLM-based Triton backend is called? For example, the gRPC interface and the parameters used to call the service.

samzong commented 6 months ago

Okay @npuichigo, you can see an example here:

https://github.com/triton-inference-server/vllm_backend/blob/a01475157290bdf6fd0f50688f69aafea41b04c5/samples/client.py#L192

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-m",
        "--model",
        type=str,
        required=False,
        default="vllm_model",
        help="Model name",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        required=False,
        default=False,
        help="Enable verbose output",
    )
    parser.add_argument(
        "-u",
        "--url",
        type=str,
        required=False,
        default="localhost:8001",
        help="Inference server URL and its gRPC port. Default is localhost:8001.",
    )
    parser.add_argument(
        "-t",
        "--stream-timeout",
        type=float,
        required=False,
        default=None,
        help="Stream timeout in seconds. Default is None.",
    )
    parser.add_argument(
        "--offset",
        type=int,
        required=False,
        default=0,
        help="Add offset to request IDs used",
    )
    parser.add_argument(
        "--input-prompts",
        type=str,
        required=False,
        default="prompts.txt",
        help="Text file with input prompts",
    )
    parser.add_argument(
        "--results-file",
        type=str,
        required=False,
        default="results.txt",
        help="The file with output results",
    )
    parser.add_argument(
        "--iterations",
        type=int,
        required=False,
        default=1,
        help="Number of iterations through the prompts file",
    )
    parser.add_argument(
        "-s",
        "--streaming-mode",
        action="store_true",
        required=False,
        default=False,
        help="Enable streaming mode",
    )
    parser.add_argument(
        "--exclude-inputs-in-outputs",
        action="store_true",
        required=False,
        default=False,
        help="Exclude prompt from outputs",
    )

    FLAGS = parser.parse_args()  # the sample then builds its async gRPC client from these flags
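
For comparison, a stripped-down request to that backend looks roughly like the sketch below (not the exact sample code; it uses the defaults above, model vllm_model and gRPC endpoint localhost:8001, with an illustrative prompt and sampling parameters). It carries only text_input plus the optional stream and sampling_parameters inputs, which is presumably why a request built for the TensorRT-LLM ensemble's nine inputs is rejected with the error in this issue:

import json
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

# Sketch of a single non-streaming request to the vllm_backend model over the
# streaming gRPC API (the backend runs decoupled, so stream_infer is required
# even when stream=False).
responses = queue.Queue()

def callback(result_queue, result, error):
    result_queue.put(error if error is not None else result)

text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["Hello, my name is"], dtype=object))

stream = grpcclient.InferInput("stream", [1], "BOOL")
stream.set_data_from_numpy(np.array([False]))

params = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
params.set_data_from_numpy(
    np.array([json.dumps({"temperature": "0.7", "top_p": "0.95"})], dtype=object)
)

with grpcclient.InferenceServerClient(url="localhost:8001") as client:
    client.start_stream(callback=partial(callback, responses))
    client.async_stream_infer(
        model_name="vllm_model",
        inputs=[text_input, stream, params],
        request_id="1",
    )
    client.stop_stream()

result = responses.get()
print(result.as_numpy("text_output"))

Adapting openai_trtllm would then roughly mean building these three inputs (with the sampling parameters serialized as a JSON string) instead of the TensorRT-LLM ensemble inputs.
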
liyan77 commented 3 months ago

My Triton backend also uses Triton with vLLM. Is there a plan to support it?

crslen commented 2 months ago

It would be great if the vLLM option were supported.

ChaseDreamInfinity commented 1 month ago

I made some changes to support the vLLM backend: https://github.com/ChaseDreamInfinity/openai_triton_vllm