Open samzong opened 6 months ago
I think I know the problem: my Triton backend uses Triton with vllm.
Do we have a plan to support it?
It's not planned yet, but I think it's trivial to adapt the code for your use case.
Do you have any suggestions? I can try to implement it, and if I manage to, I'll contribute that part of the code.
Can you show how to call the vllm-based Triton backend? For example, the gRPC interface and the parameters needed to call the service.
Okay @npuichigo, you can see an example here.
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-m",
        "--model",
        type=str,
        required=False,
        default="vllm_model",
        help="Model name",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        required=False,
        default=False,
        help="Enable verbose output",
    )
    parser.add_argument(
        "-u",
        "--url",
        type=str,
        required=False,
        default="localhost:8001",
        help="Inference server URL and its gRPC port. Default is localhost:8001.",
    )
    parser.add_argument(
        "-t",
        "--stream-timeout",
        type=float,
        required=False,
        default=None,
        help="Stream timeout in seconds. Default is None.",
    )
    parser.add_argument(
        "--offset",
        type=int,
        required=False,
        default=0,
        help="Add offset to request IDs used",
    )
    parser.add_argument(
        "--input-prompts",
        type=str,
        required=False,
        default="prompts.txt",
        help="Text file with input prompts",
    )
    parser.add_argument(
        "--results-file",
        type=str,
        required=False,
        default="results.txt",
        help="The file with output results",
    )
    parser.add_argument(
        "--iterations",
        type=int,
        required=False,
        default=1,
        help="Number of iterations through the prompts file",
    )
    parser.add_argument(
        "-s",
        "--streaming-mode",
        action="store_true",
        required=False,
        default=False,
        help="Enable streaming mode",
    )
    parser.add_argument(
        "--exclude-inputs-in-outputs",
        action="store_true",
        required=False,
        default=False,
        help="Exclude prompt from outputs",
    )
    args = parser.parse_args()
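The snippet above only parses the CLI flags; the actual call goes through Triton's gRPC client. Below is a minimal sketch of how the request tensors could be built, assuming the input names used by the upstream vllm backend sample (text_input, stream, sampling_parameters) and its text_output output; check your model's config if yours differ.

import json

import numpy as np
import tritonclient.grpc as grpcclient

def build_vllm_inputs(prompt, stream=False, sampling_parameters=None):
    # text_input: the prompt, as a single BYTES tensor.
    text_input = grpcclient.InferInput("text_input", [1], "BYTES")
    text_input.set_data_from_numpy(
        np.array([prompt.encode("utf-8")], dtype=np.object_)
    )

    # stream: whether the server should stream tokens back as they are generated.
    stream_input = grpcclient.InferInput("stream", [1], "BOOL")
    stream_input.set_data_from_numpy(np.array([stream], dtype=bool))

    # sampling_parameters: vllm sampling options encoded as a JSON string.
    params = json.dumps(sampling_parameters or {"temperature": "0.1", "top_p": "0.95"})
    sampling_input = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
    sampling_input.set_data_from_numpy(
        np.array([params.encode("utf-8")], dtype=np.object_)
    )

    outputs = [grpcclient.InferRequestedOutput("text_output")]
    return [text_input, stream_input, sampling_input], outputs

The returned inputs and outputs can then be passed to the tritonclient infer call (or stream_infer when --streaming-mode is set) together with the model name parsed above.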
My Triton backend also uses Triton with vllm. Is there a plan to support it?
It would be great if the vllm option were supported.
I made some changes to let it support the vllm backend: https://github.com/ChaseDreamInfinity/openai_triton_vllm
{"timestamp":"2024-04-15T05:20:55.796456Z","level":"ERROR","error":"AppError(error message received from triton: [request id:] expected number of inputs between 1 and 3 but got 9 inputs for model 'myserving')","target":"openai_trtllm::routes::completions","span":{"headers":"{\"host\": \"localhost:3030\", \"user-agent\": \"OpenAI/Python 1.17.1\", \"content-length\": \"55\", \"accept\": \"application/json\", \"accept-encoding\": \"gzip, deflate\", \"authorization\": \"Bearer test\", \"content-type\": \"application/json\", \"x-stainless-arch\": \"arm64\", \"x-stainless-async\": \"false\", \"x-stainless-lang\": \"python\", \"x-stainless-os\": \"MacOS\", \"x-stainless-package-version\": \"1.17.1\", \"x-stainless-runtime\": \"CPython\", \"x-stainless-runtime-version\": \"3.10.5\"}","name":"non-streaming completions"},"spans":[{"http.request.method":"POST","http.route":"/v1/completions","network.protocol.version":"1.1","otel.kind":"Server","otel.name":"POST /v1/completions","server.address":"localhost:3030","span.type":"web","url.path":"/v1/completions","url.scheme":"","user_agent.original":"OpenAI/Python 1.17.1","name":"HTTP request"},{"headers":"{\"host\": \"localhost:3030\", \"user-agent\": \"OpenAI/Python 1.17.1\", \"content-length\": \"55\", \"accept\": \"application/json\", \"accept-encoding\": \"gzip, deflate\", \"authorization\": \"Bearer test\", \"content-type\": \"application/json\", \"x-stainless-arch\": \"arm64\", \"x-stainless-async\": \"false\", \"x-stainless-lang\": \"python\", \"x-stainless-os\": \"MacOS\", \"x-stainless-package-version\": \"1.17.1\", \"x-stainless-runtime\": \"CPython\", \"x-stainless-runtime-version\": \"3.10.5\"}","name":"completions"},{"headers":"{\"host\": \"localhost:3030\", \"user-agent\": \"OpenAI/Python 1.17.1\", \"content-length\": \"55\", \"accept\": \"application/json\", \"accept-encoding\": \"gzip, deflate\", \"authorization\": \"Bearer test\", \"content-type\": \"application/json\", \"x-stainless-arch\": \"arm64\", \"x-stainless-async\": \"false\", \"x-stainless-lang\": \"python\", \"x-stainless-os\": \"MacOS\", \"x-stainless-package-version\": \"1.17.1\", \"x-stainless-runtime\": \"CPython\", \"x-stainless-runtime-version\": \"3.10.5\"}","name":"non-streaming completions"}]}
Use client/openai_completion.py.
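For anyone testing against the proxy, a minimal call with the OpenAI Python client looks roughly like this; the localhost:3030 address and "test" token are taken from the log above, while the model name vllm_model is an assumption, so substitute your own:

from openai import OpenAI

# Point the official OpenAI client at the openai_trtllm proxy.
# Address and token match the log above; the model name is assumed.
client = OpenAI(base_url="http://localhost:3030/v1", api_key="test")

completion = client.completions.create(
    model="vllm_model",
    prompt="Hello, my name is",
    max_tokens=64,
)
print(completion.choices[0].text)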