Closed: jaywongs closed this issue 7 months ago.
I think you should use the gRPC port of Triton. Maybe it's 3001 in your case?
I used the default gRPC port (8001) to communicate with Triton.
I0407 07:25:59.905119 103 server.cc:677]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I0407 07:25:59.966723 103 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0407 07:25:59.969556 103 metrics.cc:770] Collecting CPU metrics
I0407 07:25:59.969690 103 tritonserver.cc:2508]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.43.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0407 07:25:59.971120 103 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0407 07:25:59.971319 103 http_server.cc:4637] Started HTTPService at 0.0.0.0:8000
I0407 07:26:00.012131 103 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002
and my docker-compose.yml is as follows:
version: "3"
services:
openai_trtllm:
image: openai_trtllm
build:
context: .
dockerfile: Dockerfile
command:
- "--host"
- "0.0.0.0"
- "--port"
- "3000"
- "--triton-endpoint"
- "http://tensorrtllm_backend:8001"
ports:
- "3000:3000"
depends_on:
- tensorrtllm_backend
restart: on-failure
# Triton backend for TensorRT LLM
tensorrtllm_backend:
image: triton_trt_llm:latest
volumes:
- /tensor/tensorrtllm_backend:/tensorrtllm_backend
command:
- "/bin/bash"
- "-c"
- |
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
tail -f /dev/null
ports:
- "8000:8000"
- "8001:8001"
- "8002:8002"
deploy:
replicas: 1
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [ gpu ]
shm_size: '2g'
ulimits:
memlock: -1
stack: 67108864
restart: on-failure
I think the port is right here, maybe?
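For what it's worth, a quick way to confirm the gRPC endpoint is reachable is a readiness check with the Triton Python client. This is only a minimal sketch; the host and port are assumptions based on the log and compose file above (inside the compose network the address would be tensorrtllm_backend:8001, from the host localhost:8001):

import tritonclient.grpc as grpcclient

# Address is an assumption: "localhost:8001" from the host,
# "tensorrtllm_backend:8001" from inside the compose network.
client = grpcclient.InferenceServerClient("localhost:8001")
print(client.is_server_ready())           # expect True
print(client.is_model_ready("ensemble"))  # expect True for the ensemble model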
Could you try https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py to check the ensemble?
The output of 'end_to_end_grpc_client.py' appears to be correct.
root@94b0bf360805:/tensorrtllm_backend/inflight_batcher_llm/client# python3 end_to_end_grpc_client.py --output-len 100 --prompt "What is machine learning?"
[b'What is machine learning?\n\nMachine learning is a field of computer science that uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.\n\n### What is supervised learning?\n\nSupervised learning is a type of machine learning algorithm used for regression and classification problems. In supervised learning, we are given a set of data points {x<sub>1</sub>, x<sub']
root@94b0bf360805:/tensorrtllm_backend/inflight_batcher_llm/client# python3 end_to_end_grpc_client.py --output-len 100 --prompt "What is machine learning?" --streaming
Machinelearningisafieldofcomputersciencethatusesstatisticaltechniquestogivecomputerstheabilityto"learn"(i.e.,progressivelyimproveperformanceonaspecifictask)withdata,withoutbeingexplicitlyprogrammed.
###Whatissupervisedlearning?
Supervisedlearningisatypeofmachinelearningalgorithmusedforregressionandclassificationproblems.Insupervisedlearning,wearegivenasetofdatapoints{x<sub>1</sub>,x<sub
The streaming-mode output lacked spaces, as mentioned in an issue on the tensorrtllm_backend repo.
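This matches the known behavior where streamed tokens are detokenized one at a time, which drops the leading whitespace that a SentencePiece tokenizer stores inside each token. A minimal illustration with the Hugging Face tokenizers API follows; the model id is an assumption and the exact outputs depend on the tokenizer:

# Illustration only: joining individually decoded tokens loses the leading
# spaces that a SentencePiece tokenizer encodes as part of each token ("▁word").
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")  # assumed model id
ids = tok("Machine learning is a field", add_special_tokens=False).input_ids
print(tok.decode(ids))                        # "Machine learning is a field"
print("".join(tok.decode([i]) for i in ids))  # typically "Machinelearningisafield"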
Could you use completion instead of chat completion, so the inputs are the same and easier to compare? It's also better to set RUST_LOG to debug.
I tested codellama and it works for me.
Thanks for your patience. I enabled debug mode and tried /v1/completions; here is the detailed log:
curl --location 'http://127.0.0.1:3000/v1/completions' \
> --header 'Content-Type: application/json' \
> --data '{
> "model": "ensemble",
> "prompt": "What is machine learning?",
> "max_tokens": 200,
> "temperature": 0
> }'
{"id":"cmpl-8f38b74e-1630-43c9-9586-f9e7d1e89345","object":"text_completion","created":1712719920,"model":"ensemble","choices":[{"text":"What is machine learning?","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
and the debug log:
openai_trtllm-1 | 2024-04-10T03:31:53.761914Z INFO openai_trtllm::routes::completions: request: Json(CompletionCreateParams { model: "ensemble", prompt: ["What is machine learning?"], best_of: 1, echo: false, frequency_penalty: 0.0, logit_bias: None, logprobs: None, max_tokens: 200, n: 1, presence_penalty: 0.0, seed: None, stop: None, stream: false, suffix: None, temperature: 0.0, top_p: 1.0, user: None })
openai_trtllm-1 | at src/routes/completions.rs:35 on tokio-runtime-worker
openai_trtllm-1 | in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:31:53.762164Z DEBUG tower::buffer::worker: service.ready: true, processing request
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197 on tokio-runtime-worker
openai_trtllm-1 | in openai_trtllm::routes::completions::non-streaming completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:31:53.762537Z DEBUG h2::codec::framed_write: send, frame: Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:31:53.762616Z DEBUG h2::codec::framed_write: send, frame: Data { stream_id: StreamId(3) }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:31:53.762642Z DEBUG h2::codec::framed_write: send, frame: Data { stream_id: StreamId(3), flags: (0x1: END_STREAM) }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:31:53.762899Z DEBUG h2::codec::framed_read: received, frame: Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 1] }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:31:53.762944Z DEBUG h2::codec::framed_write: send, frame: Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 1] }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:32:00.505616Z DEBUG h2::codec::framed_read: received, frame: Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:32:00.505727Z DEBUG h2::codec::framed_read: received, frame: Data { stream_id: StreamId(3) }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:32:00.505758Z DEBUG h2::codec::framed_read: received, frame: WindowUpdate { stream_id: StreamId(0), size_increment: 329 }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:32:00.505786Z DEBUG h2::codec::framed_read: received, frame: Headers { stream_id: StreamId(3), flags: (0x5: END_HEADERS | END_STREAM) }
openai_trtllm-1 | at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1 | in h2::proto::connection::Connection with peer: Client
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:32:00.506087Z DEBUG openai_trtllm::routes::completions: triton infer response: ModelInferResponse { model_name: "ensemble", model_version: "1", id: "", parameters: {"sequence_id": InferParameter { parameter_choice: Some(Int64Param(0)) }, "sequence_end": InferParameter { parameter_choice: Some(BoolParam(false)) }, "sequence_start": InferParameter { parameter_choice: Some(BoolParam(false)) }, "triton_final_response": InferParameter { parameter_choice: Some(BoolParam(true)) }}, outputs: [InferOutputTensor { name: "text_output", datatype: "BYTES", shape: [1], parameters: {}, contents: None }], raw_output_contents: [[25, 0, 0, 0, 87, 104, 97, 116, 32, 105, 115, 32, 109, 97, 99, 104, 105, 110, 101, 32, 108, 101, 97, 114, 110, 105, 110, 103, 63]] }
openai_trtllm-1 | at src/routes/completions.rs:166 on tokio-runtime-worker
openai_trtllm-1 | in openai_trtllm::routes::completions::non-streaming completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
openai_trtllm-1 |
openai_trtllm-1 | 2024-04-10T03:32:00.506152Z DEBUG openai_trtllm::routes::completions: deserialized triton infer response content: "What is machine learning?"
openai_trtllm-1 | at src/routes/completions.rs:170 on tokio-runtime-worker
openai_trtllm-1 | in openai_trtllm::routes::completions::non-streaming completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1 | in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
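Decoding the raw_output_contents above shows that the model returned only the echoed prompt with nothing appended (Triton serializes BYTES tensors as a 4-byte little-endian length prefix followed by the UTF-8 payload). A minimal decode sketch:

import struct

# Bytes copied from the ModelInferResponse log entry above.
raw = bytes([25, 0, 0, 0, 87, 104, 97, 116, 32, 105, 115, 32, 109, 97, 99,
             104, 105, 110, 101, 32, 108, 101, 97, 114, 110, 105, 110, 103, 63])

(length,) = struct.unpack_from("<I", raw, 0)  # 4-byte little-endian length prefix
text = raw[4:4 + length].decode("utf-8")
print(length, repr(text))                     # 25 'What is machine learning?'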
@npuichigo I have found the source of the problem. The parameter "top_p" is being sent in the gRPC request, and it is set by default in your code. If I comment out this parameter, the answer is normal. I am still trying to determine why it causes this.
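For anyone who wants to reproduce this directly against Triton, below is a minimal A/B sketch over gRPC, with and without the top_p tensor. It assumes the stock tensorrtllm_backend ensemble input names (text_input, max_tokens, top_p); shapes and dtypes may need adjusting for your model config:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")  # address is an assumption

def infer(prompt, top_p=None):
    # Input names/dtypes assume the stock tensorrtllm_backend ensemble config.
    text = np.array([[prompt]], dtype=object)
    max_tokens = np.array([[200]], dtype=np.int32)
    inputs = [
        grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
        grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(max_tokens)
    if top_p is not None:
        tp = np.array([[top_p]], dtype=np.float32)
        top_p_input = grpcclient.InferInput("top_p", list(tp.shape), "FP32")
        top_p_input.set_data_from_numpy(tp)
        inputs.append(top_p_input)
    result = client.infer("ensemble", inputs)
    return result.as_numpy("text_output").flatten()[0].decode("utf-8")

print(infer("What is machine learning?"))             # generates normally
print(infer("What is machine learning?", top_p=1.0))  # with top_p set, reproduces the echoed-prompt reply reported above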
Thanks for catching the bug. I kept the default values from OpenAI, which may not be suitable for an intermediate library like this. I'll consider removing these questionable defaults in a future patch.
Thank you for all your hard work. I will close this issue now.
First of all, thank you for your excellent work. The model I deployed is codellama. I built this project and two containers started. However, when I used the OpenAI API, it didn't output anything even though the GPU was actually working.
I ran the following Python example:
But nothing was output.
I ran the following curl command:
The response looked like this:
and the log info looked normal:
However, when I used curl inside the triton_trt_llm container:
The answer was normal. I'm wondering whether any of the parameters I passed were wrong, or whether there is documentation to guide me on what to do next.
Thanks again for your kind work!