npuichigo / openai_trtllm

OpenAI compatible API for TensorRT LLM triton backend
MIT License

Outputs nothing even though the GPU was working #37

Closed jaywongs closed 7 months ago

jaywongs commented 8 months ago

First of all, thank you for your excellent work. The model I deployed is CodeLlama. I built this project and two containers started. However, when I use the OpenAI API, it doesn't output anything even though the GPU is actually working.

I ran the following Python example:

python3 openai_completion_stream.py

But nothing was output.
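
For reference, the request that script makes is roughly the following (a minimal sketch with the openai>=1.0 Python client pointed at the proxy; the actual example in the repo may differ in details such as prompt and flags):

from openai import OpenAI

# Point the client at openai_trtllm instead of api.openai.com;
# the API key is a placeholder, the proxy should not require a real one.
client = OpenAI(base_url="http://127.0.0.1:3000/v1", api_key="dummy")

stream = client.completions.create(
    model="ensemble",
    prompt="What is machine learning?",
    max_tokens=200,
    temperature=0,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries a partial completion in choices[0].text
    print(chunk.choices[0].text, end="", flush=True)
print()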

I ran the following curl command:

curl --location 'http://127.0.0.1:3000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "ensemble",
    "messages": [
        {
            "role": "user",
            "content": "What is machine learning?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0,
    "stop": [""]
}'

The response looked like this:

{
    "id": "cmpl-8959e6ff-41c7-4abf-b9eb-38489480af04",
    "object": "text_completion",
    "created": 1712479707,
    "model": "ensemble",
    "system_fingerprint": null,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "User: What is machine learning?\nASSISTANT:"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0
    }
}

And the log info looks normal:

{
    "timestamp": "2024-04-07T09:00:51.437413Z",
    "level": "INFO",
    "message": "request: Json(ChatCompletionCreateParams { messages: [User { content: \"What is machine learning?\", name: None }], model: \"ensemble\", frequency_penalty: 0.0, logit_bias: None, max_tokens: 200, n: 1, presence_penalty: 0.0, response_format: None, seed: None, stop: Some([\"\"]), stream: false, temperature: 0.0, top_p: 1.0, user: None })",
    "target": "openai_trtllm::routes::chat",
    "span": {
        "headers": "{\"host\": \"127.0.0.1:3000\", \"user-agent\": \"curl/8.2.1\", \"accept\": \"*/*\", \"content-type\": \"application/json\", \"content-length\": \"220\"}",
        "name": "chat_completions"
    },
    "spans": [
        {
            "http.request.method": "POST",
            "http.route": "/v1/chat/completions",
            "network.protocol.version": "1.1",
            "otel.kind": "Server",
            "otel.name": "POST /v1/chat/completions",
            "server.address": "127.0.0.1:3000",
            "span.type": "web",
            "url.path": "/v1/chat/completions",
            "url.scheme": "",
            "user_agent.original": "curl/8.2.1",
            "name": "HTTP request"
        },
        {
            "headers": "{\"host\": \"127.0.0.1:3000\", \"user-agent\": \"curl/8.2.1\", \"accept\": \"*/*\", \"content-type\": \"application/json\", \"content-length\": \"220\"}",
            "name": "chat_completions"
        }
    ]
}

However, when I used curl inside the triton_trt_llm container:

curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{
    "text_input": "What is machine learning?",
    "max_tokens": 200,
    "stream": false,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2,
    "return_log_probs": false
}'

The answer came back as expected. I'm wondering whether any of the parameters I passed were wrong, or whether there is any documentation to guide me on what to do next.

Thanks again for your kind work!

npuichigo commented 8 months ago

I think you should use the gRPC port of Triton. Maybe it's 3001 in your case?

jaywongs commented 7 months ago

> I think you should use the gRPC port of Triton. Maybe it's 3001 in your case?

I used the default gRPC port (8001) to communicate with Triton.

I0407 07:25:59.905119 103 server.cc:677] 
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I0407 07:25:59.966723 103 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0407 07:25:59.969556 103 metrics.cc:770] Collecting CPU metrics
I0407 07:25:59.969690 103 tritonserver.cc:2508] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.43.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                                                                                                                                                          |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 1                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0407 07:25:59.971120 103 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0407 07:25:59.971319 103 http_server.cc:4637] Started HTTPService at 0.0.0.0:8000
I0407 07:26:00.012131 103 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002

And here is my docker-compose.yml:

version: "3"

services:
  openai_trtllm:
    image: openai_trtllm
    build:
      context: .
      dockerfile: Dockerfile
    command:
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "3000"
      - "--triton-endpoint"
      - "http://tensorrtllm_backend:8001"
    ports:
      - "3000:3000"
    depends_on:
      - tensorrtllm_backend
    restart: on-failure

  # Triton backend for TensorRT LLM
  tensorrtllm_backend:
    image: triton_trt_llm:latest
    volumes:
      - /tensor/tensorrtllm_backend:/tensorrtllm_backend
    command: 
      - "/bin/bash"
      - "-c"
      - |
        python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
        tail -f /dev/null
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    deploy:
      replicas: 1
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]
    shm_size: '2g'
    ulimits:
      memlock: -1
      stack: 67108864
    restart: on-failure

So I think the port is correct here, maybe?
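
To double-check, here is a rough connectivity test I can run from the host (a sketch, assuming tritonclient[grpc] is installed; the ports are the ones published in the compose file above):

import tritonclient.grpc as grpcclient

# From the host, the published port 8001 maps to Triton's gRPC service;
# from inside the openai_trtllm container the URL would be tensorrtllm_backend:8001.
client = grpcclient.InferenceServerClient(url="localhost:8001")
print("server live:   ", client.is_server_live())
print("server ready:  ", client.is_server_ready())
print("ensemble ready:", client.is_model_ready("ensemble"))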

npuichigo commented 7 months ago

Could you try https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py to check the ensemble?

jaywongs commented 7 months ago

> Could you try https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py to check the ensemble?

The output of 'end_to_end_grpc_client.py' appears to be correct.

root@94b0bf360805:/tensorrtllm_backend/inflight_batcher_llm/client# python3 end_to_end_grpc_client.py  --output-len 100         --prompt "What is machine learning?"
[b'What is machine learning?\n\nMachine learning is a field of computer science that uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.\n\n### What is supervised learning?\n\nSupervised learning is a type of machine learning algorithm used for regression and classification problems. In supervised learning, we are given a set of data points {x<sub>1</sub>, x<sub']

root@94b0bf360805:/tensorrtllm_backend/inflight_batcher_llm/client# python3 end_to_end_grpc_client.py  --output-len 100         --prompt "What is machine learning?" --streaming

Machinelearningisafieldofcomputersciencethatusesstatisticaltechniquestogivecomputerstheabilityto"learn"(i.e.,progressivelyimproveperformanceonaspecifictask)withdata,withoutbeingexplicitlyprogrammed.

###Whatissupervisedlearning?

Supervisedlearningisatypeofmachinelearningalgorithmusedforregressionandclassificationproblems.Insupervisedlearning,wearegivenasetofdatapoints{x<sub>1</sub>,x<sub

The streaming-mode output lacked spaces, as mentioned in an issue on the tensorrtllm_backend repo.
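
This looks like the usual per-token detokenization behavior: decoding each streamed token on its own drops the leading space that SentencePiece tokenizers attach to most tokens. A small illustration (just a sketch; the tokenizer name is an example, substitute whichever CodeLlama checkpoint you deployed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
ids = tok("Machine learning is a field of computer science", add_special_tokens=False).input_ids

# Decoding token by token strips the leading "▁" (space) from each piece ...
print("".join(tok.decode([i]) for i in ids))
# ... while decoding the accumulated ids preserves the word boundaries.
print(tok.decode(ids))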

npuichigo commented 7 months ago

Can you use completion instead of chat completion to make sure the inputs are the same for comparison? And it's better to set RUST_LOG to debug.

I tested CodeLlama and it works for me.

jaywongs commented 7 months ago

> Can you use completion instead of chat completion to make sure the inputs are the same for comparison? And it's better to set RUST_LOG to debug.
>
> I tested CodeLlama and it works for me.

Thanks for your patience. I enabled debug mode and tried /v1/completions; here is the log in detail:

curl --location 'http://127.0.0.1:3000/v1/completions' \
> --header 'Content-Type: application/json' \
> --data '{
>     "model": "ensemble",
>     "prompt": "What is machine learning?",
>     "max_tokens": 200,
>     "temperature": 0
>   }'
{"id":"cmpl-8f38b74e-1630-43c9-9586-f9e7d1e89345","object":"text_completion","created":1712719920,"model":"ensemble","choices":[{"text":"What is machine learning?","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

And the debug log:

openai_trtllm-1        |   2024-04-10T03:31:53.761914Z  INFO openai_trtllm::routes::completions: request: Json(CompletionCreateParams { model: "ensemble", prompt: ["What is machine learning?"], best_of: 1, echo: false, frequency_penalty: 0.0, logit_bias: None, logprobs: None, max_tokens: 200, n: 1, presence_penalty: 0.0, seed: None, stop: None, stream: false, suffix: None, temperature: 0.0, top_p: 1.0, user: None })
openai_trtllm-1        |     at src/routes/completions.rs:35 on tokio-runtime-worker
openai_trtllm-1        |     in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:31:53.762164Z DEBUG tower::buffer::worker: service.ready: true, processing request
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197 on tokio-runtime-worker
openai_trtllm-1        |     in openai_trtllm::routes::completions::non-streaming completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:31:53.762537Z DEBUG h2::codec::framed_write: send, frame: Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:31:53.762616Z DEBUG h2::codec::framed_write: send, frame: Data { stream_id: StreamId(3) }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:31:53.762642Z DEBUG h2::codec::framed_write: send, frame: Data { stream_id: StreamId(3), flags: (0x1: END_STREAM) }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:31:53.762899Z DEBUG h2::codec::framed_read: received, frame: Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 1] }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:31:53.762944Z DEBUG h2::codec::framed_write: send, frame: Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 1] }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:32:00.505616Z DEBUG h2::codec::framed_read: received, frame: Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:32:00.505727Z DEBUG h2::codec::framed_read: received, frame: Data { stream_id: StreamId(3) }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:32:00.505758Z DEBUG h2::codec::framed_read: received, frame: WindowUpdate { stream_id: StreamId(0), size_increment: 329 }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:32:00.505786Z DEBUG h2::codec::framed_read: received, frame: Headers { stream_id: StreamId(3), flags: (0x5: END_HEADERS | END_STREAM) }
openai_trtllm-1        |     at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405 on tokio-runtime-worker
openai_trtllm-1        |     in h2::proto::connection::Connection with peer: Client
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:32:00.506087Z DEBUG openai_trtllm::routes::completions: triton infer response: ModelInferResponse { model_name: "ensemble", model_version: "1", id: "", parameters: {"sequence_id": InferParameter { parameter_choice: Some(Int64Param(0)) }, "sequence_end": InferParameter { parameter_choice: Some(BoolParam(false)) }, "sequence_start": InferParameter { parameter_choice: Some(BoolParam(false)) }, "triton_final_response": InferParameter { parameter_choice: Some(BoolParam(true)) }}, outputs: [InferOutputTensor { name: "text_output", datatype: "BYTES", shape: [1], parameters: {}, contents: None }], raw_output_contents: [[25, 0, 0, 0, 87, 104, 97, 116, 32, 105, 115, 32, 109, 97, 99, 104, 105, 110, 101, 32, 108, 101, 97, 114, 110, 105, 110, 103, 63]] }
openai_trtllm-1        |     at src/routes/completions.rs:166 on tokio-runtime-worker
openai_trtllm-1        |     in openai_trtllm::routes::completions::non-streaming completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"
openai_trtllm-1        | 
openai_trtllm-1        |   2024-04-10T03:32:00.506152Z DEBUG openai_trtllm::routes::completions: deserialized triton infer response content: "What is machine learning?"
openai_trtllm-1        |     at src/routes/completions.rs:170 on tokio-runtime-worker
openai_trtllm-1        |     in openai_trtllm::routes::completions::non-streaming completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in openai_trtllm::routes::completions::completions with headers: {"host": "127.0.0.1:3000", "user-agent": "curl/8.2.1", "accept": "*/*", "content-type": "application/json", "content-length": "117"}
openai_trtllm-1        |     in otel::tracing::HTTP request with http.request.method: POST, network.protocol.version: 1.1, server.address: "127.0.0.1:3000", user_agent.original: "curl/8.2.1", url.path: "/v1/completions", url.scheme: "", otel.name: POST, otel.kind: Server, span.type: "web", http.route: "/v1/completions", otel.name: "POST /v1/completions"

jaywongs commented 7 months ago

@npuichigo I have found the source of the problem. The "top_p" parameter is being sent to Triton over gRPC, and it is set to a default value in your code. If I comment out this parameter, the answer comes back normally. I am still trying to determine what is causing it.
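
For comparison, this is roughly how I call the ensemble directly over gRPC with and without a top_p tensor (a sketch with tritonclient; the tensor names and dtypes are the ones the tensorrtllm_backend ensemble usually exposes, so adjust them if your config differs):

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def infer(send_top_p: bool) -> str:
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    def tensor(name, data):
        t = grpcclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
        t.set_data_from_numpy(data)
        return t

    inputs = [
        tensor("text_input", np.array([["What is machine learning?"]], dtype=object)),
        tensor("max_tokens", np.array([[200]], dtype=np.int32)),
        tensor("temperature", np.array([[0.0]], dtype=np.float32)),
    ]
    if send_top_p:
        # Mirrors the default top_p (1.0) that shows up in the proxy's request log above.
        inputs.append(tensor("top_p", np.array([[1.0]], dtype=np.float32)))

    result = client.infer("ensemble", inputs,
                          outputs=[grpcclient.InferRequestedOutput("text_output")])
    return result.as_numpy("text_output").flatten()[0].decode()

print("without top_p:", infer(False))
print("with top_p:   ", infer(True))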

npuichigo commented 7 months ago

Thanks for catching the bug. I kept the default values from OpenAI, which may not be suitable for an intermediate library like this. I'll consider removing the questionable default values in a future patch.

jaywongs commented 7 months ago

Thank you for all your hard work. I will close this issue now.