triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0
648 stars · 93 forks

Request blocked when gpt_model_type=inflight_fused_batching while serving a Baichuan model #164

Open · burling opened this issue 9 months ago

burling commented 9 months ago

Hello,

I am currently experiencing an issue with triton-inference-server/tensorrtllm_backend while trying to run a Baichuan model.

Description

I have set gpt_model_type=inflight_fused_batching in my model configuration, but when I send a request to the server on port 8000, the request stays in processing indefinitely, with no output whatsoever.
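
For reference, gpt_model_type is set through the parameters block of the tensorrt_llm model's config.pbtxt. Below is a minimal sketch of the relevant entries, assuming the standard layout from the backend's example model repository; the engine path is a placeholder, not my actual path:

    # triton_model_repo/tensorrt_llm/config.pbtxt (excerpt)
    parameters: {
      key: "gpt_model_type"
      value: {
        string_value: "inflight_fused_batching"
      }
    }
    parameters: {
      key: "gpt_model_path"
      value: {
        # placeholder: directory containing the compiled Baichuan TensorRT-LLM engine
        string_value: "/path/to/baichuan/engine_dir"
      }
    }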

Triton Information

I am using the latest commit from the main branch (e8ae70c583f8353a7dfebb1b424326a633b9360e). Here is my GPU device info:

[screenshot: GPU device info]

To Reproduce

Steps to reproduce the behavior:

  1. Set gpt_model_type=inflight_fused_batching in model configuration.

  2. Send a request to the Triton server on port 8000.

  3. Observe that the request stays in processing with no output.

    [screenshot: request stuck in processing with no output]
  4. Some possibly related information captured with pstack:

    [screenshot: pstack backtrace]

I would expect the server to process the request and return the generated output.

Thank you for your help.

byshiue commented 9 months ago

I couldn't reproduce this issue. Will continue investigating.

amir1m commented 7 months ago

Hi @byshiue, I am facing a similar issue (if not exactly the same).

After running curl, the command prompt just hangs.

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
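
To rule out a merely slow response, the same request can be bounded with curl's --max-time option (the 30-second value below is arbitrary), so curl aborts instead of blocking indefinitely:

curl --max-time 30 -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'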

amir1m commented 7 months ago

Hi @burling , Were you able to resolve the issue?

Thanks.

dwq370 commented 1 week ago

I hit the same issue with tensorrt-llm v0.10.0.