triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

tritonserver crash (SIGNAL 11) when OpenTelemetry trace is enabled for trtllm backend #371

Closed npuichigo closed 5 months ago

npuichigo commented 6 months ago

System Info

GPU: H100
Image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
TensorRT-LLM version: 0.8.0

Who can help?

@kaiyux @byshiue @schetlur-nv

Reproduction

The TensorRT-LLM model I use is Baichuan, and I followed the official guidance to do the model conversion:

# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

# Clone Baichuan2 model
git clone -b v2.0 https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat

# Use fp8
python3 TensorRT-LLM/examples/quantization/quantize.py \
       --model_dir ./Baichuan2-13B-Chat \
       --dtype float16 \
       --qformat fp8 \
       --output_dir ./quantized_fp8 \
       --calib_size 256

trtllm-build --checkpoint_dir ./quantized_fp8 \
             --output_dir ./trt_engines/baichuan_v2_13b_fp8_bs32_4096_1024/ \
             --gemm_plugin float16 \
             --max_batch_size 32 \
             --max_input_len 4096 \
             --max_output_len 1024

I then followed the official guidance at https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.8.0/all_models/inflight_batcher_llm to build an ensemble for my model.
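
Populating those templates looked roughly like the following. This is only a sketch based on the v0.8.0 README: the substitution keys and values passed to tools/fill_template.py (tokenizer_dir, tokenizer_type, triton_max_batch_size, the instance counts) are illustrative assumptions and can differ between releases.

# Copy the inflight_batcher_llm templates into the model repository
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* /models/

# Fill in the per-model config.pbtxt templates (keys shown here are illustrative)
python3 tensorrtllm_backend/tools/fill_template.py -i /models/preprocessing/config.pbtxt \
    tokenizer_dir:/models/preprocessing/Baichuan2-13B-Chat-Tokenizer,tokenizer_type:auto,triton_max_batch_size:32,preprocessing_instance_count:1

python3 tensorrtllm_backend/tools/fill_template.py -i /models/postprocessing/config.pbtxt \
    tokenizer_dir:/models/postprocessing/Baichuan2-13B-Chat-Tokenizer,tokenizer_type:auto,triton_max_batch_size:32,postprocessing_instance_count:1

# tensorrt_llm/config.pbtxt and ensemble/config.pbtxt are filled the same way;
# their keys (engine_dir, decoupled_mode, batching settings, ...) vary by release.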

Also, since Baichuan needs the sentencepiece Python package, I followed https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#creating-custom-execution-environments to build a Python execution environment with sentencepiece installed.
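
For completeness, the custom-py environment was built roughly as follows, a sketch following the python_backend README; the environment name, Python version, and package list here are assumptions for this setup.

# Create and activate a conda environment with the packages the tokenizer needs
conda create -y -n custom-py python=3.10
conda activate custom-py

# Recommended by the python_backend README so packed libraries resolve correctly
export PYTHONNOUSERSITE=True

pip install sentencepiece transformers

# Pack the environment so it can live inside the model repository
pip install conda-pack
conda-pack -n custom-py -o custom-py.tar.gz

# The tarball (or the unpacked custom-py directory shown in the tree below) is then
# referenced from preprocessing/config.pbtxt and postprocessing/config.pbtxt via the
# EXECUTION_ENV_PATH parameter.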

After that, the final model repository looks like:

├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   ├── model.py
│   │   └── __pycache__
│   ├── Baichuan2-13B-Chat-Tokenizer
│   │   ├── special_tokens_map.json
│   │   ├── tokenization_baichuan.py
│   │   ├── tokenizer_config.json
│   │   └── tokenizer.model
│   ├── config.pbtxt
│   └── custom-py
│       ├── bin
│       ├── compiler_compat
│       ├── conda-meta
│       ├── include
│       ├── lib
│       ├── man
│       ├── share
│       ├── ssl
│       ├── x86_64-conda_cos7-linux-gnu
│       └── x86_64-conda-linux-gnu
├── preprocessing
│   ├── 1
│   │   ├── model.py
│   │   └── __pycache__
│   ├── Baichuan2-13B-Chat-Tokenizer
│   │   ├── special_tokens_map.json
│   │   ├── tokenization_baichuan.py
│   │   ├── tokenizer_config.json
│   │   └── tokenizer.model
│   ├── config.pbtxt
│   └── custom-py
│       ├── bin
│       ├── compiler_compat
│       ├── conda-meta
│       ├── include
│       ├── lib
│       ├── man
│       ├── share
│       ├── ssl
│       ├── x86_64-conda_cos7-linux-gnu
│       └── x86_64-conda-linux-gnu
└── tensorrt_llm
    ├── 1
    │   ├── config.json
    │   └── rank0.engine
    └── config.pbtxt

Now I launch tritonserver with OpenTelemetry tracing configured like this:

tritonserver --model-repository=/models \
                   --trace-config mode=opentelemetry \
                   --trace-config opentelemetry,url=http://opentelemetry-collector.observability.svc.cluster.local:4318/v1/traces

and call it with a W3C traceparent header attached to trigger a trace:

curl -i -H "traceparent: 00-80e1afed08e019fc1110464cfa66635c-7a085853722dc6d2-01" -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "who are u?", "max_tokens": 256, "bad_words": "", "stop_words": ""}'

Now the server crashes:

I0310 06:25:38.962361 1 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0310 06:25:38.962557 1 http_server.cc:4637] Started HTTPService at 0.0.0.0:8000
I0310 06:25:39.003694 1 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002
Signal (11) received.
 0# 0x000055815783D8AD in tritonserver
 1# 0x00007F9D4E52F520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F9D4EF36BB5 in /opt/tritonserver/bin/../lib/libtritonserver.so
 3# 0x00007F9D4F03F199 in /opt/tritonserver/bin/../lib/libtritonserver.so
 4# 0x00007F9D4F057D2E in /opt/tritonserver/bin/../lib/libtritonserver.so
 5# 0x00007F9D4EF50CEE in /opt/tritonserver/bin/../lib/libtritonserver.so
 6# 0x00007F9D4EFB9E52 in /opt/tritonserver/bin/../lib/libtritonserver.so
 7# 0x00007F9D4F094D98 in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x00007F9D4EF60D59 in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007F9D4EF68A4F in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F9D4EFB9E52 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F9D4F094D98 in /opt/tritonserver/bin/../lib/libtritonserver.so
12# TRITONSERVER_ServerInferAsync in /opt/tritonserver/bin/../lib/libtritonserver.so
13# 0x000055815799CBDC in tritonserver
14# 0x000055815799FBEB in tritonserver
15# 0x0000558157F55BA5 in tritonserver
16# 0x0000558157F5A405 in tritonserver
17# 0x0000558157F587BE in tritonserver
18# 0x0000558157F67820 in tritonserver
19# 0x0000558157F70150 in tritonserver
20# 0x0000558157F70BC7 in tritonserver
21# 0x0000558157F5C792 in tritonserver
22# 0x00007F9D4E581AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
23# 0x00007F9D4E613850 in /usr/lib/x86_64-linux-gnu/libc.so.6

Expected behavior

Since it's an ensemble model, I tested parts of it individually (e.g., preprocessing) to validate that tracing works at least for the Python backend:

curl -i -H "traceparent: 00-80e1afed08e019fc1110464cfa66635c-7a085853722dc6d2-01" -X POST localhost/v2/models/preprocessing/generate -d '{"QUERY": "who are u?", "REQUEST_OUTPUT_LEN": 20}'

Actual behavior

The server crashes with the same SIGSEGV (signal 11) stack trace shown above under Reproduction.

Additional notes

The OpenTelemetry endpoint is an opentelemetry-collector instance set up to accept traces.
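
As a sanity check, the collector's OTLP/HTTP endpoint can be exercised directly with an empty export request (the empty resourceSpans payload below is just an illustrative probe); an HTTP 200 confirms that the URL configured via --trace-config is reachable from where tritonserver runs:

curl -i -X POST \
     -H "Content-Type: application/json" \
     -d '{"resourceSpans": []}' \
     http://opentelemetry-collector.observability.svc.cluster.local:4318/v1/traces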

npuichigo commented 6 months ago

Update: if --trace-config level=TIMESTAMPS is provided, it works fine. With the default --trace-config level=OFF, the request just hangs; then, after Ctrl-C and trying again, the server crashes.

oandreeva-nv commented 6 months ago

Please make sure to start OpenTelemetry tracing with --trace-config level=TIMESTAMPS, since by default the level is OFF. The segfault issue will be fixed in Triton starting with 24.03, but if you don't specify the level, spans will not be generated and sent from the Triton side.
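
In other words, until 24.03 ships, the launch command from the report should set the trace level explicitly, for example:

tritonserver --model-repository=/models \
             --trace-config level=TIMESTAMPS \
             --trace-config mode=opentelemetry \
             --trace-config opentelemetry,url=http://opentelemetry-collector.observability.svc.cluster.local:4318/v1/traces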

oandreeva-nv commented 5 months ago

This issue should be fixed in 24.03.