triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Cannot host "meta-llama/Llama-3.2-1B-Instruct" using vllm backend on RTX 2080 #86

Open sourabh-patil opened 2 days ago

sourabh-patil commented 2 days ago

Description

I want to host "meta-llama/Llama-3.2-1B-Instruct" using the vLLM backend on a PyTriton server. I can run other models like "facebook_opt_350m" using the same server code shared in this repo.

I am using the example shared at https://github.com/triton-inference-server/pytriton/blob/main/examples/vllm/server.py

The log ends with a message suggesting the model was hosted successfully, I1029 06:22:31.765535 431086 model_lifecycle.cc:839] "successfully loaded 'meta_llama_Llama_3.2_1B_Instruct'", but when I try to get a response using curl, it fails. The same setup works for "facebook_opt_350m", though.

To reproduce

Run

https://github.com/triton-inference-server/pytriton/blob/main/examples/vllm/server.py

with the following arguments:

 python3 server.py --model meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.8 --host XXXX --port 8008 --dtype=half

I am using --dtype=half because without it (i.e. with bfloat16) I got this message:

 ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.

Observed results and expected behavior

Pasting the first few lines of the error log:

INFO 10-29 11:54:00 async_llm_engine.py:207] Added request 46167d8f167740089be3eda8344c0565.
/tmp/tmpkmxyr0hd/main.c:5:10: fatal error: Python.h: No such file or directory
    5 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
INFO 10-29 11:54:00 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241029-115400.pkl...
WARNING 10-29 11:54:00 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
ERROR 10-29 11:54:00 async_llm_engine.py:64] Engine background task failed
ERROR 10-29 11:54:00 async_llm_engine.py:64] Traceback (most recent call last):
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-29 11:54:00 async_llm_engine.py:64]     return func(*args, **kwargs)
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1658, in execute_model
ERROR 10-29 11:54:00 async_llm_engine.py:64]     hidden_or_intermediate_states = model_executable(
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 11:54:00 async_llm_engine.py:64]     return self._call_impl(*args, **kwargs)
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 11:54:00 async_llm_engine.py:64]     return forward_call(*args, **kwargs)
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 558, in forward
ERROR 10-29 11:54:00 async_llm_engine.py:64]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 11:54:00 async_llm_engine.py:64]     return self._call_impl(*args, **kwargs)
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 11:54:00 async_llm_engine.py:64]     return forward_call(*args, **kwargs)
ERROR 10-29 11:54:00 async_llm_engine.py:64]   File "/media/HDD4TB1/sourabh/llama3.2/env_llama32/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 347, in forward
ERROR 10-29 11:54:00 async_llm_engine.py:64]     hidden_states, residual = layer(positions, hidden_states,

It seems the error is thrown from the vLLM side while running the forward pass on the model.

Environment

OS: Ubuntu 22.04
Python 3.10.12

Package                           Version
--------------------------------- -------------
aiohappyeyeballs                  2.4.3
aiohttp                           3.10.10
aiosignal                         1.3.1
annotated-types                   0.7.0
anyio                             4.6.2.post1
async-timeout                     4.0.3
attrs                             24.2.0
Brotli                            1.1.0
certifi                           2024.8.30
charset-normalizer                3.4.0
click                             8.1.7
cloudpickle                       3.1.0
compressed-tensors                0.6.0
datasets                          3.0.2
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
einops                            0.8.0
exceptiongroup                    1.2.2
fastapi                           0.115.4
filelock                          3.16.1
frozenlist                        1.5.0
fsspec                            2024.9.0
gevent                            24.10.3
geventhttpclient                  2.0.2
gguf                              0.10.0
greenlet                          3.1.1
grpcio                            1.67.0
h11                               0.14.0
httpcore                          1.0.6
httptools                         0.6.4
httpx                             0.27.2
huggingface-hub                   0.26.2
idna                              3.10
importlib_metadata                8.5.0
interegular                       0.3.3
Jinja2                            3.1.4
jiter                             0.6.1
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
lark                              1.2.2
llvmlite                          0.43.0
lm-format-enforcer                0.10.6
markdown-it-py                    3.0.0
MarkupSafe                        3.0.2
mdurl                             0.1.2
mistral_common                    1.4.4
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.18.6
multidict                         6.1.0
multiprocess                      0.70.16
mypy-extensions                   1.0.0
nest-asyncio                      1.6.0
networkx                          3.4.2
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.77
nvidia-nvtx-cu12                  12.1.105
nvidia-pytriton                   0.5.12
openai                            1.52.2
opencv-python-headless            4.10.0.84
outlines                          0.0.46
packaging                         24.1
pandas                            2.2.3
partial-json-parser               0.2.1.1.post4
pillow                            10.4.0
pip                               22.0.2
prometheus_client                 0.21.0
prometheus-fastapi-instrumentator 7.0.0
propcache                         0.2.0
protobuf                          4.25.5
psutil                            6.1.0
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           18.0.0
pycountry                         24.6.1
pydantic                          2.9.2
pydantic_core                     2.23.4
Pygments                          2.18.0
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-rapidjson                  1.20
pytz                              2024.2
PyYAML                            6.0.2
pyzmq                             26.2.0
ray                               2.38.0
referencing                       0.35.1
regex                             2024.9.11
requests                          2.32.3
rich                              13.9.3
rpds-py                           0.20.0
safetensors                       0.4.5
sentencepiece                     0.2.0
setuptools                        59.6.0
sh                                2.1.0
shellingham                       1.5.4
six                               1.16.0
sniffio                           1.3.1
starlette                         0.41.2
sympy                             1.13.3
tiktoken                          0.7.0
tokenizers                        0.20.1
torch                             2.4.0
torchvision                       0.19.0
tqdm                              4.66.6
transformers                      4.46.0
triton                            3.0.0
tritonclient                      2.51.0
typer                             0.12.5
typing_extensions                 4.12.2
typing-inspect                    0.9.0
tzdata                            2024.2
urllib3                           2.2.3
uvicorn                           0.32.0
uvloop                            0.21.0
vllm                              0.6.3.post1
watchfiles                        0.24.0
websockets                        13.1
wrapt                             1.16.0
xformers                          0.0.27.post2
xxhash                            3.5.0
yarl                              1.17.0
zipp                              3.20.2
zope.event                        5.0
zope.interface                    7.1.1

Is this issue related to my GPU? (NVIDIA GeForce RTX 2080 Ti; it has the Turing architecture, I think.)

sourabh-patil commented 2 days ago

It seems the issue is related to the 2080 (Turing architecture). I mention this because I think FlashAttention does not support Turing, although it should ideally still work without FlashAttention, since the facebook_opt model runs fine without it. I tested the same setup on a 3080 (Ampere architecture) and it works. Let me know if there is a fix for the 2080. Thanks! (If there is no solution for the 2080, feel free to close the issue.)

piotrm-nvidia commented 2 days ago

It looks like you're encountering two primary issues when trying to host "meta-llama/Llama-3.2-1B-Instruct" using the vLLM backend on your RTX 2080:

  1. Missing Python Development Headers
  2. Precision Compatibility with Your GPU

Let's address each issue step by step.


1. Missing Python Development Headers

The error message you're seeing:

/tmp/tmpkmxyr0hd/main.c:5:10: fatal error: Python.h: No such file or directory
    5 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.

indicates that the Python.h header file is missing. This header is part of the Python development package, which is required for compiling Python C extensions—something that many machine learning libraries rely on.

Solution: Install Python Development Headers

To resolve this issue, you'll need to install the Python development headers. Here's how you can do it based on your operating system:

For Ubuntu/Debian:

sudo apt-get update
sudo apt-get install python3-dev

After installing the development headers, try running your server code again. This should resolve the Python.h not found error.
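
To confirm the headers are now visible to your interpreter, you can run a quick check (a minimal sketch; it assumes a standard CPython layout):

# Quick check that the CPython development headers are installed
# (assumes a standard CPython layout; adjust if you use a custom build).
import os
import sysconfig

header = os.path.join(sysconfig.get_paths()["include"], "Python.h")
print(f"{header}: {'found' if os.path.exists(header) else 'MISSING'}")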


2. Precision Compatibility with Your GPU

From your initial attempt, you received the following message:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.

This suggests that your GPU does not support bfloat16 precision but does support float16 (also known as half precision). When I executed the model with half precision using --dtype=half, I received the following error:

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

This indicates that an operation is being attempted that is only supported on NVIDIA Ampere GPUs. Your RTX 2080 Ti is based on the older Turing architecture.

Solution: Use Full Precision (float32)

To avoid these hardware compatibility issues, you can run the model using full precision (float32). This will use more GPU memory but should bypass the errors related to unsupported operations on your GPU.

Modify your command as follows:

python3 server.py --model meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.8 --host XXXX --port 8008 --dtype=float32

Note: Ensure that your GPU has enough memory to handle the increased memory requirements of float32 precision.
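
If you want to detect a workable dtype programmatically before launching server.py, a small check like the following can help (a minimal sketch that mirrors the advice above, not vLLM's own selection logic):

# Pick a dtype the local GPU can actually run before launching server.py.
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    dtype = "bfloat16"  # Ampere or newer: bfloat16 is supported
else:
    dtype = "float32"   # pre-Ampere (e.g. Turing): avoid the half-precision kernel path
print(f"Compute capability {major}.{minor} detected; launch with --dtype={dtype}")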


3. Using Docker for a Consistent Environment (Optional)

To ensure that all dependencies are correctly installed and to replicate an environment that is known to work, you might consider using Docker. This can help avoid issues related to system libraries and binary compatibility.

Docker Command

Here's a Docker command that sets up an appropriate environment:

docker run -ti \
    --gpus all \
    --network=host \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    nvcr.io/nvidia/pytorch:24.09-py3 \
    bash

Explanation of Flags:

--gpus all: exposes all NVIDIA GPUs to the container.
--network=host: shares the host network stack, so the server port (e.g. 8008) is reachable directly.
--ipc=host: shares the host IPC namespace, which PyTorch needs for shared-memory tensors.
--ulimit memlock=-1 and --ulimit stack=67108864: lift the memory-lock and stack-size limits, as recommended for NGC PyTorch containers.

Inside the Docker Container

Once inside the container, install your Python dependencies:

pip install nvidia-pytriton

In the example folder, you can run the install script:

bash install.sh

Then, run your server as you normally would:

python3 server.py --model meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.8 --host XXXX --port 8008 --dtype=float32

Note: Using Docker is optional but can greatly simplify environment management, especially for complex machine learning setups.


4. Testing the Model

After your server is running, you can test the model using curl:

curl http://localhost:8008/v2/models/meta_llama_Llama_3.2_1B_Instruct/generate \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0
    }'

You should receive a coherent response generated by the model:

{"model_name":"meta_llama_Llama_3.2_1B_Instruct","model_version":"1","text":"San Francisco is a city that is full of life, energy, and diversity. From its iconic Golden Gate Bridge to its vibrant neighborhoods, there's always something new to explore. Here are some of the top things to do in San Francisco:\n\n**Must-see attractions:**\n\n1. **Golden Gate Bridge**: An iconic symbol of San Francisco, this suspension bridge offers stunning views of the city and the bay.\n2. **Alcatraz Island**: Take a ferry to this former prison turned national park, where you can learn about its infamous history.\n3. **Fisherman's Wharf**: This bustling waterfront district is perfect for seafood, street performers, and"}

Summary

  1. Install the Python development headers (sudo apt-get install python3-dev) to fix the Python.h compilation error.
  2. The RTX 2080 Ti (compute capability 7.5) does not support bfloat16, and the half-precision path hits an Ampere-only kernel assertion, so run with --dtype=float32.
  3. Optionally, run inside the nvcr.io/nvidia/pytorch:24.09-py3 container for a known-good environment.

Let me know if you have any questions or need more assistance!