sourabh-patil opened 2 days ago
It seems the issue is related to the 2080 (Turing architecture; I mention this because I think FlashAttention does not support Turing, but ideally it should work without FlashAttention, since the facebook OPT model works without it). I tested the same things on a 3080 (Ampere architecture) and it works. Let me know if there is a fix to make it work on the 2080. Thanks! (If there is no solution for the 2080, you may close the issue.)
It looks like you're encountering two primary issues when trying to host "meta-llama/Llama-3.2-1B-Instruct" using the vLLM backend on your RTX 2080 Ti:
1. A missing Python.h header, which breaks compilation of Python C extensions.
2. A dtype compatibility problem: bfloat16 is not supported on your GPU, and the float16 (half) path hits an Ampere-only kernel.
Let's address each issue step by step.
Issue 1: missing Python.h header
The error message you're seeing:
/tmp/tmpkmxyr0hd/main.c:5:10: fatal error: Python.h: No such file or directory
5 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
indicates that the Python.h header file is missing. This header is part of the Python development package, which is required for compiling Python C extensions, something that many machine learning libraries rely on.
To resolve this issue, you'll need to install the Python development headers. Here's how you can do it based on your operating system:
For Ubuntu/Debian:
sudo apt-get update
sudo apt-get install python3-dev
After installing the development headers, try running your server command again. This should resolve the Python.h not found error.
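If you want to double-check that the headers are now visible, Python can report where it expects them (a small sanity check; the exact path varies by distribution):

import os
import sysconfig

# Directory where CPython expects its C headers (provided by python3-dev)
include_dir = sysconfig.get_paths()["include"]
header = os.path.join(include_dir, "Python.h")
print(f"{header} exists: {os.path.exists(header)}")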
Issue 2: dtype compatibility (bfloat16 / float16)
From your initial attempt, you received the following message:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
This suggests that your GPU does not support bfloat16 precision but does support float16 (also known as half precision). However, when you ran the model with half precision using --dtype=half, you received the following error:
Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
This indicates that an operation is being attempted that is only supported on NVIDIA Ampere (and newer) GPUs. Your RTX 2080 Ti is based on the Turing architecture, so that code path is not available.
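You can confirm what your card reports with a short PyTorch check (a quick diagnostic; Turing GPUs report compute capability 7.5, Ampere and newer report 8.0 or higher):

import torch

# Compute capability of the first visible GPU: (7, 5) == Turing, (8, 0) and above == Ampere or newer
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
print("bfloat16 supported:", torch.cuda.is_bf16_supported())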
Recommended workaround: use full precision (float32)
To avoid these hardware compatibility issues, you can run the model using full precision (float32). This will use more GPU memory but should bypass the errors related to unsupported operations on your GPU.
Modify your command as follows:
python3 server.py --model meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.8 --host XXXX --port 8008 --dtype=float32
Note: Ensure that your GPU has enough memory to handle the increased memory requirements of float32 precision.
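As a rough estimate, a model of this size needs about 4 bytes per parameter in float32, so around 4-5 GB for the weights alone before the KV cache and activations. You can check what is actually free on the card with PyTorch (a minimal sketch):

import torch

# Free vs. total memory on GPU 0, to judge whether float32 weights plus KV cache will fit
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free: {free_bytes / 1e9:.1f} GB / total: {total_bytes / 1e9:.1f} GB")

# Rough float32 weight footprint for a ~1.2B-parameter model (estimate only)
print(f"approx. float32 weights: {1.2e9 * 4 / 1e9:.1f} GB (excluding KV cache and activations)")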
To ensure that all dependencies are correctly installed and to replicate an environment that is known to work, you might consider using Docker. This can help avoid issues related to system libraries and binary compatibility.
Here's a Docker command that sets up an appropriate environment:
docker run -ti \
--gpus all \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
nvcr.io/nvidia/pytorch:24.09-py3 \
bash
Explanation of flags:
- --gpus all: Grants the container access to all GPUs.
- --network=host: Uses the host network stack.
- --ipc=host: Shares the host's IPC namespace, which can improve performance.
- --ulimit memlock=-1 and --ulimit stack=67108864: Adjust container limits to allow larger memory allocations, which some deep learning frameworks require.
- nvcr.io/nvidia/pytorch:24.09-py3: Specifies the NVIDIA PyTorch Docker image to use.
Once inside the container, install your Python dependencies:
pip install nvidia-pytriton
In the example folder (examples/vllm in the pytriton repository), you can run the install script:
bash install.sh
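Before starting the server, you can confirm that the key packages resolved correctly (a minimal check using the standard importlib.metadata API; it assumes install.sh pulled in vllm alongside nvidia-pytriton):

from importlib.metadata import version

# Print the installed versions of the packages this example depends on
for pkg in ("nvidia-pytriton", "vllm"):
    print(pkg, version(pkg))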
Then, run your server as you normally would:
python3 server.py --model meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.8 --host XXXX --port 8008 --dtype=float32
Note: Using Docker is optional but can greatly simplify environment management, especially for complex machine learning setups.
After your server is running, you can test the model using curl:
curl http://localhost:8008/v2/models/meta_llama_Llama_3.2_1B_Instruct/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
You should receive a coherent response generated by the model:
{"model_name":"meta_llama_Llama_3.2_1B_Instruct","model_version":"1","text":"San Francisco is a city that is full of life, energy, and diversity. From its iconic Golden Gate Bridge to its vibrant neighborhoods, there's always something new to explore. Here are some of the top things to do in San Francisco:\n\n**Must-see attractions:**\n\n1. **Golden Gate Bridge**: An iconic symbol of San Francisco, this suspension bridge offers stunning views of the city and the bay.\n2. **Alcatraz Island**: Take a ferry to this former prison turned national park, where you can learn about its infamous history.\n3. **Fisherman's Wharf**: This bustling waterfront district is perfect for seafood, street performers, and"}
In summary:
- Installing the Python development headers resolves the Python.h not found error.
- Running in full precision (float32) avoids precision-related compatibility issues with your RTX 2080 Ti.
Let me know if you have any questions or need more assistance!
Description
I want to host "meta-llama/Llama-3.2-1B-Instruct" using the vllm backend on a pytriton server. I can run other models like "facebook_opt_350m" using the same server code shared in this repo.
I am using example shared at https://github.com/triton-inference-server/pytriton/blob/main/examples/vllm/server.py
The final log message suggests that the model has been hosted successfully (I1029 06:22:31.765535 431086 model_lifecycle.cc:839] "successfully loaded 'meta_llama_Llama_3.2_1B_Instruct'"), but when I try to get a response using curl it fails. It works successfully with "facebook_opt_350m", though.
To reproduce
Run
https://github.com/triton-inference-server/pytriton/blob/main/examples/vllm/server.py
with the following arguments.
I am using half, because without half precision (with bfloat16) it showed me this message
Observed results and expected behavior
Pasting the first few lines of the error output.
It seems that the error is thrown from the vLLM side while doing the forward pass on the model.
Environment
Is this issue related to my GPU? (NVIDIA GeForce RTX 2080 Ti; it has the Turing architecture, I think.)