xiaoToby opened this issue 2 months ago
@xiaoToby Looking at that repo, it looks like it's just using FastAPI. You may have some luck if you give it a shot, but my biggest concern is that you may run into rambling output from the Instruct model if you don't manually account for the prompt template changes that we note here in llama-recipes.
If you're looking for a quick way to spin up a server, another option is the latest vLLM, which already works with Llama 3. It gives you a quick way to spin up a server, and then you can easily hit it with curl.
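For reference, the prompt-template issue mentioned above can be sketched as a small helper. This is a minimal sketch, not the llama-recipes implementation; the special tokens follow Meta's published Llama 3 Instruct format:

```python
def format_llama3_instruct(user_msg: str, system_msg: str = "") -> str:
    """Build a Llama 3 Instruct prompt using the model's special tokens.

    Without this exact turn structure, the Instruct model tends to
    ramble past the intended stopping point.
    """
    prompt = "<|begin_of_text|>"
    if system_msg:
        prompt += ("<|start_header_id|>system<|end_header_id|>\n\n"
                   f"{system_msg}<|eot_id|>")
    prompt += ("<|start_header_id|>user<|end_header_id|>\n\n"
               f"{user_msg}<|eot_id|>"
               # Open the assistant turn so the model writes the reply;
               # generation should stop at the next <|eot_id|>.
               "<|start_header_id|>assistant<|end_header_id|>\n\n")
    return prompt
```

Configuring `<|eot_id|>` as a stop token in the serving stack is the other half of avoiding the rambling.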
Thanks for your help, it helps a lot! But there is an obstacle: vLLM's CUDA and Python version requirements are strict, so it's not very convenient to use.
@WoosukKwon - can you help with this question on vLLM?
Hi @jspisak, thanks for letting me know about the issue!
@xiaoToby Which CUDA and Python versions are you using? You can simply install vLLM by running pip install vllm. It works for Python 3.8 - 3.11, the same Python versions supported by PyTorch. As for the CUDA version, the PyPI wheels use CUDA 12.1 and can run on machines with NVIDIA driver >= 530.30.02 (you don't need to install the CUDA SDK). We also provide CUDA 11.8 wheels in our releases.
I tried vLLM in a Docker image (nvidia/cuda:12.2.2-devel-ubuntu22.04), and it works fine.
I urgently want to use Llama 3 this way: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/tree/main/scripts/openai_server_demo
My question is: can I use Llama 3 with the same file, just downloading the model and changing the model name in the file?
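If that script is a thin OpenAI-compatible wrapper, then on the client side only the `model` field in the request should need to change. A minimal sketch of such a request body (the model name shown is an assumption; use whatever name the server was launched with):

```python
import json


def build_chat_request(model: str, user_msg: str) -> str:
    """Build an OpenAI-style /v1/chat/completions request body.

    Swapping models should only require changing the `model` field
    (and serving the corresponding weights on the backend).
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": 128,
    }
    return json.dumps(payload)


# Hypothetical model name for illustration.
body = build_chat_request("meta-llama/Meta-Llama-3-8B-Instruct", "Hello!")
```

You would then POST this body to the server's chat-completions endpoint (for vLLM's OpenAI-compatible server, http://localhost:8000/v1/chat/completions by default) with curl or requests.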
@jspisak @astonzhang @gitkwr @ruanslv @HamidShojanazeri