🤖
Hello,
The error you're encountering is likely due to the `tensor_parallel_size` argument of the `VLLM` class being set to a value greater than the number of GPUs available on your Kaggle Kernel's VM. The `tensor_parallel_size` argument specifies the number of GPUs to use for distributed tensor-parallel inference. If the Kaggle Kernel's VM only has one GPU, setting `tensor_parallel_size=2` would cause an error because there aren't enough GPUs available to distribute the workload across.

To resolve this issue, set `tensor_parallel_size` to a value less than or equal to the number of GPUs available on your VM. If you're unsure how many GPUs the VM has, you can check with the `torch.cuda.device_count()` function from PyTorch.
Here's an example of how to do this:
```python
import torch
from langchain.llms import VLLM

num_gpus = torch.cuda.device_count()

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=num_gpus,
    trust_remote_code=True,  # mandatory for hf models
)

llm("What is the future of AI?")
```
In this code, `num_gpus` is the number of GPUs available on your VM. The `tensor_parallel_size` argument is set to `num_gpus`, ensuring that you're not trying to use more GPUs than are available.
For more information, you can refer to the VLLM notebook in the LangChain repository.
I hope this helps! Let me know if you have any other questions.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Still, it got auto-killed.
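Return code -11 in the dashboard log corresponds to a segmentation fault in Ray's dashboard process, so one possible workaround, not verified in this thread, is to initialize Ray yourself with the dashboard disabled before constructing the `VLLM` object. The sketch below assumes vLLM will reuse an already-running Ray instance; `include_dashboard` and `object_store_memory` are standard `ray.init` options, but the specific values are illustrative only.

```python
import ray
from langchain.llms import VLLM

# Hedged workaround sketch: start Ray without the dashboard process that
# segfaulted (return code -11), before vLLM tries to start Ray itself.
ray.init(
    num_gpus=2,                     # the two T4s on the Kaggle VM
    include_dashboard=False,        # skip the dashboard that failed to start
    object_store_memory=2 * 10**9,  # illustrative cap for the 13 GB VM
)

llm = VLLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    trust_remote_code=True,  # mandatory for hf models
)
print(llm("What is the capital of France ?"))
```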
Hi, @vipulgote1999! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
Based on my understanding, the issue you reported was related to setting `tensor_parallel_size=2` in the `VLLM` class from `langchain.llms`. It seems that the error you encountered was due to `tensor_parallel_size` being set to a value greater than the number of GPUs available on the Kaggle Kernel's VM. A suggested solution was to set `tensor_parallel_size` to a value less than or equal to the number of GPUs available. However, you mentioned that the issue still persists and provided a screenshot of the error.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository, and please let us know if you have any further questions or concerns!
System Info
`VLLM` from LangChain gives the error below and stops executing:
code:

```python
from langchain.llms import VLLM

llm = VLLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm("What is the capital of France ?"))
```

gives the error below when setting `tensor_parallel_size=2`, and runs successfully if we comment out the `tensor_parallel_size` argument:

error =>

```
2023-08-23 08:52:55,683 ERROR services.py:1207 -- Failed to start the dashboard, return code -11
2023-08-23 08:52:55,685 ERROR services.py:1232 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-08-23 08:52:55,687 ERROR services.py:1276 -- The last 20 lines of /tmp/ray/session_2023-08-23_08-52-52_882632_28/logs/dashboard.log (it contains the error message from the dashboard):
2023-08-23 08:52:55,607 INFO head.py:242 -- Starting dashboard metrics server on port 44227
2023-08-23 08:52:56,847 INFO worker.py:1636 -- Started a local Ray instance.
```
VM Details: CPU: 4 cores, RAM: 13 GB, GPU: 2× Nvidia T4, Instance: Kaggle Kernel
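Since the diagnosis above hinges on how many GPUs are actually visible inside the kernel, a quick sanity check along these lines (not part of the original report; it only uses standard PyTorch calls) can confirm both T4s are exposed before enabling tensor parallelism:

```python
import torch

# Confirm which GPUs PyTorch can see inside the Kaggle kernel.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```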
Who can help?
No response
Information
Related Components
Reproduction
code:

```python
from langchain.llms import VLLM

llm = VLLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm("What is the capital of France ?"))
```
On a Kaggle kernel.
VM Details: CPU: 4 cores, RAM: 13 GB, GPU: 2× Nvidia T4, Instance: Kaggle Kernel
Expected behavior
```
2023-08-23 08:52:55,683 ERROR services.py:1207 -- Failed to start the dashboard, return code -11
2023-08-23 08:52:55,685 ERROR services.py:1232 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-08-23 08:52:55,687 ERROR services.py:1276 -- The last 20 lines of /tmp/ray/session_2023-08-23_08-52-52_882632_28/logs/dashboard.log (it contains the error message from the dashboard):
2023-08-23 08:52:55,607 INFO head.py:242 -- Starting dashboard metrics server on port 44227
2023-08-23 08:52:56,847 INFO worker.py:1636 -- Started a local Ray instance.
```

and the cell stops running, i.e. execution stops.
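Since the report notes that the same snippet runs successfully when `tensor_parallel_size` is commented out, a single-GPU fallback along these lines (a hedged sketch, not a fix confirmed in this thread) should at least keep the cell running while the multi-GPU path is investigated:

```python
from langchain.llms import VLLM

# Single-GPU fallback: same configuration as the reproduction, but with
# tensor_parallel_size omitted so vLLM should not need to start a Ray cluster.
llm = VLLM(
    model="facebook/opt-125m",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)
print(llm("What is the capital of France ?"))
```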