vllm-project / vllm


[Usage]: 125m parameter model is also showing a CUDA out-of-memory error on a 16GB Nvidia 4060 #8136

Closed: shubh9m closed this issue 1 month ago

shubh9m commented 2 months ago

How would you like to use vllm

Even for a smaller model like "facebook/opt-125m", when I try to do multiprocessing (even with a batch size of 2) on a single 16 GB Nvidia 4060, I encounter a CUDA out-of-memory error. When I run the same model sequentially, it works fine. Can you explain this?


jeejeelee commented 2 months ago

You can try reducing gpu_memory_utilization
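
For reference, gpu_memory_utilization is a constructor argument of LLM and defaults to 0.9 (vLLM pre-allocates that fraction of the GPU for weights plus KV cache). A minimal sketch, with an illustrative value you would tune for your own setup:

from vllm import LLM

# Reserve only ~30% of the GPU for this engine instead of the default 90%.
# Illustrative value; tune it for your model and workload.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.3, enforce_eager=True)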

shubh9m commented 2 months ago

I have tried that. The sequential code takes only 0.3 GB. With multiprocessing and a batch size of only 2, it takes more than 16 GB. That's the part I am unable to understand.

DarkLight1337 commented 1 month ago

Make sure the GPU is not already used by another process.

shubh9m commented 1 month ago

It is completely free.

DarkLight1337 commented 1 month ago

Can you show the code that you use to perform multiprocessing?

shubh9m commented 1 month ago

from vllm import LLM, SamplingParams
from typing import List, Dict
import multiprocessing

# Function to initialize the LLM in each process
def init_llm():
    return LLM(model="distilgpt2", enforce_eager=True)

# Define LLMPredictor class
class LLMPredictor:
    def __init__(self, llm):
        self.llm = llm

    def predict(self, texts: List[str]) -> List[Dict[str, str]]:
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
        outputs = self.llm.generate(texts, sampling_params)
        results = []
        for output in outputs:
            result = {
                "prompt": output.prompt,
                "generated_text": ' '.join([o.text for o in output.outputs])
            }
            results.append(result)
        return results

# Function to process a batch of texts
def process_batch(batch: List[str]) -> List[Dict[str, str]]:
    llm = init_llm()
    predictor = LLMPredictor(llm)
    return predictor.predict(batch)

# Read the text file
def read_text_file(file_path: str) -> List[str]:
    with open(file_path, 'r') as file:
        lines = file.readlines()
    return [line.strip() for line in lines][:30]

# Load your text file
texts = read_text_file('/content/prompt300.txt')  # Modify to your text file path

# Split data into batches
batch_size = 2
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# Perform distributed inference using multiprocessing.Pool
if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        results = pool.map(process_batch, batches)

    # Flatten the list of results
    flat_results = [item for sublist in results for item in sublist]

    # Print results
    # for result in flat_results:
    #     print(f"Prompt: {result['prompt']}")
    #     print(f"Generated Text: {result['generated_text']}")
    #     print("="*50)  # Separator for readability

DarkLight1337 commented 1 month ago

How many processes are being created this way? Since the processes are not managed by vLLM, each process will allocate GPU memory separately. So your gpu_memory_utilization should be at most 1 / num_processes for each process, otherwise the combined usage will exceed 100%.
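
In other words, the budget has to be split across the processes up front. A rough sketch of how init_llm from the snippet above could account for this (NUM_PROCESSES is whatever pool size you choose; the 0.9 numerator just leaves headroom for the CUDA context and other overhead):

from vllm import LLM

NUM_PROCESSES = 2  # must match the pool size you create

def init_llm():
    # Split the GPU evenly between the worker processes, keeping some
    # headroom for the CUDA context and framework overhead.
    return LLM(
        model="distilgpt2",
        gpu_memory_utilization=0.9 / NUM_PROCESSES,
        enforce_eager=True,
    )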

shubh9m commented 1 month ago

I have tried limiting the number of processes to 2 and setting gpu_memory_utilization to 0.4, but now the error is:

No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

DarkLight1337 commented 1 month ago

Just to make sure, can you run this with number of processes = 1 using the same code?

shubh9m commented 1 month ago

Yep, it works when I set it to 1. Does vLLM have a built-in class for doing this kind of multiprocessing, since here I am doing it explicitly?

DarkLight1337 commented 1 month ago

@youkaichao imo this is somewhat unconventional usage but can you take a look at this?

youkaichao commented 1 month ago

This is not the recommended usage; creating an LLM instance in every process is error-prone.

I would suggest just spinning up an OpenAI-compatible API server and using web requests to get the results.
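
As a sketch of that approach: start the server separately (for example with python -m vllm.entrypoints.openai.api_server --model distilgpt2, which listens on port 8000 by default), and then each worker only sends HTTP requests, so a single server instance owns the GPU. The endpoint and payload below follow the OpenAI completions format the server exposes; adapt paths and parameters to your setup.

import multiprocessing
from typing import Dict, List

import requests

API_URL = "http://localhost:8000/v1/completions"  # default vLLM OpenAI-compatible endpoint

def process_batch(batch: List[str]) -> List[Dict[str, str]]:
    # Each worker only issues HTTP requests; no LLM instance (and no GPU memory) per process.
    results = []
    for prompt in batch:
        response = requests.post(API_URL, json={
            "model": "distilgpt2",
            "prompt": prompt,
            "max_tokens": 64,
            "temperature": 0.8,
            "top_p": 0.95,
        })
        results.append({
            "prompt": prompt,
            "generated_text": response.json()["choices"][0]["text"],
        })
    return results

if __name__ == '__main__':
    batches = [["Hello there", "The weather today is"]]  # replace with your own batches
    with multiprocessing.Pool() as pool:
        all_results = pool.map(process_batch, batches)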

shubh9m commented 1 month ago

Is there any way other than using the OpenAI API server, or is it possible to have a single LLM instance shared by multiple processes?

youkaichao commented 1 month ago

Why not the OpenAI API server?
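
For completeness: the simplest alternative to explicit multiprocessing is to keep a single LLM instance in one process and pass it the whole list of prompts; llm.generate batches the requests internally, so the manual batch splitting above is usually unnecessary. A sketch, reusing read_text_file from the snippet earlier in the thread:

from vllm import LLM, SamplingParams

llm = LLM(model="distilgpt2", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Pass all prompts at once; vLLM schedules and batches them internally.
texts = read_text_file('/content/prompt300.txt')
outputs = llm.generate(texts, sampling_params)

results = [
    {"prompt": o.prompt, "generated_text": ' '.join(c.text for c in o.outputs)}
    for o in outputs
]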