vllm-project / vllm


[Usage]: 125m parameter model is also showing a CUDA out-of-memory error on a 16GB Nvidia 4060 #8136

Closed: shubh9m closed this issue 1 month ago

shubh9m commented 2 months ago

How would you like to use vllm

Even for a smaller model like "facebook/opt-125m", when I try to do multiprocessing (even with a batch size of 2) on a single 16 GB Nvidia 4060, I encounter a CUDA out-of-memory error. When I run the same model sequentially, it works fine. Can you explain this?


jeejeelee commented 2 months ago

You can try reducing gpu_memory_utilization
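
For reference, gpu_memory_utilization is a constructor argument of LLM and defaults to 0.9 (vLLM pre-allocates that fraction of the GPU for weights plus KV cache). A minimal sketch, with an illustrative value you would tune for your own setup:

from vllm import LLM

# Reserve only ~30% of the GPU for this engine instead of the default 90%.
# Illustrative value; tune it for your model and workload.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.3, enforce_eager=True)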

shubh9m commented 2 months ago

I have tried that. The sequential code takes only 0.3 GB. With multiprocessing and a batch size of only 2, it takes more than 16 GB. That's the part I am unable to understand.

DarkLight1337 commented 1 month ago

Make sure the GPU is not already used by another process.

shubh9m commented 1 month ago

It is completely free.

DarkLight1337 commented 1 month ago

Can you show the code that you use to perform multiprocessing?

shubh9m commented 1 month ago

from vllm import LLM, SamplingParams
from typing import List, Dict
import multiprocessing

# Function to initialize the LLM in each process
def init_llm():
    return LLM(model="distilgpt2", enforce_eager=True)

# Define LLMPredictor class
class LLMPredictor:
    def __init__(self, llm):
        self.llm = llm

    def predict(self, texts: List[str]) -> List[Dict[str, str]]:
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
        outputs = self.llm.generate(texts, sampling_params)
        results = []
        for output in outputs:
            result = {
                "prompt": output.prompt,
                "generated_text": ' '.join([o.text for o in output.outputs])
            }
            results.append(result)
        return results

# Function to process a batch of texts
def process_batch(batch: List[str]) -> List[Dict[str, str]]:
    llm = init_llm()
    predictor = LLMPredictor(llm)
    return predictor.predict(batch)

# Read the text file
def read_text_file(file_path: str) -> List[str]:
    with open(file_path, 'r') as file:
        lines = file.readlines()
    return [line.strip() for line in lines][:30]

# Load your text file
texts = read_text_file('/content/prompt300.txt')  # Modify to your text file path

# Split data into batches
batch_size = 2
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# Perform distributed inference using multiprocessing.Pool
if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        results = pool.map(process_batch, batches)

    # Flatten the list of results
    flat_results = [item for sublist in results for item in sublist]

    # Print results
    # for result in flat_results:
    #     print(f"Prompt: {result['prompt']}")
    #     print(f"Generated Text: {result['generated_text']}")
    #     print("="*50)  # Separator for readability

DarkLight1337 commented 1 month ago

How many processes are being created this way? Since the processes are not managed by vLLM, each process will allocate GPU memory separately. So your gpu_memory_utilization should be at most 1 / num_processes for each process, otherwise the combined usage will exceed 100%.
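
In other words, the budget has to be split across the processes up front. A rough sketch of how init_llm from the snippet above could account for this (NUM_PROCESSES is whatever pool size you choose; the 0.9 numerator just leaves headroom for the CUDA context and other overhead):

from vllm import LLM

NUM_PROCESSES = 2  # must match the pool size you create

def init_llm():
    # Split the GPU evenly between the worker processes, keeping some
    # headroom for the CUDA context and framework overhead.
    return LLM(
        model="distilgpt2",
        gpu_memory_utilization=0.9 / NUM_PROCESSES,
        enforce_eager=True,
    )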

shubh9m commented 1 month ago

I have tried limiting the number of processes to 2 and setting gpu_memory_utilization to 0.4, but now the error is:

No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

DarkLight1337 commented 1 month ago

Just to make sure, can you run this with number of processes = 1 using the same code?

shubh9m commented 1 month ago

Yep, it works when I set it to 1. Does vLLM have a built-in class for doing this kind of multiprocessing, since here I am doing it explicitly?

DarkLight1337 commented 1 month ago

@youkaichao imo this is somewhat unconventional usage but can you take a look at this?

youkaichao commented 1 month ago

This is not the recommended usage; creating an LLM instance in every process is error-prone.

I would suggest just spinning up an OpenAI-compatible API server and using web requests to get the results.
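
As a sketch of that approach: start the server separately (for example with python -m vllm.entrypoints.openai.api_server --model distilgpt2, which listens on port 8000 by default), and then each worker only sends HTTP requests, so a single server instance owns the GPU. The endpoint and payload below follow the OpenAI completions format the server exposes; adapt paths and parameters to your setup.

import multiprocessing
from typing import Dict, List

import requests

API_URL = "http://localhost:8000/v1/completions"  # default vLLM OpenAI-compatible endpoint

def process_batch(batch: List[str]) -> List[Dict[str, str]]:
    # Each worker only issues HTTP requests; no LLM instance (and no GPU memory) per process.
    results = []
    for prompt in batch:
        response = requests.post(API_URL, json={
            "model": "distilgpt2",
            "prompt": prompt,
            "max_tokens": 64,
            "temperature": 0.8,
            "top_p": 0.95,
        })
        results.append({
            "prompt": prompt,
            "generated_text": response.json()["choices"][0]["text"],
        })
    return results

if __name__ == '__main__':
    batches = [["Hello there", "The weather today is"]]  # replace with your own batches
    with multiprocessing.Pool() as pool:
        all_results = pool.map(process_batch, batches)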

shubh9m commented 1 month ago

Is there any way other than using the OpenAI API server, or is it possible to have a single LLM instance shared by multiple processes?

youkaichao commented 1 month ago

Why not the OpenAI API server?
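
For completeness: the simplest alternative to explicit multiprocessing is to keep a single LLM instance in one process and pass it the whole list of prompts; llm.generate batches the requests internally, so the manual batch splitting above is usually unnecessary. A sketch, reusing read_text_file from the snippet earlier in the thread:

from vllm import LLM, SamplingParams

llm = LLM(model="distilgpt2", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Pass all prompts at once; vLLM schedules and batches them internally.
texts = read_text_file('/content/prompt300.txt')
outputs = llm.generate(texts, sampling_params)

results = [
    {"prompt": o.prompt, "generated_text": ' '.join(c.text for c in o.outputs)}
    for o in outputs
]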