You can try reducing `gpu_memory_utilization`.
I have tried that. The sequential code takes only 0.3 GB, but with multiprocessing and a batch size of only 2 it takes more than 16 GB. That's the part I am unable to understand.
Make sure the GPU is not already used by another process.
It is completely free.
Can you show the code that you use to perform multiprocessing?
```python
from vllm import LLM, SamplingParams
from typing import List, Dict
import multiprocessing

def init_llm():
    return LLM(model="distilgpt2", enforce_eager=True)

class LLMPredictor:
    def __init__(self, llm):
        self.llm = llm

    def predict(self, texts: List[str]) -> List[Dict[str, str]]:
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
        outputs = self.llm.generate(texts, sampling_params)
        results = []
        for output in outputs:
            result = {
                "prompt": output.prompt,
                "generated_text": ' '.join([o.text for o in output.outputs])
            }
            results.append(result)
        return results

def process_batch(batch: List[str]) -> List[Dict[str, str]]:
    llm = init_llm()
    predictor = LLMPredictor(llm)
    return predictor.predict(batch)

def read_text_file(file_path: str) -> List[str]:
    with open(file_path, 'r') as file:
        lines = file.readlines()
    return [line.strip() for line in lines][:30]

texts = read_text_file('/content/prompt300.txt')  # Modify to your text file path

batch_size = 2
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        results = pool.map(process_batch, batches)

    # Flatten the list of results
    flat_results = [item for sublist in results for item in sublist]

    # Print results
    # for result in flat_results:
    #     print(f"Prompt: {result['prompt']}")
    #     print(f"Generated Text: {result['generated_text']}")
    #     print("=" * 50)  # Separator for readability
```
How many processes are being created in this way? Since the processes are not managed by vLLM, each process will allocate the GPU separately. So your `gpu_memory_utilization` should be at most 1 / num_processes per process, otherwise the combined usage will exceed 100%.
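For example, a minimal sketch of that constraint (the pool size and fraction here are illustrative assumptions, not values from this thread): with 2 worker processes, each process's engine has to be told to claim well under half of the card.

```python
# Sketch: cap each process's engine at roughly 1 / num_processes of the GPU.
from vllm import LLM

NUM_PROCESSES = 2  # size of the multiprocessing pool you intend to use

def init_llm() -> LLM:
    return LLM(
        model="distilgpt2",
        enforce_eager=True,
        # Leave a little headroom below the theoretical 1/NUM_PROCESSES limit.
        gpu_memory_utilization=0.9 / NUM_PROCESSES,  # ~0.45 per process
    )
```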
I have tried limiting the number of processes to 2 and setting `gpu_memory_utilization` to 0.4, but now the error is:
No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Just to make sure, can you run this with number of processes = 1 using the same code?
Yep, it is working when I set it to 1. Does vLLM have an inbuilt class for doing the same kind of multiprocessing that I am doing explicitly here?
@youkaichao imo this is somewhat unconventional usage but can you take a look at this?
This is not the recommended usage; creating an LLM instance in every process is error-prone.
I would suggest just spinning up an OpenAI API server and using web requests to get the results.
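For reference, a minimal sketch of that approach (the model name, port, prompts, and launch command are assumptions): one server process owns the GPU, and the workers only send HTTP requests to it.

```python
# Assumes an OpenAI-compatible vLLM server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model distilgpt2 --port 8000
# Only that single server process allocates GPU memory; the workers below
# just issue web requests against it.
import multiprocessing

import requests

API_URL = "http://localhost:8000/v1/completions"

def query(prompt: str) -> str:
    resp = requests.post(
        API_URL,
        json={
            "model": "distilgpt2",
            "prompt": prompt,
            "temperature": 0.8,
            "top_p": 0.95,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    prompts = ["Hello,", "The capital of France is"]
    with multiprocessing.Pool(processes=2) as pool:
        for prompt, text in zip(prompts, pool.map(query, prompts)):
            print(prompt, "->", text)
```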
Is there any way other than using the OpenAI API server, or is it possible to have a single LLM instance shared across multiple processes?
why not openai api server?
How would you like to use vllm
Even for a smaller model like "facebook/opt-125m", when I try multiprocessing (even with a batch size of 2) on a single 16GB Nvidia 4060, I encounter a CUDA out-of-memory error. When I run the same model sequentially, it works fine. Can you explain this?