Open · tonyctalope opened this issue 1 month ago
First Update:
Second Update:
Still in progress
Question that I had
Started
Update Three:
4. To test model memory consumption using FP8 quantization, we could do the following:
from vllm import LLM
import torch

# Snapshot allocator statistics before vLLM loads the model.
memory_stats_before = torch.cuda.memory_stats()

model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
llm = LLM(model=model_id, enforce_eager=True)

# Snapshot again after loading; the difference in freed bytes reflects the
# memory released at the end of vLLM's internal profiling run.
memory_stats = torch.cuda.memory_stats()
print(f"GPU allocated memory freed: {int((memory_stats['allocated_bytes.all.freed'] - memory_stats_before['allocated_bytes.all.freed']) / 1024 / 1024)} MiB")
If we don't generate any output, the "GPU allocated memory freed" corresponds to the internal profiling pass vLLM runs (model_runner.profile_run()) to determine how many KV cache blocks can be allocated without impacting model execution. We could also propose a change to the vLLM repo to log the peak_memory measured there:
https://vscode.dev/github/vllm-project/vllm/blob/main/vllm/worker/worker.py#L187
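As a rough, illustrative sketch (not vLLM's internal code: reset_peak_memory_stats() and max_memory_allocated() are plain torch APIs, and they only see allocations made in the current process), a similar peak figure could be approximated from outside like this:

import torch
from vllm import LLM

# Make sure the CUDA context / caching allocator exists in this process,
# then reset the peak counter so it only reflects what vLLM does next.
torch.zeros(1, device="cuda")
torch.cuda.reset_peak_memory_stats()

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", enforce_eager=True)

# Peak bytes allocated through the caching allocator in this process; this
# includes the spike from vLLM's internal profiling forward pass.
peak_mib = torch.cuda.max_memory_allocated() / 1024 / 1024
print(f"Peak GPU memory allocated: {peak_mib:.0f} MiB")

If vLLM ends up running the model in a separate worker process, these counters would stay at zero in the launcher, so the in-tree logging change would still be the more reliable option.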
Based on your recent commits, it looks like you found solutions to parts 1 and 2, so I'm going to ignore them.
I tried your piece of code with the following sequence of commands:
docker run --gpus all -it nvcr.io/nvidia/pytorch:23.07-py3
apt update && apt install -y python3-venv
mkdir vllm && cd vllm
python3 -m venv venv
source venv/bin/activate
pip install vllm
running your code -> KeyError: 'allocated_bytes.all.freed'
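A likely cause (not verified here) is that torch.cuda.memory_stats() returns an empty dict when the CUDA caching allocator has not recorded any allocation yet in the calling process, so the 'allocated_bytes.all.freed' key simply isn't there before (or, with an out-of-process worker, even after) loading the model. A defensive variant of the snippet, as a sketch only, could be:

import torch
from vllm import LLM

def freed_mib(stats):
    # An absent key just means nothing was allocated/freed yet in this process.
    return stats.get("allocated_bytes.all.freed", 0) / 1024 / 1024

before = torch.cuda.memory_stats()
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", enforce_eager=True)
after = torch.cuda.memory_stats()

print(f"GPU allocated memory freed: {int(freed_mib(after) - freed_mib(before))} MiB")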
Additionally, I want to clarify how vLLM works with the max_model_len parameter. Also, if you need benchmarks for specific models, let me know the metrics, models, and context length for each one, and I can provide that information.
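For reference, max_model_len can be passed straight to vllm.LLM; a minimal sketch, with 4096 chosen purely as an example value:

from vllm import LLM

# Sketch: capping the context length. A smaller max_model_len bounds how much
# KV cache a single sequence can ever occupy (4096 is an arbitrary example).
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
    max_model_len=4096,
    enforce_eager=True,
)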
Tasks:
- Add newest GPU cards
- Modify Hugging Face configuration handling
- Add newest quantization types (see the sketch below)
- Implement result verifications and tests
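On the quantization task, note that vLLM also exposes a quantization argument; for pre-quantized checkpoints like the FP8 model above it is normally detected from the Hugging Face config, so the explicit argument in this sketch is illustrative rather than required:

from vllm import LLM

# Sketch: requesting FP8 explicitly. For checkpoints that already carry a
# quantization_config, vLLM usually picks this up automatically.
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
    quantization="fp8",
    enforce_eager=True,
)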