tonyctalope / gpu_poor

Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
https://rahulschand.github.io/gpu_poor/

Up to date GPU_POOR #1

Open tonyctalope opened 1 month ago

tonyctalope commented 1 month ago

Add newest GPU cards:

Modify Huggingface configuration handling:

Add newest quantization types:

Implement result verifications and tests:

Tasks:

  1. Add support for the following GPU cards:
    • [x] H100
    • [x] H200?
    • [x] A100
    • [x] L40S
  2. Modify the configuration handling to fetch Huggingface configs via API calls (see the config-fetch sketch after this list).
    • [x] Remove local storage of Huggingface configs.
    • [x] Implement API calls to gather Huggingface configs.
  3. Add support for the fp8 quantization type.
    • [ ] Implement fp8/16/32 quantization types.
  4. Add result verifications and tests.
    • [ ] Define verification criteria.
    • [ ] Implement verifications.
    • [ ] Write and run tests to ensure result accuracy.
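
As a starting point for the API-based config handling, here is a minimal sketch (my assumption of the approach, using the standard Hugging Face Hub raw-file URL; this is not code from this repo, and gated/private models would additionally need an auth token):

import json
import urllib.request

def fetch_hf_config(model_id: str) -> dict:
    # config.json for public models is served at a stable Hub path.
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Fields the memory estimate needs (hidden size, layer count, head counts, ...).
cfg = fetch_hf_config("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
print(cfg["hidden_size"], cfg["num_hidden_layers"], cfg["num_key_value_heads"])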
juulieen commented 1 month ago

First Update:

  1. ✅ Questions I had while working on this:
    • What is the structure of the project / how does it work?
    • Where/how should I find the information about the GPUs?
  2. ✅ Questions I had while working on this:
    • Will CORS be an issue?
    • Which API should I use? The HF Hub API does not include all the config fields needed for the memory calculation.
    • Should I refactor now?
  3. Just started
  4. Not started
juulieen commented 1 month ago

Second Update:

  1. Still in progress
    Questions I had:

    • What is the current implementation of the quantization method in gpu_poor?
    • What is FP8 quantization / how does it work? I've narrowed my research to the FP8 documentation of vLLM, which mentions the following (I've decided to put aside FBGEMM FP8 for now): [screenshot of the vLLM FP8 documentation]. Despite this, I'm still not sure how it translates into the estimation calculation; that's why I've started looking into 4. A rough sketch of the dtype arithmetic is included after this list.
  2. Started

    • I've tried running the model with the vLLM library and using torch to look at the allocated memory, but it's a dead end because the allocated memory does not directly represent the memory needed to run the model. [screenshot]
    • I've also tried running the model with the vLLM server and checking the Prometheus metrics, but I still haven't found what I want :/
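
To make the fp8 question above concrete, here is a rough sketch of how the byte width per parameter would feed a weight-memory estimate (my own illustration of the arithmetic, not gpu_poor's actual formula; the 5% overhead factor is an arbitrary placeholder):

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0}

def weight_memory_gib(n_params_billion, dtype, overhead=0.05):
    # Weights only: no KV cache, activations, or CUDA context overhead.
    total_bytes = n_params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes * (1 + overhead) / 1024**3

# fp8 roughly halves the weight footprint relative to fp16.
for dt in ("fp32", "fp16", "fp8"):
    print(dt, round(weight_memory_gib(8.03, dt), 1), "GiB")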
juulieen commented 1 month ago

Update Three:

  1. To test a model's memory consumption using FP8 quantization, we could do the following:
    from vllm import LLM
    import torch
    memory_stats_before = torch.cuda.memory_stats()
    model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
    llm = LLM(model=model_id, enforce_eager=True)
    memory_stats = torch.cuda.memory_stats()
    print(f"GPU allocated memory freed: {int((memory_stats['allocated_bytes.all.freed'] - memory_stats_before['allocated_bytes.all.freed'])/1024/1024)} MiB")

    If we don't generate any output, the "GPU allocated memory freed" would correspond to the internal profiling done by vLLM's model_runner.profile_run() to determine how many KV cache blocks can be allocated without impacting model execution. We could also propose a change to the vLLM repo to log the peak_memory.

https://vscode.dev/github/vllm-project/vllm/blob/main/vllm/worker/worker.py#L187
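
For reference, a back-of-the-envelope version of that KV-block accounting (my own sketch using Llama-3-8B's published config values; vLLM's real logic lives in the worker file linked above):

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # One K and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-3-8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 KV cache.
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
print(per_token // 1024, "KiB per token")                      # 128 KiB
print(per_token * 16 / 1024 / 1024, "MiB per 16-token block")  # 2.0 MiB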

C0casio45 commented 1 month ago

What is FP8 quantization / how does it work?

You'll find a lot of information about how FP8 works here; you also have a Python example here. Feel free to check other sources, I may have missed some interesting ones...
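
A tiny self-contained illustration of the storage side (assuming PyTorch 2.1+ and its float8_e4m3fn dtype; this is not the example linked above):

import torch

w = torch.randn(4, 4, dtype=torch.float16)
w_fp8 = w.to(torch.float8_e4m3fn)   # 1 byte per element instead of 2
print(w.element_size(), w_fp8.element_size())      # prints: 2 1
# Kernels usually upcast or use scaled fp8 matmuls; here we only look at the rounding error.
print((w - w_fp8.to(torch.float16)).abs().max())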

tonyctalope commented 1 month ago

Based on your recent commits, it looks like you found solutions to parts 1 and 2, so I'm going to ignore them.


Update Three: 4. To test a model's memory consumption using FP8 quantization, we could do the following:

from vllm import LLM
import torch
memory_stats_before = torch.cuda.memory_stats()
model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
llm = LLM(model=model_id, enforce_eager=True)
memory_stats = torch.cuda.memory_stats()
print(f"GPU allocated memory freed: {int((memory_stats['allocated_bytes.all.freed'] - memory_stats_before['allocated_bytes.all.freed'])/1024/1024)} MiB")

If we don't generate any output, the "GPU allocated memory freed" would correspond to the internal profiling done by vLLM's model_runner.profile_run() to determine how many KV cache blocks can be allocated without impacting model execution. We could also propose a change to the vLLM repo to log the peak_memory.

https://vscode.dev/github/vllm-project/vllm/blob/main/vllm/worker/worker.py#L187

I tried to use your piece of code in the following order of commands:

docker run --gpus all -it nvcr.io/nvidia/pytorch:23.07-py3
apt update && apt install -y python3-venv
mkdir vllm && cd vllm
python3 -m venv venv
source venv/bin/activate
pip install vllm
running your code -> KeyError: 'allocated_bytes.all.freed'
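
A variant that sidesteps the missing 'freed' counter would be to read torch's peak-allocation counter instead (my sketch, untested; it assumes the model runs in the same process, i.e. tensor_parallel_size=1, and that we only care about torch-managed allocations):

import torch
from vllm import LLM

torch.cuda.init()                    # ensure the CUDA context (and the stats keys) exist
torch.cuda.reset_peak_memory_stats()

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", enforce_eager=True)

peak_mib = torch.cuda.max_memory_allocated() / 1024 / 1024
print(f"Peak GPU memory allocated by torch: {peak_mib:.0f} MiB")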

Additionally, I want to clarify how vLLM works:

Also, if you need benchmarks for specific models, let me know the metrics, models, and context length for each one, and I can provide you that information.