tonyctalope / gpu_poor

Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
https://rahulschand.github.io/gpu_poor/

Up to date GPU_POOR #1

Open tonyctalope opened 1 month ago

tonyctalope commented 1 month ago

Add newest GPU cards:

Modify Huggingface configuration handling:

Add newest quantization types:

Implement result verifications and tests:

Tasks:

  1. Add support for the following GPU cards:
    • [x] H100
    • [x] H200?
    • [x] A100
    • [x] L40S
  2. Modify the configuration handling to fetch Huggingface configs via API calls (see the config-fetch sketch after this list).
    • [x] Remove local storage of Huggingface configs.
    • [x] Implement API calls to gather Huggingface configs.
  3. Add support for the fp8 quantization type.
    • [ ] Implement fp8/16/32 quantization types.
  4. Add result verifications and tests.
    • [ ] Define verification criteria.
    • [ ] Implement verifications.
    • [ ] Write and run tests to ensure result accuracy.
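
As a starting point for the API-based config handling, here is a minimal sketch (my assumption of the approach, using the standard Hugging Face Hub raw-file URL; this is not code from this repo, and gated/private models would additionally need an auth token):

import json
import urllib.request

def fetch_hf_config(model_id: str) -> dict:
    # config.json for public models is served at a stable Hub path.
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Fields the memory estimate needs (hidden size, layer count, head counts, ...).
cfg = fetch_hf_config("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
print(cfg["hidden_size"], cfg["num_hidden_layers"], cfg["num_key_value_heads"])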
juulieen commented 1 month ago

First Update:

  1. ✅ Questions I had while working on this:
    • What is the structure of the project / how does it work?
    • Where/how should I find the information about the GPUs?
  2. ✅ Questions I had while working on this:
    • Will CORS be an issue?
    • Which API should I use? The HF Hub API does not include all the config fields needed for the memory calculation.
    • Should I refactor now?
  3. Just started
  4. Not started
juulieen commented 1 month ago

Second Update:

  1. Still in progress
    Questions I had:

    • What is the current implementation of the quantization method in gpu_poor?
    • What is FP8 quantization / how does it work? I've narrowed my research to the FP8 documentation of vLLM, which mentions the following (I've decided to put aside FBGEMM FP8 for now): [screenshot of the vLLM FP8 documentation]. Despite this, I'm still not sure how it translates into the estimation calculation; that's why I've started looking into 4. A rough sketch of the dtype arithmetic is included after this list.
  2. Started

    • I've tried running the model with the vLLM library and using torch to look at the allocated memory, but it's a dead end because the allocated memory does not directly represent the memory needed to run the model. [screenshot]
    • I've also tried running the model with the vLLM server and checking the Prometheus metrics, but I still haven't found what I want :/
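
To make the fp8 question above concrete, here is a rough sketch of how the byte width per parameter would feed a weight-memory estimate (my own illustration of the arithmetic, not gpu_poor's actual formula; the 5% overhead factor is an arbitrary placeholder):

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0}

def weight_memory_gib(n_params_billion, dtype, overhead=0.05):
    # Weights only: no KV cache, activations, or CUDA context overhead.
    total_bytes = n_params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes * (1 + overhead) / 1024**3

# fp8 roughly halves the weight footprint relative to fp16.
for dt in ("fp32", "fp16", "fp8"):
    print(dt, round(weight_memory_gib(8.03, dt), 1), "GiB")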
juulieen commented 1 month ago

Update Three:

  1. To test a model's memory consumption using FP8 quantization, we could do the following:
    from vllm import LLM
    import torch
    memory_stats_before = torch.cuda.memory_stats()
    model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
    llm = LLM(model=model_id, enforce_eager=True)
    memory_stats = torch.cuda.memory_stats()
    print(f"GPU allocated memory freed: {int((memory_stats['allocated_bytes.all.freed'] - memory_stats_before['allocated_bytes.all.freed'])/1024/1024)} MiB")

    If we don't generate any output, the "GPU allocated memory freed" would correspond to the internal profiling done by vLLM's model_runner.profile_run() to determine how many KV cache blocks can be allocated without impacting model execution. We could also propose a change to the vLLM repo to log the peak_memory.

https://vscode.dev/github/vllm-project/vllm/blob/main/vllm/worker/worker.py#L187
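
For reference, a back-of-the-envelope version of that KV-block accounting (my own sketch using Llama-3-8B's published config values; vLLM's real logic lives in the worker file linked above):

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # One K and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-3-8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 KV cache.
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
print(per_token // 1024, "KiB per token")                      # 128 KiB
print(per_token * 16 / 1024 / 1024, "MiB per 16-token block")  # 2.0 MiB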

C0casio45 commented 1 month ago

What is FP8 quantization / how does it work?

You'll find a lot of information about how FP8 works here; you also have a Python example here. Feel free to check other sources, I may have missed some interesting ones...
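
A tiny self-contained illustration of the storage side (assuming PyTorch 2.1+ and its float8_e4m3fn dtype; this is not the example linked above):

import torch

w = torch.randn(4, 4, dtype=torch.float16)
w_fp8 = w.to(torch.float8_e4m3fn)   # 1 byte per element instead of 2
print(w.element_size(), w_fp8.element_size())      # prints: 2 1
# Kernels usually upcast or use scaled fp8 matmuls; here we only look at the rounding error.
print((w - w_fp8.to(torch.float16)).abs().max())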

tonyctalope commented 1 month ago

Based on your recent commits, it looks like you found solutions to parts 1 and 2, so I'm going to ignore them.


Update Three: 4. To test a model's memory consumption using FP8 quantization, we could do the following:

from vllm import LLM
import torch
memory_stats_before = torch.cuda.memory_stats()
model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
llm = LLM(model=model_id, enforce_eager=True)
memory_stats = torch.cuda.memory_stats()
print(f"GPU allocated memory freed: {int((memory_stats['allocated_bytes.all.freed'] - memory_stats_before['allocated_bytes.all.freed'])/1024/1024)} MiB")

If we don't generate any output, the "GPU allocated memory freed" would correspond to the internal profiling done by vLLM's model_runner.profile_run() to determine how many KV cache blocks can be allocated without impacting model execution. We could also propose a change to the vLLM repo to log the peak_memory.

https://vscode.dev/github/vllm-project/vllm/blob/main/vllm/worker/worker.py#L187

I tried to use your piece of code in the following order of commands:

docker run --gpus all -it nvcr.io/nvidia/pytorch:23.07-py3
apt update && apt install -y python3-venv
mkdir vllm && cd vllm
python3 -m venv venv
source venv/bin/activate
pip install vllm
running your code -> KeyError: 'allocated_bytes.all.freed'
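
A variant that sidesteps the missing 'freed' counter would be to read torch's peak-allocation counter instead (my sketch, untested; it assumes the model runs in the same process, i.e. tensor_parallel_size=1, and that we only care about torch-managed allocations):

import torch
from vllm import LLM

torch.cuda.init()                    # ensure the CUDA context (and the stats keys) exist
torch.cuda.reset_peak_memory_stats()

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", enforce_eager=True)

peak_mib = torch.cuda.max_memory_allocated() / 1024 / 1024
print(f"Peak GPU memory allocated by torch: {peak_mib:.0f} MiB")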

Additionally, I want to clarify how vLLM works:

Also, if you need benchmarks for specific models, let me know the metrics, models, and context length for each one, and I can provide you that information.