Open cduk opened 3 months ago
Is there any progress?
What specific GPU did you use? It is a known issue that in-place quantization requires enough space to hold the entire unquantized weights. We don't plan on tackling this at the moment as we recommend offline quantization in the docs
I used an RTX 3090 (24 GB VRAM). I will quantize offline; that is more efficient anyway.
@cduk Yeah, it won't work like that: it will attempt to load the entire model before performing weight quantization. If you use a script like the one below, it will convert the model to W8A16 using FP8. Feel free to use it, although I think I messed it up; it seems to write two copies of the FP8 model to disk (and I really don't care to fix it because it works lol).
A simple workaround for this would be to perform layer-wise quantization on the CPU and transfer the layers from CPU to GPU at run time as they are quantized (I think that's what BNB does), so you wouldn't have to bother using this script (a rough sketch of that idea follows the script below). @mgoin
This script, instead of loading the model into GPU VRAM, loads it into CPU RAM, performs the quantization there, and then saves the result to disk.
import os
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier


def get_user_input():
    """Get model configuration from user input"""
    print("\n=== Model Quantization Configuration ===")
    while True:
        model_id = input("\nEnter the HuggingFace model ID (e.g., meta-llama/Llama-2-7b-chat-hf): ").strip()
        if model_id:
            break
        print("Model ID cannot be empty. Please try again.")
    return model_id


def quantize_model_fp8(model_id):
    """
    Quantize a model to FP8 Dynamic format using llm-compressor on CPU.

    Args:
        model_id (str): HuggingFace model ID
    """
    try:
        print(f"\nLoading model and tokenizer: {model_id}")
        model = SparseAutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="cpu",
            torch_dtype="auto"
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)

        print("\nConfiguring FP8 quantization recipe...")
        recipe = QuantizationModifier(
            targets="Linear",
            scheme="FP8_DYNAMIC",
            ignore=["lm_head"]
        )

        print("\nApplying quantization (this may take a while)...")
        oneshot(model=model, recipe=recipe)

        model_name = model_id.split("/")[-1]
        save_dir = f"{model_name}-FP8-Dynamic"
        print(f"\nSaving quantized model to: {save_dir}")
        model.save_pretrained(save_dir, save_compressed=True)
        tokenizer.save_pretrained(save_dir)

        print("\n✅ Quantization completed successfully!")
        print(f"📁 Quantized model saved to: {os.path.abspath(save_dir)}")
        return save_dir

    except Exception as e:
        print(f"\n❌ Error during quantization: {str(e)}")
        return None


if __name__ == "__main__":
    print("""
    ╔══════════════════════════════════════╗
    ║      Model Quantization to FP8       ║
    ║         (Dynamic Per-Token)          ║
    ╚══════════════════════════════════════╝
    """)

    model_id = get_user_input()

    print("\n=== Configuration Summary ===")
    print(f"Model ID: {model_id}")
    print("Quantization Type: FP8 Dynamic (per-token)")
    print("Device: CPU")

    while True:
        confirm = input("\nProceed with quantization? (y/n): ").lower().strip()
        if confirm in ['y', 'n']:
            break
        print("Please enter 'y' for yes or 'n' for no.")

    if confirm == 'y':
        quantized_model_path = quantize_model_fp8(model_id)
    else:
        print("\nQuantization cancelled.")
Hi @NicolasMejiaPetit, we have native support for CPU offloading using accelerate. It can handle loading the model across multiple GPUs or the CPU. Please see the example here: https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate
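For reference, a minimal sketch of what that looks like, adapted from the FP8 script above; device_map="auto" loading comes from transformers/accelerate, the model ID is an illustrative choice, and the exact llm-compressor import paths may differ between versions.

from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Nemo-Instruct-2407"  # illustrative

# device_map="auto" lets accelerate place layers on the available GPU(s)
# and offload the remainder to CPU RAM during quantization.
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)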
Oh that's awesome, it took a good minute to quantize Qwen2.5 Coder 32B on a 2016 Intel CPU lol.
I only did FP8 since it doesn't require calibration, so I bet I'll be able to do full INT8 with calibration too.
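For that INT8 follow-up, a hedged sketch based on llm-compressor's W8A8 examples; the modifier name, calibration dataset id, and oneshot arguments follow those examples and may differ in other versions, and the model ID is illustrative.

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"  # illustrative

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

# Unlike FP8 dynamic, INT8 W8A8 needs calibration data to derive activation scales.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",        # small built-in calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen2.5-Coder-32B-Instruct-W8A8", save_compressed=True)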
Your current environment
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.35

Python version: 3.12.1 | packaged by Anaconda, Inc. | (main, Jan 19 2024, 15:51:05) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: AuthenticAMD
Model name: QEMU Virtual CPU version 2.5+
CPU family: 15
Model: 107
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Stepping: 1
BogoMIPS: 6986.87
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB (6 instances)
L1i cache: 384 KiB (6 instances)
L2 cache: 3 MiB (6 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy==1.5.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.0.2
[pip3] torch==2.3.0
[pip3] transformers==4.42.3
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.0.2 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	X	0-5	0	N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
There is a regression that was introduced sometime between roughly four weeks ago and one week ago.
Running the Mistral-Nemo model with FP8 quantization using these arguments:
--model mistralai/Mistral-Nemo-Instruct-2407 --max-model-len 8192 --gpu-memory-utilization 0.7 --quantization fp8 --enable-prefix-caching --enforce-eager
This previously worked using only around 16 GB of VRAM.
However, there has since been a regression that leads to an out-of-memory error even with 24 GB:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 23.67 GiB of which 37.88 MiB is free. Process 258755 has 23.63 GiB memory in use. Of the allocated memory 23.28 GiB is allocated by PyTorch, and 51.00 MiB is reserved by PyTorch but unallocated.
It appears to happen during Marlin weight re-packing:
marlin_qweight = ops.gptq_marlin_repack(b_q_weight=pack_fp8_to_int32( File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py"
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 39, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 182, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 880, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 361, in load_model
    quant_method.process_weights_after_loading(module)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 239, in process_weights_after_loading
    prepare_fp8_layer_for_marlin(layer)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 67, in prepare_fp8_layer_for_marlin
    marlin_qweight = ops.gptq_marlin_repack(b_q_weight=pack_fp8_to_int32(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 102, in pack_fp8_to_int32
    (byte_tensor[:, 1].to(torch.int32) << 8) |
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 23.67 GiB of which 37.88 MiB is free. Process 258755 has 23.63 GiB memory in use. Of the allocated memory 23.28 GiB is allocated by PyTorch, and 51.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
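For context on why this step needs extra memory: judging from the line shown in the traceback, pack_fp8_to_int32 packs four FP8 bytes into each int32 with shifts and ORs, which materializes int32 temporaries several times larger than the FP8 weight itself. A rough sketch of that packing idea (not vLLM's exact implementation; the byte order is a guess):

import torch

def pack_fp8_to_int32_sketch(fp8_weight: torch.Tensor) -> torch.Tensor:
    """Illustrative packing of four FP8 values per int32, as the shift-and-or
    in the traceback suggests."""
    # Reinterpret the FP8 weight as raw bytes, grouped four per output element.
    byte_tensor = fp8_weight.view(torch.uint8).reshape(-1, 4)
    # Each .to(torch.int32) allocates a temporary four times the size of its
    # byte column, which is roughly where the extra VRAM goes at load time.
    return (byte_tensor[:, 0].to(torch.int32)
            | (byte_tensor[:, 1].to(torch.int32) << 8)
            | (byte_tensor[:, 2].to(torch.int32) << 16)
            | (byte_tensor[:, 3].to(torch.int32) << 24))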