mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

HQQ OOMs on large models #29

Closed · rationalism closed this 5 months ago

rationalism commented 6 months ago

Hey, I have a machine with two 4090 GPUs (24 GB VRAM each). When I try to run HQQ quantization of Llama-2-70B:

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Model and settings
model_id      = 'meta-llama/Llama-2-70b-chat-hf'
compute_dtype = torch.float16
device        = 'cuda:0'

#Load model on the CPU
######################
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Quantize the model
######################
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device)

The first half of the layers seems to quantize fine, but then it OOMs, presumably because it tries to put the entire quantized model on a single GPU. For Llama-2-70B I could rent an A100 machine and that should work, but for even larger models (e.g. Grok-1) it would be impossible to fit the entire thing on a single GPU. Is splitting quantization across multiple GPUs supported, or planned to be supported in the future? Thanks :)

mobicham commented 6 months ago

Hi @rationalism, yeah, unfortunately automatically loading across multiple GPUs is not supported. Maybe you can try:

Otherwise, I can take a stab at it and see how to do it on 2 GPUs, or more generally how to do it automatically.

Minami-su commented 6 months ago

> Hey, I have a machine with two 4090 GPUs (24 GB VRAM each). When I try to run HQQ quantization of Llama-2-70B [...] it OOMs [...] Is splitting quantization across multiple GPUs supported, or planned to be supported in the future?

Same problem.

rationalism commented 5 months ago

@mobicham Thanks. With larger models like DBRX coming out this year, I think being able to split across multiple GPUs will be an important feature for handling them.

https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

Sneakr commented 5 months ago

@mobicham I have an RTX 4090 and 128 GB of RAM. Is it possible to load the original Mixtral Instruct and quantize it using HQQ? Currently my script gets killed while loading, as in the example file. I suppose I need to use the already-quantized Mixtral you linked (the 2-bit/4-bit one), right?

Sneakr commented 5 months ago

I managed to solve it by increasing the WSL memory allocation and the swap file size, nice! :)

mobicham commented 5 months ago

Yeah, increasing the swap should do it, but it's going to be slow. Otherwise, you can use this branch of transformers, which supports on-the-fly loading and HQQ quantization so you don't need a lot of RAM: https://github.com/huggingface/transformers/pull/29637/ Soon it will be integrated into transformers and you won't face this memory issue; I just need to fix a couple of things for the pull request.
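
With that branch (or a later transformers release that includes it), the on-the-fly path would look roughly like the sketch below. HqqConfig and its arguments follow the eventual transformers-side API and are an assumption with respect to that branch; layers are quantized as they are loaded, so the full fp16 model never has to sit in RAM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig  # assumes a build that includes PR 29637

model_id = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

#4-bit HQQ settings, applied on the fly while loading
quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='cuda:0',
    quantization_config=quant_config,
)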

catid commented 5 months ago

Would love to be able to actually use this model lol: https://huggingface.co/catid/cat-llama-3-70b-hqq

Need support for device_map="auto"

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id  = 'catid/cat-llama-3-70b-hqq'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

(hqq) ➜  openai-hqq-server git:(main) ✗ python test.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Failed to load the weights
Traceback (most recent call last):
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/hqq/models/base.py", line 328, in from_quantized
    weights = cls.load_weights(save_dir)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/hqq/models/base.py", line 195, in load_weights
    return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1026, in load
    return _load(opened_zipfile,
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1438, in _load
    result = unpickler.load()
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1408, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1382, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 391, in default_restore_location
    result = fn(storage, location)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 271, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/_utils.py", line 115, in _cuda
    untyped_storage = torch.UntypedStorage(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 64.81 MiB is free. Including non-PyTorch memory, this process has 23.57 GiB memory in use. Of the allocated memory 23.19 GiB is allocated by PyTorch, and 9.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

mobicham commented 5 months ago

Yeah, I'm aware of this. There should be a simple fix, but I've been very busy with other things. I hope I'll have time to take a look at it in the next few days. Sorry for the delay!

mobicham commented 5 months ago

You can now shard quantized models across multiple GPUs. Just pass the devices as a list, like this:

model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=['cuda:0', 'cuda:1'])

You still need to have the main model on the CPU before quantizing. I'll see how to dynamically dispatch directly to the GPUs.
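
Putting the pieces together, the full two-GPU flow would look roughly like this (same model and settings as the original report, with the new device-list argument; a sketch rather than a verified run):

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id      = 'meta-llama/Llama-2-70b-chat-hf'
compute_dtype = torch.float16

#Load the full-precision model on the CPU first
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Quantize and shard the layers across both GPUs by passing a device list
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config,
                     compute_dtype=compute_dtype,
                     device=['cuda:0', 'cuda:1'])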

catid commented 5 months ago

That worked thanks, just in time! https://huggingface.co/catid/cat-llama-3-70b-san66-hqq

mobicham commented 5 months ago

@catid Making it work with from_quantized would require some additional work. But if you quantize directly, it should work fine, as long as it's an official HF model that follows the same layer-naming logic.

mobicham commented 5 months ago

Closing this since HQQ is now integrated with transformers: https://github.com/huggingface/transformers/pull/29637