I see, that's actually a bug, thanks for reporting, I will fix this soon. As a workaround, after you load and before you apply the patching, do this:
```Python
from hqq.utils.patching import patch_linearlayers, patch_add_quant_config

# where quant_config is the quant config you used to quantize the model
patch_linearlayers(model, patch_add_quant_config, quant_config)
```
Your quant settings are also not correct for that backend: as the documentation says, you need to use `axis=1`. It will not use the faster backend if you feed it the default `axis=0`. Try:
```Python
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
```
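Putting the workaround and the `axis=1` config together, the load path looks roughly like the sketch below. This is only a sketch assembled from the calls quoted in this thread: the save directory, the import paths, and the `backend="torchao_int4"` argument to `prepare_for_inference` are assumptions, not verified code.
```Python
# Sketch only: save path, import paths, and backend argument are assumptions.
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.utils.patching import patch_linearlayers, patch_add_quant_config, prepare_for_inference

# Same config that was used at quantization time (axis=1 for the faster backend)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)

# Load the saved quantized model
model = HQQModelForCausalLM.from_quantized("./quantized_mixtral_huge/", compute_dtype=torch.bfloat16)

# Workaround: re-attach the quant config to the linear layers before patching
patch_linearlayers(model, patch_add_quant_config, quant_config)

# Now the backend patching should succeed
prepare_for_inference(model, backend="torchao_int4")
```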
Oh, thanks for the help!
I think this should fix it: https://github.com/mobiusml/hqq/commit/3a62b11a82f1ab81b3f902a224ac11cdc2cbd1ab
Let me know if you still have the same issue
@mobicham this is tangentially related, but the new quantization config caused the model's size to increase from 73 to 75GB, which means it no longer fits on a single A100, so I was trying to use 2 A6000s. Is that possible to do in HQQ? I tried passing `device_map`, but it seems that, unlike the HuggingFace version, HQQ doesn't support that.
```
Traceback (most recent call last):
  File "/home/rohitg/vision_llm/scratch/infer_saved_llm.py", line 11, in <module>
    model = HQQModelForCausalLM.from_quantized("./quantized_mixtral_huge/", device_map='auto', compute_dtype=torch.bfloat16)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: HQQWrapper.from_quantized() got an unexpected keyword argument 'device_map'
```
Oh I see: that's because by default it quantizes the meta-data, and the settings I shared turned that off. You have two options:
Option 1: re-quantize with smaller per-layer settings (4-bit attention, 3-bit experts, `axis=0` with the default meta-data quantization):
```Python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

axis = 0

attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, axis=axis)
experts_params = BaseQuantizeConfig(nbits=3, group_size=64, axis=axis)

quant_config = {}
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

from hqq.utils.patching import prepare_for_inference
HQQLinear.set_backend(HQQBackend.ATEN if (axis == 0) else HQQBackend.PYTORCH)
prepare_for_inference(model)

torch.compile(...)
```
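For completeness, a per-layer dict like the one above is what gets passed to `quantize_model` (the same call shown in Option 2 below), before the `prepare_for_inference` / `torch.compile` step. A minimal sketch, where the model id is only a placeholder:
```Python
# Minimal sketch: the model id below is a placeholder, not from this thread.
import torch
from hqq.engine.hf import HQQModelForCausalLM

model = HQQModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model.quantize_model(quant_config=quant_config, compute_dtype=torch.bfloat16, device='cuda:0')
```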
With settings like this, you'd expect a drop of about 1-1.5 points in performance: https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
Option 2: multi-GPU. Just pass `device=['cuda:0', 'cuda:1']` here:
```Python
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=['cuda:0', 'cuda:1'])
```
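Spelled out end to end, the multi-GPU path would look roughly like the sketch below; the model id, the simple 4-bit config, and the `save_quantized()` call are assumptions based on the HQQ examples rather than something confirmed in this thread.
```Python
# Rough multi-GPU sketch; model id, config, and save_quantized() are assumptions.
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id      = "mistralai/Mixtral-8x7B-Instruct-v0.1"
compute_dtype = torch.bfloat16
quant_config  = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

model = HQQModelForCausalLM.from_pretrained(model_id)

# Shard the quantized layers across both GPUs
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype,
                     device=['cuda:0', 'cuda:1'])

# Save the quantized weights (reload later with from_quantized())
model.save_quantized("./quantized_mixtral_huge/")
```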
You can also do it with transformers directly (pip install git+https://github.com/huggingface/transformers.git); for that you need to use `HqqConfig`, as explained here: https://huggingface.co/docs/transformers/main/en/quantization#hqq
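For reference, the transformers route looks roughly like this minimal sketch; the model id is a placeholder and the exact `HqqConfig` arguments should be checked against the linked docs.
```Python
# Sketch of the transformers + HqqConfig route; model id is a placeholder and
# the HqqConfig arguments should be checked against the linked documentation.
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",                 # let transformers spread layers over the 2x A6000s
    quantization_config=quant_config,
)
```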
I think multi-gpu runtime with the hqq lib is faster than transformers, at least with the models I tried. Let me know!
So, a question about option 2: can I utilize multi-GPU with quantized weights that have already been saved?
`HQQModelForCausalLM.from_quantized(device=['cuda:0', 'cuda:1'])` results in errors:
File "/home/rohitg/vision_llm/scratch/infer_saved_llm.py", line 11, in <module>
model = HQQModelForCausalLM.from_quantized("./quantized_mixtral_huge/", device=['cuda:0', 'cuda:1'], compute_dtype=torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/engine/base.py", line 86, in from_quantized
model = cls._get_hqq_class(arch_key).from_quantized(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/models/base.py", line 469, in from_quantized
cls.patch_model(
File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/models/base.py", line 185, in patch_model
cls.patch_nonlinearlayers(model, patch_nonlinear_fct, verbose=verbose)
File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/models/hf/mixtral.py", line 26, in patch_nonlinearlayers
model.lm_head = patch_fct(model.lm_head) ###
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/c3-0/rohitg/mforge/envs/quantllm/lib/python3.11/site-packages/hqq/models/base.py", line 459, in _load_module
state_dict[key].to(
TypeError: to() received an invalid combination of arguments - got (non_blocking=bool, dtype=torch.dtype, device=list, ), but expected one of:
* (torch.device device, torch.dtype dtype, bool non_blocking, bool copy, *, torch.memory_format memory_format)
* (torch.dtype dtype, bool non_blocking, bool copy, *, torch.memory_format memory_format)
* (Tensor tensor, bool non_blocking, bool copy, *, torch.memory_format memory_format)
```
@rohit-gupta multi-gpu was only implemented for the quantization call. Let me see how to add that to the `from_quantized` call.
@rohit-gupta I created a separate issue for this since it's not related to the original thread: https://github.com/mobiusml/hqq/issues/71 . I will give it a try tomorrow.
Basically, when I quantize a model and patch it to use `torchao_int4` ops, it works, but if I then save this model and load it again, the patching fails. Am I doing something wrong? I have been trying to follow the instructions.
This works:
Output:
However, when I then try to load the quantized and saved model the patching step fails: