neuralmagic / AutoFP8


Quantization of Mixtral 8x22B #13

Closed · nickandbro closed this 3 months ago

nickandbro commented 3 months ago

I am quantizing a model using this code:

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "./Mixtral-8x22B-Instruct-v0.1"
quantized_model_dir = "./Mixtral-8x22B-Instruct-v0.1_dynamic_fp8"

# Dynamic activation scaling, so no calibration examples are needed
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
examples = []

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

but receive this error:

Loading model with the following kwargs: {'torch_dtype': 'auto', 'device_map': 'auto', 'cache_dir': None, 'force_download': False, 'proxies': None, 'resume_download': False, 'local_files_only': False, 'use_auth_token': None, 'revision': None, 'subfolder': '', '_commit_hash': None}
Loading checkpoint shards: 100%|█████| 59/59 [00:44<00:00,  1.33it/s]
Traceback (most recent call last):
  File "/data1/darksaber/niprgpt/models_nick/quantization.py", line 13, in <module>
    model.quantize(examples)
  File "/data1/darksaber/niprgpt/models_nick/AutoFP8/auto_fp8/modeling.py", line 107, in quantize
    quantize_weights(self.model, self.quantize_config)
  File "/data1/darksaber/niprgpt/models_nick/AutoFP8/auto_fp8/quantize.py", line 204, in quantize_weights
    quant_weight, quant_scale = per_tensor_quantize(linear.weight)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/darksaber/niprgpt/models_nick/AutoFP8/auto_fp8/quantize.py", line 58, in per_tensor_quantize
    qweight = qweight.to(torch.float8_e4m3fn)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 

Does AutoFP8 only support Llama-based models? Having support for Mixtral 8x22B would really help my team. Thank you for looking into this!

mgoin commented 3 months ago

Hi @nickandbro, Mixtral should work fine, although I would recommend skipping quantization of the gate layers, since we don't quantize those in vLLM.

I have an example here for Mixtral 8x7B with the proper layers skipped: https://github.com/neuralmagic/AutoFP8/blob/main/examples/example_mixtral.py
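The relevant idea, as a minimal sketch (the ignore_patterns argument and the "re:" pattern syntax follow the linked example and may differ between AutoFP8 versions):

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Sketch: skip lm_head and the MoE router ("gate") layers, which vLLM
# keeps unquantized anyway. Argument names follow the linked example and
# may differ across AutoFP8 versions.
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
    ignore_patterns=["re:.*lm_head", "re:.*gate"],
)

model = AutoFP8ForCausalLM.from_pretrained("./Mixtral-8x22B-Instruct-v0.1", quantize_config)
model.quantize([])  # dynamic scheme: no calibration samples required
model.save_quantized("./Mixtral-8x22B-Instruct-v0.1_dynamic_fp8")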

How much GPU memory do you have available? For Mixtral 8x22B you need at least 350 GB of GPU memory to load it in BF16, so I wouldn't use anything less than an 8xA100 or 8xH100 system. You could try forcing the model onto CPU memory if you don't have such a system available (a sketch of one way to do that is below).
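One rough way to do that, not something AutoFP8 documents itself: hide the GPUs so the device_map="auto" used at load time resolves to CPU; quantization then runs on CPU, which is slow but avoids GPU OOM.

import os

# Hide the GPUs before torch/transformers are imported so that
# device_map="auto" places the whole model in CPU memory.
# (Assumption: the FP8 cast works on CPU with a recent PyTorch; this is
# much slower than running on GPU.)
os.environ["CUDA_VISIBLE_DEVICES"] = ""

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
model = AutoFP8ForCausalLM.from_pretrained("./Mixtral-8x22B-Instruct-v0.1", quantize_config)
model.quantize([])
model.save_quantized("./Mixtral-8x22B-Instruct-v0.1_dynamic_fp8")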

nickandbro commented 3 months ago

@mgoin Thanks for the fast response!

I am using an 8xL40S (48 GB each) machine, which as I understand it should be 384 GB of VRAM. I tried that config with Mixtral 8x22B swapped in but still get the error. Unfortunately I do not have access to an H100, but I could rent one from Lambda if I can then load the FP8 model on my own machine afterwards.

mgoin commented 3 months ago

@nickandbro I can make the model for you in the meantime (the weights are downloading now), but I do see that you loaded the checkpoint into memory fine and are only hitting OOM while performing the weight quantization. Please try this PR, which performs GC after each weight tensor is replaced, and see if it makes a difference: https://github.com/neuralmagic/AutoFP8/pull/14
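A rough illustration of the idea behind that PR (not the exact diff; the attribute names below are illustrative):

import gc
import torch
from auto_fp8.quantize import per_tensor_quantize  # helper shown in the traceback above

# Free the original BF16 weight and release cached CUDA blocks right after
# each tensor is quantized, so peak memory stays near one layer's overhead.
# The weight_scale attribute name here is illustrative only.
def quantize_linear_inplace(linear: torch.nn.Linear) -> None:
    quant_weight, quant_scale = per_tensor_quantize(linear.weight)
    del linear.weight                 # drop the BF16 tensor immediately
    linear.weight = torch.nn.Parameter(quant_weight, requires_grad=False)
    linear.weight_scale = quant_scale
    gc.collect()
    torch.cuda.empty_cache()          # return freed blocks to the allocator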

nickandbro commented 3 months ago

@mgoin

Thank you! I'll let you know whether I still hit the error when I get back to my rig. If you do manage to put up a quantized FP8 checkpoint that uses dynamic scaling, it would be much appreciated!

mgoin commented 3 months ago

@nickandbro you can find a dynamic-scaling FP8 quantized checkpoint here, enjoy! https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic
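For reference, a minimal sketch of serving that checkpoint with vLLM (assuming a recent vLLM that picks up the FP8 quantization config from the checkpoint; tensor_parallel_size=8 is just an example value for an 8-GPU node):

from vllm import LLM, SamplingParams

# Example values only: tensor_parallel_size should match your GPU count.
llm = LLM(
    model="neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic",
    tensor_parallel_size=8,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)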

I landed the above PR since it seemed to help with peak memory a bit.