Closed. @nickandbro closed this issue 3 months ago.
Hi @nickandbro, Mixtral should work fine, although I would recommend skipping quantization of the gate layers, since we don't quantize those in vLLM.
I have an example here for Mixtral 8x7B with the proper layers skipped: https://github.com/neuralmagic/AutoFP8/blob/main/examples/example_mixtral.py
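To illustrate what "skipping the gate layers" means in practice, here is a minimal sketch of matching layer names against ignore patterns in the `re:`-prefixed style used by the linked example. The layer names and the `should_skip` helper are illustrative assumptions, not AutoFP8's actual internals.

```python
import re

# Hypothetical module names following Mixtral's naming in HF Transformers.
layer_names = [
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]

# Patterns in the "re:" style: any matching layer is left unquantized.
ignore_patterns = ["re:.*lm_head", "re:.*gate"]

def should_skip(name, patterns):
    """Return True if the layer name matches any ignore pattern."""
    for p in patterns:
        regex = p.removeprefix("re:")
        if re.match(regex, name):
            return True
    return False

to_quantize = [n for n in layer_names if not should_skip(n, ignore_patterns)]
print(to_quantize)
# The MoE router ("...gate") and lm_head are skipped; the expert and
# attention projections are quantized, matching what vLLM expects.
```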
How much GPU memory do you have available? For Mixtral 8x22B you need at least 350GB of GPU memory to load it in BF16, so I wouldn't use anything less than an 8xA100 or 8xH100 system. You could try offloading the model to CPU memory if you don't have such a system available.
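As a rough sanity check on that 350GB figure (the ~141B parameter count for Mixtral 8x22B is an assumed, commonly cited number):

```python
# Back-of-envelope memory estimate for loading Mixtral 8x22B in BF16.
params_billion = 141        # assumed total parameter count, in billions
bytes_per_param = 2         # bfloat16 stores each parameter in 2 bytes

weights_gb = params_billion * bytes_per_param  # GB for the weights alone
print(f"weights alone: ~{weights_gb} GB")      # ~282 GB

# Add headroom for activations, per-GPU buffers, and allocator
# fragmentation during quantization, and you land near the ~350 GB
# figure quoted above.
```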
@mgoin Thanks for the fast response!
I am using an 8xL40S (48GB each) machine, which as I understand it should give 384GB of VRAM. I tried using that config with Mixtral 8x22B swapped in, but I still get the error. Unfortunately I do not have access to an H100, but I could rent one from Lambda if I can then load the FP8 model on my machine.
@nickandbro I can make the model for you for now (the weights are downloading). That said, it looks like you loaded the checkpoint into memory fine and are only hitting OOM while performing the weight quantization. Please try this PR to see if it makes a difference; it performs GC after each weight tensor is replaced: https://github.com/neuralmagic/AutoFP8/pull/14
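The idea behind that PR can be sketched in pure Python. All names here (`modules`, `quantize_weight`) are hypothetical stand-ins, not AutoFP8's actual API: the point is simply that the old tensor is dropped and collected immediately after each replacement, so peak memory stays near one layer's worth of overhead instead of accumulating.

```python
import gc

def quantize_in_place(modules, quantize_weight):
    """Replace each weight with its quantized version, collecting garbage
    after every swap to keep peak memory low."""
    for name, module in modules.items():
        old = module["weight"]
        module["weight"] = quantize_weight(old)
        del old        # drop the last reference to the original tensor
        gc.collect()   # reclaim it now rather than at some later point
        # With real GPU tensors you would also release the CUDA caching
        # allocator's freed blocks here (e.g. torch.cuda.empty_cache()).

# Toy stand-in for a model: plain lists instead of BF16 tensors.
modules = {"layer0": {"weight": [1.0, 2.0]}, "layer1": {"weight": [3.0, 4.0]}}
quantize_in_place(modules, lambda w: [round(x) for x in w])
print(modules)
```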
@mgoin
Thank you! I'll send you my error when I get back to my rig. If you do manage to publish an FP8 checkpoint quantized with dynamic scaling, it would be much appreciated!
@nickandbro you can find a dynamically scaled FP8 checkpoint here, enjoy! https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic
I landed the above PR since it seemed to help with peak memory a bit.
I am quantizing a model using this code:
but receive this error:
Does AutoFP8 only support Llama-based models? Having support for Mixtral 8x22B would really help my team. Thank you for looking into this!