Open robertgshaw2-neuralmagic opened 1 month ago
A gemma-2-27b-it in 8 bits for both a100 and h100 would be nice. I tried to produce them myself but the resulting checkpoints return NaNs when loaded into vLLM.
Thanks - looking for fp8 for H100 and int8 for A100?
Exactly!
Can you share more about the issue you were seeing?
I'm getting empty generations and unserializable logits, indicating NaNs in model outputs.
I used practically the same recipe as in the Llama-3.1-70b-Instruct-FP8 quant:
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""
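For readers unfamiliar with the `strategy: tensor` and `symmetric: true` settings in the recipe above, here is a minimal sketch of what per-tensor symmetric quantization means. It is an illustration only, not llm-compressor's implementation: it uses an 8-bit integer grid instead of the FP8 grid selected by `type: float`, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_per_tensor_symmetric(values: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Quantize with a single scale for the whole tensor and a zero point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in values)
    # One scale for the whole tensor ("strategy: tensor"); fall back to 1.0 for an all-zero tensor.
    scale = amax / qmax if amax > 0 else 1.0
    # Symmetric: the grid is centered on zero, so only a scale is needed (no zero point).
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integer codes back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_per_tensor_symmetric(weights)
restored = dequantize(codes, scale)
print(codes, scale, restored)
```

With `dynamic: false`, the scale is computed once ahead of time (from the weights, or from calibration data for activations) rather than recomputed per input.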
Could be a FlashInfer issue. I'll work on an example for you.
Hi @robertgshaw2-neuralmagic , could we get an update to https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8 ? The main model https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1 had its tokenizer updated recently and it would be great to incorporate these into the quantized model.
Hi ! A phi-3-vision would be very nice in FP8 (ideally with k/v scales) Thanks in advance !
Absolutely. @Lin-K76 - could you update this when you have a chance this week?
We can take a look at this, adding support for Vision models is on our roadmap but we need to try it out a bit more.
@BlackSamorez - I made a couple examples with gemma2 for you (https://github.com/vllm-project/llm-compressor/pull/78)
Note: gemma2 has been a bit unstable in vllm due to the soft capping on the logits. We are stabilizing this as part of the current release process.
Here are install instructions on the vllm side:
export VLLM_VERSION=0.5.4
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
pip install lm_eval==0.4.3
pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.4-cp310-cp310-linux_x86_64.whl
Eval fp16:
MODEL=google/gemma-2-27b-it
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=$MODEL,add_bos_token=true --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size "auto"
vllm (pretrained=google/gemma-2-27b-it,add_bos_token=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.864|± |0.0217|
| | |strict-match | 5|exact_match|↑ |0.848|± |0.0228|
Eval fp8 (made with the script):
MODEL=gemma-2-27b-it-FP8-Dynamic
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=$MODEL,add_bos_token=true --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size "auto"
vllm (pretrained=gemma-2-27b-it-FP8-Dynamic,add_bos_token=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.856|± |0.0222|
| | |strict-match | 5|exact_match|↑ |0.852|± |0.0225|
The strict-match score (the one that matters) is not impacted: 0.848 for fp16 vs. 0.852 for fp8, well within the reported standard error. This shows the fp8 quantization is working.
We will push a model up to the hub later this week once we have a chance to QA it.
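The fp16 vs. fp8 comparison above can be checked with a quick calculation. This is a rough sketch that treats the two runs as independent samples and combines their standard errors; both runs score the same 250 questions, so a paired test would be more precise. The helper name and the 1.96 threshold are assumptions for illustration.

```python
import math


def within_noise(score_a: float, stderr_a: float, score_b: float, stderr_b: float, z: float = 1.96):
    """Return the score difference, the combined standard error, and whether |diff| <= z * combined."""
    diff = score_b - score_a
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return diff, combined, abs(diff) <= z * combined


# Values copied from the lm_eval tables above (fp16 first, fp8 second).
strict = within_noise(0.848, 0.0228, 0.852, 0.0225)
flexible = within_noise(0.864, 0.0217, 0.856, 0.0222)

for name, (diff, combined, ok) in (("strict-match", strict), ("flexible-extract", flexible)):
    print(f"{name}: diff={diff:+.3f}, combined stderr={combined:.4f}, within noise={ok}")
```

Both differences are much smaller than the combined standard error, so neither metric shows a measurable change at this sample size.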
Hi, the new model is now live at https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8.
Thanks @Lin-K76 !
Qwen2 series in marlin24 format. I'm having trouble generating models (0.5B and 72B) with proper output; I'm getting NaN logits. Config in https://github.com/vllm-project/llm-compressor/issues/54.
Oneshot with 2:4 sparse or GPTQ alone is fine, but not both. Do I need to change my calibration dataset or GPTQ config?
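For context on the 2:4 pattern mentioned above: in every contiguous group of four weights, at most two are nonzero. The sketch below shows a simple magnitude-based way to get there. It is an illustration only; the function name is hypothetical, and this is not the pruning algorithm llm-compressor applies.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in each contiguous group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = set(sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2])
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.3, 0.05, 1.0, 2.0, -3.0, 0.5]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```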
Thanks @yzlnew, I will take a look.
My suggestion though would be to use the W8A8 (int8 on ampere / fp8 on hopper) for production use cases as this will give you the best recovery and performance right now.
We are still working on making sparsity better. I will work on a demo for you later this week though :)
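The "int8 on ampere / fp8 on hopper" rule of thumb above can be written as a small helper keyed on CUDA compute capability. This is a sketch under stated assumptions: the function name is hypothetical, and the (8, 9) cutoff for hardware FP8 is an assumption not stated in this thread. A100 reports capability (8, 0) and H100 reports (9, 0).

```python
from typing import Tuple


def pick_w8a8_scheme(compute_capability: Tuple[int, int]) -> str:
    """Pick an 8-bit weight/activation scheme from a (major, minor) compute capability."""
    # Assumption: GPUs at capability 8.9 and above have hardware FP8 support.
    if compute_capability >= (8, 9):
        return "fp8"
    # Older GPUs such as the A100 fall back to int8.
    return "int8"


print(pick_w8a8_scheme((8, 0)))  # A100
print(pick_w8a8_scheme((9, 0)))  # H100
```

On a machine with PyTorch and a GPU, the tuple can be obtained from `torch.cuda.get_device_capability()`.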
the Hermes 3 70b in int4 could be very great!
neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great!
How about DeepSeek-Coder-V2-Instruct in W8A8 (INT8)? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great!
Or are there any instructions to help me quantize DeepSeek-Coder-V2-Instruct to W8A8 (INT8)?
Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.
Hi, can I please ask for a gemma-2-27b-int8? It's a good fit for 48GB cards and I'd love to run it with vLLM. Many quantization methods seem broken for this model unfortunately... would really appreciate it!
DeepSeek-Coder-V2-Instruct in W4A16 would be great! Looking forward to your model release.
I tried to quantize deepseek-coder-v2 to w4a16, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.
What is your transformers version?
Also - note that quantization support for MoEs is still under construction in vllm.
Do you mean PR #7766 for W4A16? @robertgshaw2-neuralmagic
I see, I forgot to set trust_remote_code=True.
#7766: yes
I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".
I tried to quantize llama2-7b to w8a8, but it's too slow. I want to know the reason.
Are you running on a CPU?
This usually means you’re running out of CPU memory. This is a big model … how much CPU RAM and GPU RAM do you have?
About 2T of CPU memory and 640G of GPU memory. Could you please tell me how to properly set the device_map parameter?
I tried quantizing deepseek-coder-v2-instruct using 8 A100 80G GPUs. To avoid OOM, I set memory_limits to 35G.
When it reached the 32nd layer during quantization, the speed suddenly slowed down. I suspect that this portion of the parameters was loaded to the CPU, causing the slowdown. But why is it even slower than loading everything to the CPU?
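The suspicion above can be sanity-checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not how accelerate actually places modules. It assumes roughly 236B parameters stored at 2 bytes each, 60 equally sized decoder layers, and no reserved headroom; none of these figures come from this thread, and the function name is hypothetical.

```python
def estimate_gpu_resident_layers(
    num_gpus: int,
    per_gpu_gb: float,
    total_params: int,
    bytes_per_param: int,
    num_layers: int,
) -> int:
    """Estimate how many equally sized layers fit inside the combined GPU memory cap."""
    gpu_budget_gb = num_gpus * per_gpu_gb
    model_gb = total_params * bytes_per_param / 1e9
    fraction_on_gpu = min(1.0, gpu_budget_gb / model_gb)
    return int(fraction_on_gpu * num_layers)


# Assumed figures: 8 GPUs capped at 35 GB each, 236B parameters, 2 bytes per parameter, 60 layers.
layers_on_gpu = estimate_gpu_resident_layers(8, 35, 236_000_000_000, 2, 60)
print(f"Estimated layers resident on GPU: {layers_on_gpu} of 60")
```

With these assumed numbers the estimate lands in the mid-30s, the same neighborhood as the slowdown reported around layer 32. That is consistent with the remaining layers being offloaded to CPU, but it does not prove it.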
Can you try this example here with sequential_update? You'll need to install from source for this.
Yes, I used sequential_update=True. Here is my code. If this is not set, it will use more GPU memory and cause OOM.
import argparse
from typing import Dict, Union

import flash_attn
import psutil
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

print(flash_attn.__version__)


def custom_offload_device_map(
    model_stub: str,
    max_memory_per_gpu: Union[str, int],
    max_memory_gpu0: Union[str, int],
    num_gpus: int = 1,
    offload_buffers: bool = False,
    **model_kwargs,
) -> Dict[Union[int, str], Union[int, str]]:
    """
    Calculates the optimal gpu mappings for model_stub stored as torch_dtype, where
    each GPU is restricted to allocating a specific amount of memory.

    :param model_stub: local path or HF stub to calculate mapping for
    :param max_memory_per_gpu: Max memory to allocate on each GPU, as either a string
        such as "10GB" or an integer number of bytes
    :param max_memory_gpu0: Max memory to allocate on GPU 0, same format as above
    :param num_gpus: number of gpus to utilize
    :param offload_buffers: passed through to infer_auto_device_map
    :param model_kwargs: keyword arguments to pass to model initializer
    :return: memory mapping for layers of model_stub to be passed to from_pretrained()
    """
    # Whatever does not fit within the GPU limits is offloaded to available CPU RAM.
    max_cpu_memory = psutil.virtual_memory().available
    memory_limits = {device: max_memory_per_gpu for device in range(1, num_gpus)}
    memory_limits[0] = max_memory_gpu0
    memory_limits["cpu"] = max_cpu_memory

    # Build the model without allocating weights, only to compute the mapping.
    with init_empty_weights():
        dummy_model = AutoModelForCausalLM.from_pretrained(model_stub, **model_kwargs)
        device_map = infer_auto_device_map(
            dummy_model,
            max_memory=memory_limits,
            no_split_module_classes=dummy_model._no_split_modules,
            offload_buffers=offload_buffers,
        )
        del dummy_model

    return device_map


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", type=str, default=None)
    parser.add_argument("--dataset-dir", type=str, default=None)
    parser.add_argument("--save-dir", type=str, default=None)
    parser.add_argument("--max-memory-per-gpu", type=str, default="35GB")
    parser.add_argument("--max-memory-gpu0", type=str, default="35GB")
    parser.add_argument("--device-map", type=str, default="auto")
    parser.add_argument("--num-samples", type=int, default=512)
    # store_true rather than type=bool: with type=bool any non-empty value parses as True.
    parser.add_argument("--offload-buffers", action="store_true")
    args = parser.parse_args()

    NUM_CALIBRATION_SAMPLES = args.num_samples
    MAX_SEQUENCE_LENGTH = 2048

    # Load dataset.
    ds = load_dataset(args.dataset_dir, split="train_sft")
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)

    # Preprocess the data into the format the model is trained with.
    def preprocess(example):
        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

    ds = ds.map(preprocess)

    # Tokenize the data (be careful with bos tokens - we need add_special_tokens=False
    # since the chat_template already added it).
    def tokenize(sample):
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )

    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # Configure the quantization algorithm to run.
    recipe = GPTQModifier(
        targets="Linear", scheme="W4A16", ignore=["lm_head"], sequential_update=True
    )

    num_gpus = 8
    if args.device_map == "cpu":
        device_map = "cpu"
    else:
        device_map = custom_offload_device_map(
            args.model_id,
            max_memory_per_gpu=args.max_memory_per_gpu,
            max_memory_gpu0=args.max_memory_gpu0,
            num_gpus=num_gpus,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            offload_buffers=args.offload_buffers,
        )

    model = SparseAutoModelForCausalLM.from_pretrained(
        args.model_id,
        trust_remote_code=True,
        device_map=device_map,
        torch_dtype=torch.bfloat16,
    )

    # Apply quantization.
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(args.save_dir, save_compressed=True)
    tokenizer.save_pretrained(args.save_dir)
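A side note on boolean command-line options in scripts like this one: with argparse, type=bool does not parse the string "False" as false, because Python treats any non-empty string as truthy. The sketch below shows the pitfall next to the usual action="store_true" alternative. The --buggy-flag option and the parse helper are made up purely for illustration.

```python
import argparse
from typing import List


def parse(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Pitfall: bool("False") is True, so any non-empty value enables the option.
    parser.add_argument("--buggy-flag", type=bool, default=False)
    # Safer: the option stays False unless the bare flag is present.
    parser.add_argument("--offload-buffers", action="store_true")
    return parser.parse_args(argv)


explicit_false = parse(["--buggy-flag", "False"])
print(explicit_false.buggy_flag)       # the string "False" still enables it

bare_flag = parse(["--offload-buffers"])
print(bare_flag.offload_buffers)       # enabled only because the flag was passed
```

With store_true, buffer offloading is only turned on when the flag is actually given on the command line.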
neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great! How about DeepSeek-Coder-V2-Instruct in W8A8 (INT8)? I think DeepSeek-Coder-V2-Instruct-W8A8 would be great. Or are there any instructions that would help me quantize DeepSeek-Coder-V2-Instruct to W8A8 (INT8) myself?
Hello! Currently in vllm, we only support FP8 inference for MoE models. We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.
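For readers unfamiliar with the naming, scheme labels such as W4A16, W8A16, and W8A8 in this thread encode the weight bit-width followed by the activation bit-width. The small parser below only illustrates that convention. The function is hypothetical and not part of llm-compressor or vLLM, and labels like FP8 do not follow this pattern, so it rejects them.

```python
import re
from typing import Tuple


def parse_scheme(name: str) -> Tuple[int, int]:
    """Split a 'W<bits>A<bits>' label into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", name.strip().upper())
    if match is None:
        raise ValueError(f"not a WxAy scheme name: {name!r}")
    return int(match.group(1)), int(match.group(2))


for scheme in ("W4A16", "W8A16", "W8A8"):
    weight_bits, activation_bits = parse_scheme(scheme)
    print(f"{scheme}: {weight_bits}-bit weights, {activation_bits}-bit activations")
```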
How should I load a W4A16 version of deepseek-v2 in vLLM that was compressed using llm-compressor?
I used quantization=compressed-tensors, but it throws an error:
File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in __init__
    assert self.quant_method is not None
Release v0.5.6 will support it. Need this PR: https://github.com/vllm-project/vllm/pull/7766
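To check whether an installed vLLM is new enough, a version comparison along these lines can help. This is a minimal sketch that takes the v0.5.6 threshold from the reply above at face value. The helper names are hypothetical, and the parsing deliberately ignores suffixes such as rc1 or .post1, so a pre-release of 0.5.6 would be treated like the final release.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '0.5.6' or '0.5.6.post1' into (0, 5, 6); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_required_vllm(installed: str, required: str = "0.5.6") -> bool:
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        print(has_required_vllm(version("vllm")))
    except PackageNotFoundError:
        print("vllm is not installed")
```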
Is this PR still in progress? Do you have an estimated timeline?
@robertgshaw2-neuralmagic I used this framework with 512 calibration samples to quantize the deepseek-v2.5 model, but the quantized model only outputs "!!". Are there any tricks for quantizing this model? Here is my script:
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import argparse
from typing import Dict, Union
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
import psutil
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModelForCausalLM
import flash_attn
from datasets import load_dataset

print(flash_attn.__version__)


def custom_offload_device_map(
    model_stub: str,
    max_memory_per_gpu: Union[str, int],
    max_memory_gpu0: Union[str, int],
    num_gpus: int = 1,
    offload_buffers: bool = False,
    **model_kwargs,
) -> Dict[Union[int, str], Union[int, str]]:
    """
    Calculates the optimal gpu mappings for model_stub stored as torch_dtype, where
    each GPU is restricted to allocating a specific amount of memory.

    :param model_stub: local path or HF stub to calculate mapping for
    :param max_memory_per_gpu: Max memory to allocate on each GPU, as either a string
        such as "10GB" or an integer number of bytes
    :param num_gpus: number of gpus to utilize
    :param model_kwargs: keyword arguments to pass to model initializer
    :return: memory mapping for layers of model_stub to be passed to from_pretrained()
    """
    max_cpu_memory = psutil.virtual_memory().available
    memory_limits = {device: max_memory_per_gpu for device in range(1, num_gpus)}
    memory_limits[0] = max_memory_gpu0
    memory_limits["cpu"] = max_cpu_memory

    with init_empty_weights():
        dummy_model = AutoModelForCausalLM.from_pretrained(model_stub, **model_kwargs)
        device_map = infer_auto_device_map(
            dummy_model,
            max_memory=memory_limits,
            no_split_module_classes=dummy_model._no_split_modules,
            offload_buffers=offload_buffers
        )
        del dummy_model

    return device_map


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", type=str, default="/opt/tiger/deepseek_http/models--deepseek-ai--DeepSeek-V2.5")
    parser.add_argument("--dataset-dir", type=str,
                        default="/opt/tiger/deepseek_http/datasets--HuggingFaceH4--ultrachat_200k")
    parser.add_argument("--max-memory-per-gpu", type=str, default="52GB")
    parser.add_argument("--max-memory-gpu0", type=str, default="52GB")
    parser.add_argument("--device-map", type=str, default='auto')
    parser.add_argument("--num-samples", type=int, default=512)
    parser.add_argument("--offload-buffers", action='store_true')
    parser.add_argument("--max-model-len", type=int, default=8192)
    parser.add_argument("--sequential-update", action='store_true')
    parser.add_argument("--dataset-split", type=str, default='train_sft')
    args = parser.parse_args()

    # Select calibration dataset.
    DATASET_ID = args.dataset_dir
    DATASET_SPLIT = args.dataset_split
    MAX_SEQUENCE_LENGTH = args.max_model_len
    NUM_CALIBRATION_SAMPLES = args.num_samples

    # Load dataset and preprocess.
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    def preprocess(example):
        if 'messages' in example:
            messages = example['messages']
        elif 'input' in example and 'output' in example:
            messages = [
                {
                    "role": "user",
                    "content": example['input']
                },
                {
                    "role": "assistant",
                    "content": example['output']
                }
            ]
        else:
            raise ValueError("invalid example")
        return {
            "text": tokenizer.apply_chat_template(
                messages,
                tokenize=False,
            )
        }

    ds = ds.map(preprocess)

    # Tokenize inputs.
    def tokenize(sample):
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )

    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # Define an llmcompressor recipe for W4A16 quantization.
    recipe = GPTQModifier(
        targets="Linear", scheme="W4A16", ignore=["lm_head"], sequential_update=args.sequential_update
    )

    if args.device_map == "cpu":
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True
        )
    else:
        device_map = custom_offload_device_map(
            model_stub=args.model_id,
            max_memory_per_gpu=args.max_memory_per_gpu,
            max_memory_gpu0=args.max_memory_gpu0,
            num_gpus=8,
            offload_buffers=args.offload_buffers,
            trust_remote_code=True
        )
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
        )

    SAVE_DIR = args.model_id + '-W4A16'

    oneshot(
        model=model, dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(SAVE_DIR, save_compressed=True,
                          skip_compression_stats=True)
    tokenizer.save_pretrained(SAVE_DIR)
Thanks @fengyang95 - @dsikka is looking into this
Hey @fengyang95 - investigating this issue. Will update once fixed. Thanks!
Hi @fengyang95 - can you share the code you're using which generates !!!?
We have also added this example which you can follow: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w8a8.py
You can swap the model to the larger model and the scheme to W4A16.
You'll need to use the latest main to pull in a fix that was needed for deepseek_v2.
python3 -m vllm.entrypoints.openai.api_server --model DeepSeek-V2.5-W4A16 --served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 16384 --port $PORT0 --gpu-memory-utilization 0.9 --quantization compressed-tensors --enforce-eager
Thank you, I'll try it right away.
Hi @dsikka , I followed your suggestion to ignore the gate parameter and updated the code. However, the quantized model still outputs "!!!". Have you tested this on DeepSeek-v2.5?
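When smoke-testing a freshly quantized checkpoint, output that is empty or consists of a single repeated character such as "!!!" often points to broken logits rather than an ordinary quality regression. A minimal sketch of a helper that flags this pattern follows. The function name and the length threshold are illustrative assumptions, not part of llm-compressor or vLLM.

```python
def looks_degenerate(text: str, min_length: int = 2) -> bool:
    """Flag empty output or output made of one repeated character (whitespace ignored)."""
    stripped = "".join(text.split())
    if len(stripped) < min_length:
        # Empty output counts as degenerate; a lone short token does not.
        return not stripped
    return len(set(stripped)) == 1


samples = ["!!!", "!! !!", "", "Hello! How can I help?"]
for sample in samples:
    print(repr(sample), looks_degenerate(sample))
```

Running a handful of short prompts through the quantized model and applying a check like this makes it easier to compare behavior before and after a fix lands.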
Hi @fengyang95 there was a bug in vLLM which has now been fixed on main. Do you mind trying it again? We have also added a W4A16 end-to-end example: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py
I'll try it asap
Please comment here any model requests for:
llm-compressor