huggingface / huggingface-llama-recipes


[BUG] missing `quantization_config` in fp8-405B.ipynb #38

Closed · ianporada closed 1 month ago

ianporada commented 1 month ago

huggingface-llama-recipes/fp8-405B.ipynb says

Let's load the model. To quantize the model on the fly, we pass a quantization_config:

But no quantization_config is passed. Maybe the model was intended to be loaded with a quantization_config such as:

from transformers import AutoModelForCausalLM, QuantoConfig

quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=quantization_config)

@SunMarc

ianporada commented 1 month ago

Or it looks like this line was just left over from the previous template. In that case I'm wondering if torch_dtype="auto" should be used rather than torch_dtype=torch.bfloat16 when loading "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8" since certain weights are F8_E4M3.

SunMarc commented 1 month ago

Hey @ianporada, thanks for the report. The line is indeed a leftover from a previous template. Since the model is already quantized, you don't need to specify the quantization_config. Would you like to submit a PR to fix the notebook? Thanks!
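
For context, a minimal sketch of what the corrected cell could look like (assuming the pre-quantized FP8 checkpoint and the torch_dtype="auto" option discussed above; not the exact contents of the PR):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# No quantization_config: the checkpoint is already FP8-quantized.
# torch_dtype="auto" lets transformers use the dtypes stored in the
# checkpoint (e.g. F8_E4M3) instead of forcing bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)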

ianporada commented 1 month ago

Sure! Created a pull request: https://github.com/huggingface/huggingface-llama-recipes/pull/40

I'm also curious why some of the 405B-FP8 weights are stored in FP32, which makes those tensors larger than in the original model, but that's a separate question, so I've asked in the forum: https://discuss.huggingface.co/t/why-are-some-weights-fp32-in-llama-3-1-405b-fbgemm-fp8-quantiziation/108922

SunMarc commented 1 month ago

Thanks!