huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Mixtral-8x7B-Instruct-v0.1-GPTQ AssertionError #1742

Open paolovic opened 8 months ago

paolovic commented 8 months ago

System Info

optimum: 1.18.0.dev0
transformers: 4.36.0
auto-gptq: 0.6.0.dev0+cu118
CUDA: 11.8
Python: 3.8.17

Who can help?

No response

Reproduction (minimal, reproducible, runnable)

Hi, I am trying to deploy Mixtral-8x7B-Instruct-v0.1-GPTQ in 4-bit precision with Ray.

Unfortunately, it keeps failing with the following error message:

          The deployment failed to start 3 times in a row. This may be due to a problem with its constructor or initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
          ray::ServeReplica:Mixtral_8x7B:ModelAPI.initialize_and_get_metadata() (pid=x, ip=x, actor_id=x, repr=<ray.serve._private.replica.ServeReplica:Mixtral_8x7B:ModelAPI object at x>)
            File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 437, in result
              return self.__get_result()
            File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 389, in __get_result
              raise self._exception
            File "/ray_env/lib64/python3.8/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
              raise RuntimeError(traceback.format_exc()) from None
          RuntimeError: Traceback (most recent call last):
            File "/ray_env/lib64/python3.8/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
              await self._initialize_replica()
            File "/ray_env/lib64/python3.8/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
              await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
            File "/ray_env/lib64/python3.8/site-packages/ray/serve/api.py", line 243, in __init__
              cls.__init__(self, *args, **kwargs)
            File "/ray/serve_mixtral.py", line 32, in __init__
              self._pipe = pipeline("text-generation", model=self._path,
            File "/ray_env/lib64/python3.8/site-packages/transformers/pipelines/__init__.py", line 870, in pipeline
              framework, model = infer_framework_load_model(
            File "/ray_env/lib64/python3.8/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
              model = model_class.from_pretrained(model, **kwargs)
            File "/ray_env/lib64/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
              return model_class.from_pretrained(
            File "/ray_env/lib64/python3.8/site-packages/transformers/modeling_utils.py", line 3523, in from_pretrained
              model = quantizer.convert_model(model)
            File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 229, in convert_model
              self._replace_by_quant_layers(model, layers_to_be_replaced)
            File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
              self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
            File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
              self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
            File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
              self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
            [Previous line repeated 1 more time]
            File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 282, in _replace_by_quant_layers
              new_layer = QuantLinear(
            File "/ray_env/lib64/python3.8/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 68, in __init__
              assert outfeatures % 32 == 0
          AssertionError

The AutoGPTQ maintainers say it is an issue with optimum.

Thank you in advance

Expected behavior

The model is deployed without errors.

hyaticua commented 8 months ago

I am encountering this issue with AutoGPTQ and Mixtral as well, and I am seeing a similar error with AutoAWQ and Mixtral:

ValueError: OC is not multiple of cta_N = 64
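
Both messages look like alignment checks on a layer's output dimension: the exllama kernel asserts `outfeatures % 32 == 0`, and the AutoAWQ message presumably refers to output channels ("OC") needing to be a multiple of 64. The sketch below is a hedged illustration of which layer shape could trip such checks. The layer names and sizes are assumptions based on the published Mixtral-8x7B configuration (hidden size 4096, intermediate size 14336, 8 key/value heads of size 128, 8 local experts), not values reported in this thread, and `layers_failing_alignment` is a hypothetical helper, not part of optimum, AutoGPTQ, or AutoAWQ.

```python
# ASSUMPTION: output dimensions of the linear layers in one Mixtral-8x7B
# decoder block, derived from the published model config rather than from
# this thread. Only the first expert is listed; the others share its shapes.
MIXTRAL_OUT_FEATURES = {
    "self_attn.q_proj": 4096,
    "self_attn.k_proj": 1024,
    "self_attn.v_proj": 1024,
    "self_attn.o_proj": 4096,
    "block_sparse_moe.gate": 8,  # router: one logit per expert
    "block_sparse_moe.experts.0.w1": 14336,
    "block_sparse_moe.experts.0.w2": 4096,
    "block_sparse_moe.experts.0.w3": 14336,
}


def layers_failing_alignment(out_features, multiple):
    """Return the layer names whose output dimension is not a multiple of `multiple`.

    Hypothetical helper for illustration; it mirrors the shape of the check
    `assert outfeatures % 32 == 0` seen in the traceback.
    """
    return [name for name, size in out_features.items() if size % multiple != 0]
```

Calling `layers_failing_alignment(MIXTRAL_OUT_FEATURES, 32)` or the same with `64` flags only the router layer under these assumed shapes. If that is indeed the cause, the failure would come from trying to replace a layer that was never meant to use the quantized kernel, rather than from the checkpoint itself.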

YeonwooSung commented 8 months ago

I am also facing the same issue. Any progress?

hyaticua commented 8 months ago

It seems like if you use AutoGPTQ/AutoAWQ directly you can get something working.

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0")

from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(model_path)

paolovic commented 8 months ago

Thank you @hyaticua, I will give this a try.
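
For readers who want to try the direct-loading workaround end to end, here is a minimal sketch, assuming `auto-gptq` and `transformers` are installed and a CUDA device is available. The function names `load_quantized_directly` and `generate_text` are hypothetical, and `model_path` stands for whatever local path or model ID holds the GPTQ checkpoint. This has not been verified against the exact versions listed in the issue.

```python
def load_quantized_directly(model_path, device="cuda:0"):
    """Load a GPTQ checkpoint through AutoGPTQ instead of transformers' pipeline().

    Hypothetical helper. Imports are done lazily so this module can be
    imported on a machine without auto-gptq installed.
    """
    from auto_gptq import AutoGPTQForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoGPTQForCausalLM.from_quantized(model_path, device=device)
    return tokenizer, model


def generate_text(tokenizer, model, prompt, device="cuda:0", max_new_tokens=64):
    """Tokenize `prompt`, generate a continuation, and decode it to a string."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_quantized_directly(model_path)` followed by `generate_text(tokenizer, model, "Hello")`. This sidesteps the `pipeline(...)` call that appears in the traceback, so it is a workaround rather than a fix for the optimum code path.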

IlyasMoutawwakil commented 8 months ago

Hi, can you provide a minimal code snippet to reproduce this issue, along with a link to the original issue in AutoGPTQ?

paolovic commented 8 months ago

Hi @IlyasMoutawwakil, the original issue is https://github.com/AutoGPTQ/AutoGPTQ/issues/486 and it also includes a code snippet. I am almost certain that using AutoGPTQForCausalLM will solve my problem. As soon as I have some time, I will provide a snippet myself.

DhruvaBansal00 commented 2 months ago

@IlyasMoutawwakil @hyaticua @paolovic Any updates on this issue? I think it's quite important for us to be able to load GPTQ models successfully using AutoModelForCausalLM.from_pretrained.
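
The traceback in the original report fails inside `from_pretrained`, at `quantizer.convert_model`, so the problem can probably be exercised without Ray or `pipeline`. The sketch below is a hedged guess at such a reproduction, not a snippet supplied by anyone in this thread. `load_with_transformers` is a hypothetical name, `model_path` is whatever path or model ID holds the GPTQ checkpoint, and `device_map="auto"` assumes `accelerate` is installed and a GPU is available.

```python
def load_with_transformers(model_path):
    """Load a GPTQ checkpoint through the transformers code path.

    Hypothetical helper. With the versions listed in the issue, this is the
    call path that the traceback shows reaching optimum's
    `quantizer.convert_model`.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
```

If this call raises the same `AssertionError` outside of Ray, that would confirm the failure is independent of the serving layer.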