huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Inference bug with MoE GPTQ models #30515

Closed bozheng-hit closed 5 months ago

bozheng-hit commented 6 months ago

System Info

Generation with GPTQ models fails with the following error after merging this PR: https://github.com/huggingface/transformers/pull/30209 @younesbelkada @SunMarc

The error trace is below; the model generates successfully once I revert the change to modeling_qwen2_moe.py.

Traceback (most recent call last):
  File "/home/data/roy.zb/workspace/test_auto_gptq.py", line 23, in <module>
    generated_ids = model.generate(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 1656, in generate
    result = self._sample(
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 2819, in _sample
    outputs = self(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1355, in forward
    outputs = self.model(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1224, in forward
    layer_outputs = decoder_layer(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 934, in forward
    hidden_states = self.mlp(hidden_states)
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 856, in forward
    final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
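
For context, here is a hedged, simplified sketch of the expert-dispatch loop around the failing `index_add_` call (illustrative names, not the exact modeling_qwen2_moe.py source). When no token is routed to a given expert, `top_x` is empty, and that empty-selection path appears to be where the GPTQ/exllamav1 kernels trip the CUDA launch error:

import torch
import torch.nn.functional as F

# Hedged, simplified sketch of the sparse MoE dispatch; names are illustrative.
def moe_dispatch(hidden_states, experts, router_logits, top_k):
    # hidden_states: (num_tokens, hidden_dim), router_logits: (num_tokens, num_experts)
    num_experts = len(experts)
    routing_weights = F.softmax(router_logits, dim=-1)
    routing_weights, selected_experts = torch.topk(routing_weights, top_k, dim=-1)

    final_hidden_states = torch.zeros_like(hidden_states)
    # (num_experts, top_k, num_tokens) mask of which tokens each expert handles
    expert_mask = F.one_hot(selected_experts, num_classes=num_experts).permute(2, 1, 0)

    for expert_idx in range(num_experts):
        idx, top_x = torch.where(expert_mask[expert_idx])
        # If no token is routed to this expert, top_x is empty. A guard like
        # `if top_x.shape[0] == 0: continue` used to skip this case; without it,
        # running the quantized expert on an empty batch and the index_add_
        # below is where "invalid configuration argument" surfaces.
        current_state = hidden_states[top_x]
        current_hidden_states = experts[expert_idx](current_state) * routing_weights[top_x, idx, None]
        final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
    return final_hidden_states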

Who can help?

No response

Information

Tasks

Reproduction

The code to reproduce the error is here:

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

The error information is the same traceback as shown above under System Info.

Expected behavior

The model should output text like the following:

A large language model is a type of artificial intelligence that is trained to understand and generate human language. These models are designed to process and comprehend natural language input, and can be used for a variety of tasks such as language translation, sentiment analysis, and chatbot development. They are typically very large neural networks that have been pre-trained on vast amounts of text data, allowing them to learn the nuances of language and make intelligent predictions about how to respond to different inputs. Large language models have become increasingly popular in recent years due to their ability to handle complex language tasks and their potential applications in fields such as customer service, content creation, and education.

amyeroberts commented 6 months ago

cc @younesbelkada @SunMarc

SunMarc commented 6 months ago

Hi @bozheng-hit, thanks for reporting! I can indeed reproduce the error, and it also happens with the Mixtral models. I'm not sure what the best fix is for now, since adding back the `top_x.shape[0] == 0` condition would break fx tracing for Qwen MoE and Mixtral, and inference works fine on the original model. WDYT @amyeroberts @ArthurZucker? LMK if you come up with a solution @bozheng-hit, I will try to find one too.

amyeroberts commented 6 months ago

@SunMarc Would having a conditional check, e.g. `if not is_tracing() and top_x.shape[0] == 0:`, work as a partial fix?
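
For illustration, a minimal sketch of what such a guard could look like around the expert update (the `_is_tracing` helper and function names are assumptions, not an actual patch; per the reply below, this alone did not resolve the GPTQ failure):

import torch
from torch.fx import Proxy

def _is_tracing(t: torch.Tensor) -> bool:
    # Assumption: treat torch.jit tracing or an fx Proxy input as "tracing",
    # i.e. contexts where data-dependent Python branching must be avoided.
    return torch.jit.is_tracing() or isinstance(t, Proxy)

def guarded_expert_update(final_hidden_states, expert, hidden_states, top_x, weight):
    # Skip experts that received no tokens, but only in eager mode so that
    # symbolic tracing still sees a branch-free graph.
    if not _is_tracing(top_x) and top_x.shape[0] == 0:
        return final_hidden_states
    current = expert(hidden_states[top_x]) * weight
    final_hidden_states.index_add_(0, top_x, current.to(hidden_states.dtype))
    return final_hidden_states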

SunMarc commented 6 months ago

Thanks for the tip @amyeroberts, but it doesn't work ;). However, I tested the exllamav2 kernel and it works. The exllamav1 kernel must have some issue. A potential fix would be to change the `quantization_config` inside the `config.json` so that users use the exllamav2 kernel by default. WDYT @bozheng-hit? You would have to set `version` to 2 in the `exllama_config` field here.
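
For reference, a hedged sketch of how a user could opt into the exllamav2 kernel at load time instead of editing config.json, assuming the documented `GPTQConfig(..., exllama_config={"version": 2})` override applies when loading an already-quantized checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Request the exllamav2 kernel when loading the quantized checkpoint.
# Assumption: this load-time GPTQConfig overrides the exllama settings
# stored in the checkpoint's config.json.
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=gptq_config,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

Setting `"exllama_config": {"version": 2}` under `quantization_config` in the checkpoint's config.json, as suggested above, would make this the default for all users.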

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

SunMarc commented 5 months ago

Closing this since I think the issue is fixed!