huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Running a `forward` pass before `generate` with AWQ fused modules breaks it #28470

Closed IlyasMoutawwakil closed 4 months ago

IlyasMoutawwakil commented 7 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

from transformers import AutoModelForCausalLM, AwqConfig, AutoTokenizer

awq_config = AwqConfig(do_fuse=True, fuse_max_seq_len=512)
model = AutoModelForCausalLM.from_pretrained(
    "casperhansen/tinyllama-1b-awq",
    quantization_config=awq_config,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained("casperhansen/tinyllama-1b-awq")
input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt").input_ids.to("cuda")

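# running a plain forward pass first is what makes the generate call below fail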
model.forward(input_ids)
model.generate(input_ids, max_new_tokens=100)

Expected behavior

The code works when only `generate` is called, but not when a `forward` pass precedes it. Looking at the traceback:

Traceback (most recent call last):
  File "/workspace/llm-perf/test_.py", line 29, in <module>
    model.generate(input_ids, max_new_tokens=100)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1033, in forward
    attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 372, in _prepare_4d_causal_attention_mask_for_sdpa
    expanded_4d_mask = attn_mask_converter.to_4d(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 136, in to_4d
    expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
RuntimeError: The size of tensor a (9) must match the size of tensor b (25) at non-singleton dimension 3

The problem seems to be related to the SDPA integration.
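
For illustration, here is a minimal standalone sketch (plain PyTorch only, not the actual transformers internals; the lengths 9 and 25 are simply taken from the error message) of how `masked_fill` fails when the causal 4D mask and the expanded padding mask disagree on the key-length dimension:

# Minimal sketch, assuming the mismatch comes from extra length kept around by
# the fused AWQ modules after the first forward pass. The lengths below are
# hypothetical and only mirror the numbers in the traceback.
import torch

query_len = 9   # tokens fed to generate()
key_len = 25    # key length the causal mask was built for

causal_4d_mask = torch.zeros(1, 1, query_len, key_len)
expanded_attn_mask = torch.zeros(1, 1, query_len, query_len, dtype=torch.bool)

try:
    causal_4d_mask.masked_fill(expanded_attn_mask, torch.finfo(torch.float32).min)
except RuntimeError as e:
    print(e)  # size mismatch at the last (key) dimension, as in the traceback above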

ArthurZucker commented 7 months ago

cc @younesbelkada and @fxmarty. If they use a static cache, then that is expected. I might fix it in #27931

ArthurZucker commented 6 months ago

cc @younesbelkada 🤗

VictorSanh commented 5 months ago

Was this fixed? I just ran into the same error here

younesbelkada commented 5 months ago

Thanks everyone! I managed to reproduce this, and the fix should be https://github.com/casper-hansen/AutoAWQ/pull/401 cc @casper-hansen

Note that if you run a dummy forward pass before `generate`, you need to explicitly pass `use_cache=False`:

from transformers import AutoModelForCausalLM, AwqConfig, AutoTokenizer

awq_config = AwqConfig(do_fuse=True, fuse_max_seq_len=512)
model = AutoModelForCausalLM.from_pretrained(
    "casperhansen/tinyllama-1b-awq",
    quantization_config=awq_config,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained("casperhansen/tinyllama-1b-awq")
input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt").input_ids.to("cuda")

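# disable the cache for the warm-up forward pass so the generate call below is not affected by it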
model.forward(input_ids, use_cache=False)
model.generate(input_ids, max_new_tokens=100)

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

younesbelkada commented 4 months ago

Closing as this is fixed in the latest AutoAWQ release (see the message above). Let me know if this issue is still relevant.