Traceback (most recent call last):
  File "t.py", line 39, in <module>
    outputs = self._model.generate(input_ids=encodings["input_ids"], **generation_config)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1496, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 661, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/nllb_moe/modeling_nllb_moe.py", line 1170, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/nllb_moe/modeling_nllb_moe.py", line 702, in forward
    hidden_states, router_states = self.ffn(hidden_states, attention_mask)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/nllb_moe/modeling_nllb_moe.py", line 484, in forward
    expert_output *= 1 - self.moe_token_dropout
RuntimeError: result type Float can't be cast to the desired output type Byte
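For context, the failing line `expert_output *= 1 - self.moe_token_dropout` is an in-place multiply, and PyTorch refuses to write a Float result back into an integer tensor. Here is a minimal, standalone sketch of that casting rule (it assumes `expert_output` ends up with an integer dtype such as `uint8`; why it does so in this setup is the open question):

```python
import torch

# Stand-in for expert_output with the problematic Byte (uint8) dtype.
expert_output = torch.ones(4, dtype=torch.uint8)
moe_token_dropout = 0.1

try:
    # In-place: the Float result would have to be cast back to Byte -> error.
    expert_output *= 1 - moe_token_dropout
except RuntimeError as e:
    print(e)  # result type Float can't be cast to the desired output type Byte

# Out-of-place: type promotion produces a new float tensor, no error.
promoted = expert_output * (1 - moe_token_dropout)
print(promoted.dtype)  # torch.float32
```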
Hey! As a quick fix, I would set `moe_token_dropout` to 0. Otherwise I'm not sure why the dtype is wrong. cc @younesbelkada in case you know of a quick fix on the modeling code?
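The suggested workaround could look like the sketch below. The checkpoint name `facebook/nllb-moe-54b` is an assumption (the issue does not name the exact checkpoint), and the sketch assumes the modeling code skips the dropout rescaling entirely when `moe_token_dropout` is 0:

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-moe-54b"  # assumed checkpoint name

# Disable MoE token dropout so the failing line
# `expert_output *= 1 - self.moe_token_dropout` is never reached.
config = AutoConfig.from_pretrained(model_name)
config.moe_token_dropout = 0.0

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, config=config, device_map="auto")
```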
System Info
- GPU: NVIDIA RTX A6000 (48 GB VRAM)
- transformers version: 4.34.0
- Platform: Linux 5.15.0-69-generic
- Python version: 3.8.10
- huggingface_hub version: 0.18.0
- safetensors version: 0.4.0
- accelerate version: 0.23.0
- PyTorch version: 2.1.0+cu118
- bitsandbytes version: 0.41.1
Who can help?
No response
Reproduction
Running generation with the NLLB-MoE model raises the error message shown in the traceback at the top of this issue.
Expected behavior
`generate` should return the translated text without raising an error.
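For reference, the expected usage can be reconstructed roughly as follows. This is a hypothetical sketch, not the reporter's actual script: the checkpoint name, language codes, and input sentence are all assumptions, since only line 39 of `t.py` appears in the traceback:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-moe-54b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")

encodings = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(
    input_ids=encodings["input_ids"],
    # Force the target language token (French here) as the first decoded token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=50,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```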