huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

IDEFICS can't use inputs_embeds in generate function #34033

Open aohenuo opened 4 days ago

aohenuo commented 4 days ago

System Info

When I use inputs_embeds instead of input_ids, the IDEFICS model's generate function returns an error:

"You passed inputs_embeds to .generate(), but the model class IdeficsForVisionText2Text doesn't have its forwarding implemented. See the GPT2 implementation for an example (Generate: decoder-only models can generate with inputs_embeds by gante · Pull Request #21405 · huggingface/transformers), and feel free to open a PR with it!"

However, in IdeficsForVisionText2Text's definition, I find that forward already accepts inputs_embeds. The following is defined at line 1541 of the code:

class IdeficsForVisionText2Text(IdeficsPreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
    _tied_weights_keys = ["model.embed_tokens.weight", "lm_head.weight"]

    def __init__(self, config, vision_model=None):
        super().__init__(config)
        self.model = IdeficsModel(config)

        self.lm_head = IdeficsDecoupledLinear(
            in_features=config.hidden_size,
            out_features=config.vocab_size,
            out_additional_features=config.additional_vocab_size,
            bias=False,
            partially_freeze=config.freeze_lm_head,
        )

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    def tie_weights(self):
        """
        Overwrite `transformers.modeling_utils.PreTrainedModel.tie_weights` to handle the case of
        IdeficsDecoupledLinear and IdeficsDecoupledEmbedding.
        """
        output_embeddings = self.get_output_embeddings()
        input_embeddings = self.get_input_embeddings()

        if getattr(self.config, "tie_word_embeddings", True):
            output_embeddings.weight = input_embeddings.weight
            if input_embeddings.num_additional_embeddings > 0:
                assert output_embeddings.out_additional_features == input_embeddings.num_additional_embeddings
                output_embeddings.additional_fc.weight = input_embeddings.additional_embedding.weight

        if hasattr(output_embeddings, "out_features") and hasattr(input_embeddings, "num_embeddings"):
            output_embeddings.out_features = input_embeddings.num_embeddings
            if hasattr(output_embeddings, "out_additional_features") and hasattr(
                input_embeddings, "num_additional_embeddings"
            ):
                output_embeddings.out_additional_features = input_embeddings.num_additional_embeddings

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=IdeficsCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        pixel_values: Optional[torch.FloatTensor] = None,
        image_encoder_embeddings: Optional[torch.FloatTensor] = None,
        perceiver_embeddings: Optional[torch.FloatTensor] = None,
        image_attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        interpolate_pos_encoding: Optional[bool] = False,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple, IdeficsCausalLMOutputWithPast]:

So why can't generate() use this forward to generate from inputs_embeds? I'd be very grateful if you could help solve this problem 🙏
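
If it helps, my reading of the GenerationMixin source is that for decoder-only models generate() checks whether inputs_embeds appears in the signature of the model's prepare_inputs_for_generation rather than of forward (this is only a paraphrase and the exact lines may differ across versions; model here stands for a loaded IdeficsForVisionText2Text):

import inspect

# rough paraphrase of the check generate() runs before accepting inputs_embeds;
# when this is False, it raises the "doesn't have its forwarding implemented" error
has_inputs_embeds_forwarding = "inputs_embeds" in set(
    inspect.signature(model.prepare_inputs_for_generation).parameters.keys()
)
print(has_inputs_embeds_forwarding)  # I expect False here, which would explain the error

So the forward definition above accepting inputs_embeds does not seem to be enough on its own; prepare_inputs_for_generation apparently has to forward it too.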

Who can help?

@zucchini-nlp @patrickvonplaten

Information

Tasks

Reproduction

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda:1" if torch.cuda.is_available() else "cpu"
# We feed to the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "In this picture from Asterix and Obelix, we can see"
    ],
]
# load model and processor
model = IdeficsForVisionText2Text.from_pretrained("HuggingFaceM4/idefics-9b", torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b")
# --batched mode
inputs = processor(prompts, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
inputs_embeds = model.model.embed_tokens(inputs["input_ids"])
inputs["input_ids"] = None

generated_ids = model.generate(inputs_embeds=inputs_embeds, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")

Expected behavior

generate() should accept inputs_embeds and produce text instead of raising the error above.

zucchini-nlp commented 4 days ago

Yes, you're right! The model accepts inputs_embeds, but this was not implemented for generate() specifically. It seems our tests were silently skipping it. I'll work on it and open a PR today :)

Btw, is there any reason why passing embeds is preferred in VLMs over input ids with pixel values? Just out of curiosity.

aohenuo commented 4 days ago

First of all, thank you very much for helping to resolve this bug!

Recently I have been working on research that uses PEFT to fine-tune multimodal models, which requires me to feed the model only embeddings. My goal is to generate the desired results from the embeddings I pass in, so generate() needs to be able to perform autoregressive generation directly from inputs_embeds.
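
As a concrete illustration, here is a minimal sketch of what I mean, reusing model and inputs from the reproduction above (prompt_embeds and its length are made up for illustration, not taken from a specific PEFT config):

import torch

# embed the tokenized prompt with the model's input embedding layer
token_embeds = model.model.embed_tokens(inputs["input_ids"])  # (batch, seq_len, hidden)

# stand-in for learned soft-prompt vectors produced by PEFT-style tuning
prompt_embeds = torch.zeros(
    token_embeds.shape[0], 8, token_embeds.shape[-1],
    dtype=token_embeds.dtype, device=token_embeds.device,
)

# prepend the learned vectors and generate autoregressively from the embeddings alone
inputs_embeds = torch.cat([prompt_embeds, token_embeds], dim=1)
generated_ids = model.generate(inputs_embeds=inputs_embeds, max_length=100)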