huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Contrastive Search in .generate() function doesn't work with Half #21151

Closed: sam-ulrich1 closed this issue 1 year ago

sam-ulrich1 commented 1 year ago

System Info

The CLI fails, but this is irrelevant to the problem.

Who can help?

@gante

Information

Tasks

Reproduction

  1. Load any model like so:

     model = AutoModelForCausalLM.from_pretrained(
         "<PATH>",
         torch_dtype=torch.float16,
     )

  2. Perform generation using contrastive search (a combined, runnable sketch of both steps follows below):

     gen_tokens = model.generate(
         tokenized_input.input_ids,
         top_k=4,
         penalty_alpha=0.6,
     )
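
For reference, a combined, end-to-end sketch of the two steps above. This is an illustration, not the original reproduction: "distilgpt2" (which appears later in the thread) stands in for the elided "<PATH>" checkpoint, and the tokenizer setup is assumed.

# Combined sketch of steps 1 and 2; "distilgpt2" is a stand-in checkpoint.
# On CPU, fp16 generation is expected to hit the reported Half error.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2", torch_dtype=torch.float16)

tokenized_input = tokenizer(["My cat is"], return_tensors="pt")
gen_tokens = model.generate(
    tokenized_input.input_ids,
    top_k=4,
    penalty_alpha=0.6,
)
print(tokenizer.batch_decode(gen_tokens))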

Expected behavior

Contrastive search should probably work with torch.float16 (if not, just let me know; I don't know whether there are stability issues).

This can be fixed by adding the following code to https://github.com/huggingface/transformers/blob/25ddd91b249014d818fb2ed3d4ba856ed9a5653e/src/transformers/generation/utils.py#L1873

# conditionally convert from float16
if context_hidden.dtype == torch.float16:
    context_hidden = context_hidden.to(dtype=torch.float32)
if next_hidden.dtype == torch.float16:
    next_hidden = next_hidden.to(dtype=torch.float32)
if top_k_probs.dtype == torch.float16:
    top_k_probs = top_k_probs.to(dtype=torch.float32)
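
For context, here is a minimal sketch (with assumed tensor shapes, not the exact library code) of the contrastive ranking math these three tensors feed into, with the conditional upcast applied first:

import torch

def rank_candidates(context_hidden, next_hidden, top_k_probs, alpha=0.6):
    # Assumed shapes for illustration:
    #   context_hidden: [batch*top_k, seq_len, hidden]
    #   next_hidden:    [batch*top_k, 1, hidden]
    #   top_k_probs:    [batch, top_k]
    if context_hidden.dtype == torch.float16:
        context_hidden = context_hidden.float()
    if next_hidden.dtype == torch.float16:
        next_hidden = next_hidden.float()
    if top_k_probs.dtype == torch.float16:
        top_k_probs = top_k_probs.float()

    # cosine similarity between each candidate's hidden state and the context
    norm_context = context_hidden / context_hidden.norm(dim=2, keepdim=True)
    norm_next = next_hidden / next_hidden.norm(dim=2, keepdim=True)
    cosine = torch.matmul(norm_context, norm_next.transpose(1, 2)).squeeze(-1)  # [batch*top_k, seq_len]
    degeneration_penalty, _ = torch.max(cosine, dim=-1)                          # [batch*top_k]

    # contrastive objective: model confidence minus degeneration penalty
    scores = (1.0 - alpha) * top_k_probs.view(-1) - alpha * degeneration_penalty
    return scores.view(top_k_probs.shape).argmax(dim=-1)  # best candidate per batch item
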
gante commented 1 year ago

Hey @sam-ulrich1 πŸ‘‹

To be candid, fp16 was not a concern when writing contrastive search :) I've tried adding your suggested change and running the script below, but that was not enough to fix it:

from transformers import GPT2Tokenizer, OPTForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m", padding_side='left')
model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16)

inputs = tokenizer(["My cat is"], return_tensors="pt")

outputs = model.generate(**inputs, top_k=4, penalty_alpha=0.6)
print(tokenizer.batch_decode(outputs))

Would you be able to share a snippet of what you're trying to run? :)

sam-ulrich1 commented 1 year ago

Odd! It works on my machine (pun intended)!

Let me get my version and other info and I can make a PR if you'd like. That way we can work from code, not just snippets.

sam-ulrich1 commented 1 year ago

@gante Could you share your traceback? I'll take a look at this later today

gante commented 1 year ago

@sam-ulrich1 haha roles reversed, usually I'm the one asking for tracebacks!

Traceback (most recent call last):
  File "/home/joao/transformers/../joao_scripts/dbg.py", line 17, in <module>
    outputs = model.generate(**inputs, top_k=4, penalty_alpha=0.6)
  File "/home/joao/hf/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/joao/transformers/src/transformers/generation/utils.py", line 1372, in generate
    return self.contrastive_search(
  File "/home/joao/hf/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/joao/transformers/src/transformers/generation/utils.py", line 1769, in contrastive_search
    outputs = self(
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/joao/transformers/src/transformers/models/opt/modeling_opt.py", line 934, in forward
    outputs = self.model.decoder(
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/joao/transformers/src/transformers/models/opt/modeling_opt.py", line 645, in forward
    inputs_embeds = self.project_in(inputs_embeds)
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/joao/hf/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
sam-ulrich1 commented 1 year ago

Ya I got a kick out of that too!

It actually looks like that is an OPT issue with Half. I'm playing around with CodeGen so that would be my reference but I know other models are affected as well. Currently the problem I'm targeting is "baddbmm_with_gemm" not implemented for 'Half'

I'll take a look at the OPT thing as well but if it's out of scope I'll probably start another issue to keep the tracking simple.

sam-ulrich1 commented 1 year ago

@gante I'm not gonna get this done today but I'll get it knocked out by the end of the week. I just have a bit busier week than I expected

gante commented 1 year ago

@sam-ulrich1 no worries :) and let me know if you need a hand!

sam-ulrich1 commented 1 year ago

@gante How do I run the tests in the repo? I added the test below (at the link that follows) so that I can validate my fix. I want to run this test on the CodeGen model, but I've never worked with a testing setup like this: https://github.com/huggingface/transformers/blob/0359e2e15f4504513fd2995bdd6dd654c747b313/tests/generation/test_utils.py#L1432

    def test_contrastive_generate_fp16(self):
        # check `generate()` and `contrastive_search()` are equal
        for model_class in self.all_generative_model_classes:

            # won't fix: FSMT and Reformer have a different cache variable type (and format).
            if any(model_name in model_class.__name__.lower() for model_name in ["fsmt", "reformer"]):
                return

            config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()

            # NOTE: contrastive search only works with cache on at the moment.
            if not hasattr(config, "use_cache"):
                return
            config.use_cache = True
            config.is_decoder = True
            config.torch_dtype = torch.float16

            # test old generation output for backwards compatibility
            model = model_class(config).to(torch_device).eval()
            output_contrastive, output_generate = self._contrastive_generate(
                model=model, input_ids=input_ids, attention_mask=attention_mask, max_length=max_length
            )
            self.assertListEqual(output_contrastive.tolist(), output_generate.tolist())
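
One caveat worth flagging (an assumption about the test setup, not something verified in the thread): config.torch_dtype is normally only honored by from_pretrained(), so a model built directly from a config may still carry fp32 weights. Casting explicitly would make the test exercise actual half-precision weights:

# Hypothetical tweak: cast the freshly constructed model to fp16 explicitly,
# since building a model from a config does not apply config.torch_dtype by itself.
model = model_class(config).to(torch_device).half().eval()
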
gante commented 1 year ago

@sam-ulrich1 try py.test tests/ -k contrastive_generate_fp16 -vv, assuming you are in .../transformers/.

(tests/ is the folder containing the test files, -k filters tests by name, and contrastive_generate_fp16 matches the name of your new test)

sam-ulrich1 commented 1 year ago

Thanks!

sam-ulrich1 commented 1 year ago

@gante Okay, it seems to be fixed, but there is one model that fails the test for (what appears to be) an unrelated problem. What's the procedure for this? Can y'all accept a PR if not all the tests pass?

Here's the failing model:

FAILED tests/models/git/test_modeling_git.py::GitModelTest::test_contrastive_generate_fp16 - RuntimeError: output with shape [10, 1, 1, 1] doesn't match the broadcast shape [10, 1, 1, 4]

And the pytest stack trace:

___________________________________________________________________________________________________________ GitModelTest.test_contrastive_generate_fp16 ____________________________________________________________________________________________________________

self = <tests.models.git.test_modeling_git.GitModelTest testMethod=test_contrastive_generate_fp16>

    def test_contrastive_generate_fp16(self):
        # check `generate()` and `contrastive_search()` are equal
        for model_class in self.all_generative_model_classes:

            # won't fix: FSMT and Reformer have a different cache variable type (and format).
            if any(model_name in model_class.__name__.lower() for model_name in ["fsmt", "reformer"]):
                return

            config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()

            # NOTE: contrastive search only works with cache on at the moment.
            if not hasattr(config, "use_cache"):
                return
            config.use_cache = True
            config.is_decoder = True
            config.torch_dtype = torch.float16
            print(config)

            # test old generation output for backwards compatibility
            model = model_class(config).to(torch_device).eval()
>           output_contrastive, output_generate = self._contrastive_generate(
                model=model, input_ids=input_ids, attention_mask=attention_mask, max_length=max_length
            )

tests/generation/test_utils.py:1453: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/generation/test_utils.py:655: in _contrastive_generate
    output_generate = model.generate(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27: in decorate_context
    return func(*args, **kwargs)
src/transformers/generation/utils.py:1321: in generate
    return self.contrastive_search(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27: in decorate_context
    return func(*args, **kwargs)
src/transformers/generation/utils.py:1804: in contrastive_search
    outputs = self(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py:1194: in _call_impl
    return forward_call(*input, **kwargs)
src/transformers/models/git/modeling_git.py:1478: in forward
    outputs = self.git(
../../../anaconda3/envs/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py:1194: in _call_impl
    return forward_call(*input, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = GitModel(
  (embeddings): GitEmbeddings(
    (word_embeddings): Embedding(99, 32, padding_idx=98)
    (position_embedd...n_features=768, out_features=32, bias=True)
      (1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
  )
)
input_ids = tensor([[36],
        [64],
        [41],
        [89],
        [58],
        [72],
        [41],
        [ 2],
        [36],
        [64]], device='cuda:0')
attention_mask = tensor([[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]], device='cuda:0')
position_ids = None, pixel_values = None, head_mask = [None, None, None, None, None], inputs_embeds = None, past_key_values = None, use_cache = True, output_attentions = False, output_hidden_states = True, return_dict = True

    @add_start_docstrings_to_model_forward(GIT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        pixel_values: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
        r"""
        past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).

        Returns:

        Examples:

        ```python
        >>> from transformers import AutoProcessor, AutoModel
        >>> import requests
        >>> from PIL import Image

        >>> processor = AutoProcessor.from_pretrained("microsoft/git-base")
        >>> model = AutoModel.from_pretrained("microsoft/git-base")

        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> text = "this is an image of two cats"

        >>> inputs = processor(text, images=image, return_tensors="pt")

        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        seq_length = input_shape[1]

        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        projected_visual_features = None
        if pixel_values is not None:
            if pixel_values.ndim == 4:
                # here we assume pixel_values is of shape (batch_size, num_channels, height, width)
                visual_features = self.image_encoder(pixel_values).last_hidden_state

            elif pixel_values.ndim == 5:
                # here we assume pixel_values is of shape (batch_size, num_frames, num_channels, height, width)
                visual_features = []
                for frame_idx in range(pixel_values.shape[1]):
                    visual_features_frame = self.image_encoder(pixel_values[:, frame_idx, :, :]).last_hidden_state
                    visual_features_frame += self.img_temperal_embedding[frame_idx]
                    visual_features.append(visual_features_frame)

                # finally, concatenate all features along sequence dimension
                visual_features = torch.cat(visual_features, dim=1)

            else:
                raise ValueError("pixel_values must be of rank 4 or 5")

            projected_visual_features = self.visual_projection(visual_features)

        embedding_output = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            past_key_values_length=past_key_values_length,
        )

        if projected_visual_features is None:
            projected_visual_features = torch.zeros(
                (embedding_output.shape[0], 0, embedding_output.shape[2]),
                dtype=embedding_output.dtype,
                device=embedding_output.device,
            )

        # Repeat visual features to match embedding batch size.
        projected_visual_features = projected_visual_features.repeat(
            embedding_output.size(0) // projected_visual_features.size(0), 1, 1
        )

        # concatenate patch token and text token embeddings
        hidden_states = torch.cat((projected_visual_features, embedding_output), dim=1)

        # By default, an additive causal mask is created
        # for masking the future (one direction).
        tgt_mask = self._generate_future_mask(seq_length, embedding_output.dtype, embedding_output.device)

        # Create an attention mask of shape (batch_size, 1, tgt_seq_len, src_seq_len)
        combined_attention_mask = self.create_attention_mask(
            tgt=embedding_output,
            memory=projected_visual_features,
            tgt_mask=tgt_mask,
            past_key_values_length=past_key_values_length,
        )

        if attention_mask is not None:
            # if the user provides an attention mask, we add it to the default one
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            expanded_attn_mask = _expand_mask(attention_mask, embedding_output.dtype, tgt_len=input_shape[-1]).to(
                embedding_output.device
            )
            if past_key_values_length > 0:
                expanded_attn_mask = expanded_attn_mask[:, :, -past_key_values_length:, :]
            else:
>               combined_attention_mask[:, :, -input_shape[1] :, -input_shape[1] :] += expanded_attn_mask
E               RuntimeError: output with shape [10, 1, 1, 1] doesn't match the broadcast shape [10, 1, 1, 4]
gante commented 1 year ago

Oh yeah, GIT is a bit different -- it's a multimodal model that requires careful manipulations at generate time. Open a PR with what you have now, I think I can figure out what's wrong with GIT after I have access to the changes :)

ArthurZucker commented 1 year ago

Jumping in here: the error RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' just means that Half only works on GPU and should not be used on CPU πŸ˜‰
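
A minimal sketch of that advice, reusing the OPT checkpoint from the snippet above: pick fp16 only when a GPU is available and fall back to fp32 on CPU.

# Use fp16 only on GPU, fall back to fp32 on CPU, since several CPU kernels
# (e.g. addmm) have no Half implementation.
import torch
from transformers import GPT2Tokenizer, OPTForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m", padding_side="left")
model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=dtype).to(device)

inputs = tokenizer(["My cat is"], return_tensors="pt").to(device)
outputs = model.generate(**inputs, top_k=4, penalty_alpha=0.6)
print(tokenizer.batch_decode(outputs))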

sam-ulrich1 commented 1 year ago

That would make a lot of sense! I didn't address that error in this fix. I focused on "baddbmm_with_gemm" not implemented for 'Half' but I can take a look at that error over the weekend if you'd like

sam-ulrich1 commented 1 year ago

@gante The fix is here, rebased to the latest commit on main, but the PR guidelines are kinda long, so I won't be able to create the PR until later: https://github.com/gage-technologies/transformers

mallorbc commented 1 year ago

I am having this issue as well. I tried 4.26 and 4.25.1. I am gonna try @sam-ulrich1's solution.

mallorbc commented 1 year ago

The fix did not help, neither with DeepSpeed nor with vanilla Transformers. Using bfloat16 gives me the expected results (but I need float16 for DeepSpeed).

mallorbc commented 1 year ago

I take back what I said. I am not having this issue at all. With or without @sam-ulrich1's fix, it is working fine. The issue is with DeepSpeed.

apolinario commented 1 year ago

I'm also facing a similar issue:

from transformers import pipeline

# `prompt` is the input text, defined elsewhere
generator = pipeline("text2text-generation", model="philschmid/flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit": True, "device_map": "auto"})
output = generator(prompt, penalty_alpha=0.6, top_k=4, max_length=256)

Gives me the error:

RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'

So contrastive search seems incompatible with loading the model in 8-bit. Is that expected or a bug?

gante commented 1 year ago

@sam-ulrich1 do you have some updates on your end? I can open a PR from the changes in your fork, if you're interested :)

sam-ulrich1 commented 1 year ago

@gante Shoot! Sorry man, this slipped my mind. Let me take a look at the PR guidelines again and see if I can get mine rebased and prepped; if not, I'm happy to let you open it.

Thanks man!

apolinario commented 1 year ago

Just to flag, the error I faced here still exists with @sam-ulrich1's fix. Should I open a new Issue as this may be related specifically to 8-bit?

sam-ulrich1 commented 1 year ago

@gante I'm gonna look at this today. Sorry man, I've been slammed with work the past month

sam-ulrich1 commented 1 year ago

@gante If you want to just snag my changes go ahead otherwise I will eventually get to this it's just been a really tough few weeks

gante commented 1 year ago

BTW I'm not sure if this fix is still needed; I am unable to reproduce the issue on main.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2", torch_dtype=torch.float16).to("cuda")

inputs = tok(["This cat is"], return_tensors="pt").to("cuda")
gen_out = model.generate(**inputs, top_k=4, penalty_alpha=0.6)

If someone else comes across this issue, please let me know πŸ™

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.