huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`return_dict` does not work in `modeling_t5.py`: I set `return_dict=True` but a tuple is returned #19960

Closed · CaffreyR closed this 1 year ago

CaffreyR commented 1 year ago

System Info

Who can help?

@patrickvonplaten Many thanks!

Information

Tasks

Reproduction

I am using the code from Facebook Research's FiD, and I try to run this code:

for i, batch in enumerate(dataloader):
    (idx, labels, _, context_ids, context_mask) = batch
    outputs = model(
        input_ids=context_ids.cuda(),
        attention_mask=context_mask.cuda(),
        labels=labels.cuda(),
        return_dict=True,
        head_mask=head_mask,
        decoder_head_mask=decoder_head_mask,
    )

And it reports an error:

  File "/home/user/anaconda3/envs/uw/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 1695, in forward
    encoder_last_hidden_state=encoder_outputs.last_hidden_state,
AttributeError: 'tuple' object has no attribute 'last_hidden_state'
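
Just to double-check what the error means, I tried a tiny standalone comparison (not FiD code, purely an illustration): a ModelOutput such as `BaseModelOutput` allows both indexing and attribute access, while a plain tuple only allows indexing.

import torch
from transformers.modeling_outputs import BaseModelOutput

hidden = torch.zeros(1, 4, 8)  # dummy hidden states, shape is arbitrary
as_model_output = BaseModelOutput(last_hidden_state=hidden)
as_tuple = (hidden,)

print(as_model_output.last_hidden_state.shape)  # works
print(as_tuple[0].shape)                        # works
# as_tuple.last_hidden_state                    # AttributeError, as in the traceback above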

So I went to this line to inspect the output of the T5 encoder: https://github.com/huggingface/transformers/blob/v4.23.1/src/transformers/models/t5/modeling_t5.py#L1609

So I added a print right after that call:

encoder_outputs = self.encoder(
    input_ids=input_ids,
    attention_mask=attention_mask,
    inputs_embeds=inputs_embeds,
    head_mask=head_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
print(type(encoder_outputs), "@@@", return_dict)
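
A couple of extra prints at the same spot might narrow it down further (just a sketch; `self.config.use_return_dict` is the fallback that transformers applies when the `return_dict` argument is `None`):

# extra debug prints at the same place in modeling_t5.py (illustrative only)
print("return_dict arg:", return_dict)
print("config.use_return_dict:", self.config.use_return_dict)
# if a subclass swapped self.encoder for its own wrapper, this will show it
print("encoder class:", type(self.encoder).__name__)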

Expected behavior

It prints `<class 'tuple'> @@@ True`, so I set `return_dict=True` but a tuple is returned.

gante commented 1 year ago

Hi @CaffreyR 👋 At first glance at our code base, I don't see how that bug can arise 🤔 Can you share a script or a notebook where the issue can be reproduced?

CaffreyR commented 1 year ago

Hi @gante, yes of course! Many thanks! The code is here: https://github.com/CaffreyR/FiD, with small revisions to https://github.com/facebookresearch/FiD. The problem is here: https://github.com/CaffreyR/FiD/blob/main/train_reader.py#L63.

The transformers version used by this code is different from the one in my experiment. (This is the easiest script for you to reproduce with.) Please follow the steps in the README at https://github.com/facebookresearch/FiD#download-data to prepare the data (it is fairly large), and then try to run:

# after preparing the data as described above
python train_reader.py \
        --use_checkpoint \
        --train_data open_domain_data/NQ/train.json \
        --eval_data open_domain_data/NQ/dev.json \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_experiment \
        --checkpoint_dir checkpoint

This dataset is NaturalQuestions, and it is a little tricky to get the data prepared. So I am very grateful for your help! :)

Thank you very much!

gante commented 1 year ago

Hey @CaffreyR -- with a long script it's hard to pinpoint the issue :) We need a short reproducible script, otherwise we will not prioritize this issue.

CaffreyR commented 1 year ago

Hi @gante, it is very interesting: when I try the following code, it runs successfully. The batch is the same as in FiD; only the model is different. The original Facebook code inherits from and wraps the T5 model.

import torch
import transformers
model = transformers.T5ForConditionalGeneration.from_pretrained('t5-base')
# model = src.model.FiDT5(t5.config)
# model.load_t5(t5.state_dict())
context_ids=torch.tensor([[[  822,    10,     3,     9,   538,   213,  1442,  9481,  1936, 10687,
            999,  2233,    10,  1862, 12197,    16,  1547,  2625,    10,  1862,
          12197,    16,  1547,    37,  1862, 12197,    16,  1547,  2401,     7,
             12,     3,     9,  1059,   116,  2557, 11402,    47, 12069,   139,
             46,  2913,   358,   788,    12,     8,  9284,    13,   941,  2254,
             11,   748,   224,    38,     8,   169,    13,   306,  6339,    53,
           1196,    41, 15761,   553,    61,  7299,     6,     3, 29676,     6,
          21455,  2465,     6,  6256,  9440,     7,     6,    11, 20617,   277,
              5,   100,    47,   294,    13,     8,  2186,  1862,  9481, 14310,
          16781,    57, 13615,  7254,    40,   402,   122,     6,    84, 11531,
             26, 10687,   585,    11,   748,    12,   993, 10687,  7596,    16,
              8,  2421,   296,     5,    37,  1862, 12197,   441,  1547,     3,
          28916,    16,     8,   778,  8754,     7,    24,  2237,    12,    46,
            993,    16,   542,  8273,   999,     6,   902,    16, 27864,     6,
           3504, 21247,     6,    11, 31251, 22660,     5,     1,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]])

labels=torch.tensor([[1547,    1]])
context_mask=torch.tensor([[[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True,  True,  True,  True,  True,  True,  True, False, False,
          False, False, False, False, False, False, False, False, False, False,
          False, False, False, False, False, False, False, False, False, False,
          False, False, False, False, False, False, False, False, False, False,
          False, False, False, False, False, False, False, False, False, False,
          False, False, False, False, False, False, False, False, False, False]]])

# print(context_ids)
# print(labels)
# print(context_mask)
n_layers, n_heads = 12, 12
head_importance = torch.zeros(n_layers, n_heads).to('cpu')
attn_entropy = torch.zeros(n_layers, n_heads).to('cpu')
head_mask = torch.ones(n_layers, n_heads).to('cpu')
head_mask.requires_grad_(requires_grad=True)
decoder_head_mask = torch.ones(n_layers, n_heads).to('cpu')
decoder_head_mask.requires_grad_(requires_grad=True)

if context_ids is not None:
    # inputs might have already been resized in the generate method
    # if context_ids.dim() == 3:
    #     self.encoder.n_passages = context_ids.size(1)
    context_ids = context_ids.view(context_ids.size(0), -1)
if context_mask is not None:
    context_mask = context_mask.view(context_mask.size(0), -1)

outputs = model.forward(
                input_ids=context_ids,
                attention_mask=context_mask,
                labels=labels,
                return_dict=True,
                head_mask=head_mask,
                decoder_head_mask=decoder_head_mask
            )

# outputs = model(
#                 input_ids=context_ids.cuda(),
#                 attention_mask=context_mask.cuda(),
#                 labels=labels.cuda(),
#                 return_dict=True,
#                 head_mask=head_mask.cuda(),
#                 decoder_head_mask=decoder_head_mask.cuda()
#             )
print(outputs)

It might be a problem with the inheritance, I don't know; it just behaves differently when I try to simplify the code. :( Here is the overridden forward from the FiD model for reference:

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        if input_ids is not None:
            # inputs might have already been resized in the generate method
            if input_ids.dim() == 3:
                self.encoder.n_passages = input_ids.size(1)
            input_ids = input_ids.view(input_ids.size(0), -1)
        if attention_mask is not None:
            attention_mask = attention_mask.view(attention_mask.size(0), -1)
        return super().forward(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **kwargs
        )
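
Looking at the FiD code a bit more, I suspect the tuple comes from the encoder wrapper rather than from this forward override. Below is a rough sketch of that kind of wrapper (my own reconstruction in the spirit of FiD's transformers==3.0.2-era code, not a verified copy): it re-packs the encoder output as a plain tuple, so the T5 forward downstream sees a tuple even when return_dict=True.

import torch

class EncoderWrapperSketch(torch.nn.Module):
    # Illustrative stand-in for an FiD-style encoder wrapper; the name and
    # details here are my own, written against transformers 3.x tuple outputs.
    def __init__(self, encoder, n_passages=1):
        super().__init__()
        self.encoder = encoder
        self.n_passages = n_passages

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        bsz, total_length = input_ids.shape  # total_length = n_passages * passage_length
        passage_length = total_length // self.n_passages
        input_ids = input_ids.view(bsz * self.n_passages, passage_length)
        attention_mask = attention_mask.view(bsz * self.n_passages, passage_length)
        outputs = self.encoder(input_ids, attention_mask, **kwargs)
        # Re-packing into a plain tuple drops the ModelOutput wrapper,
        # so callers get a tuple even when return_dict=True was requested.
        return (outputs[0].view(bsz, self.n_passages * passage_length, -1),) + outputs[1:]

If that is what happens, it would match the traceback above: modeling_t5.py receives a tuple from self.encoder and fails on .last_hidden_state.
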
gante commented 1 year ago

@CaffreyR then it's almost surely an upstream problem -- I noticed it uses transformers==3.0.2, which may explain the issue you're seeing :)

While I can't provide support in these situations (the problem is not present in transformers), my advice would be to open an issue in FiD and/or to try to monkey-patch their problematic model code.
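
If you go the monkey-patch route, something along these lines might be a starting point (untested; it assumes the FiD model exposes its wrapped encoder as `model.encoder` and that the wrapper returns plain tuples):

from transformers.modeling_outputs import BaseModelOutput

original_encoder_forward = model.encoder.forward

def encoder_forward_returning_dict(*args, **kwargs):
    outputs = original_encoder_forward(*args, **kwargs)
    if isinstance(outputs, tuple):
        # re-wrap the tuple so downstream T5 code can access .last_hidden_state
        outputs = BaseModelOutput(last_hidden_state=outputs[0])
    return outputs

model.encoder.forward = encoder_forward_returning_dict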

CaffreyR commented 1 year ago

OK then, I will give it a try! Thanks!!!