huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pix2Struct: unable to overfit on a single training sample #22903

Closed arnaudstiegler closed 1 year ago

arnaudstiegler commented 1 year ago

Who can help?

@younesbelkada

Reproduction

Here's the minimal training loop:

import requests
import torch
from PIL import Image
from torch.optim import AdamW
from transformers import Pix2StructForConditionalGeneration, AutoProcessor

torch.manual_seed(42)

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base")
processor = AutoProcessor.from_pretrained("google/pix2struct-base")

dummy_target = "The model should overfit this sentence"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Encode the single (image, target) pair the model should overfit.
encoded_image = processor(images=image, return_tensors="pt")
encoded_text = processor(text=dummy_target, return_tensors="pt", max_length=20)
optimizer = AdamW(model.parameters(), lr=1e-4)

model.train()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
flattened_patches = encoded_image.flattened_patches.to(device)
attention_mask = encoded_image.attention_mask.to(device)
labels = encoded_text.input_ids.to(device)

for i in range(1000):
    outputs = model(
        flattened_patches=flattened_patches,
        attention_mask=attention_mask,
        labels=labels,
    )
    loss = outputs.loss

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Every 50 steps, generate from the same sample to inspect progress.
    if i % 50 == 0:
        model.eval()
        prediction = model.generate(
            flattened_patches=flattened_patches,
            attention_mask=attention_mask,
        )
        print(f"step: {i} train_loss: {loss.item()} prediction: {processor.batch_decode(prediction)}")
        model.train()

Here's the output I got:

step: 0 train_loss: 8.259493827819824 prediction: ['<pad> <img_src=cropped-img-20180924']
step: 50 train_loss: 1.9695181846618652 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 100 train_loss: 2.071323871612549 prediction: ['<pad> <The model should overfit this sentence should overfit this sentence should overfit this sentence should']
step: 150 train_loss: 2.0366554260253906 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 200 train_loss: 1.8225889205932617 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 250 train_loss: 1.6568734645843506 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 300 train_loss: 1.6770282983779907 prediction: ['<pad> The model should overfit this sentence sentence should overfit this sentence sentence should overfit this sentence']
step: 350 train_loss: 1.688515067100525 prediction: ['<pad> The model should overfit this sentence sentence overfit this sentence sentence overfit this sentence sentence over']
step: 400 train_loss: 1.6118296384811401 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 450 train_loss: 1.6204414367675781 prediction: ['<pad> The model should overfit this sentence sentence should overfit this sentence should overfit this sentence should']
step: 500 train_loss: 1.59645676612854 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 550 train_loss: 1.5818239450454712 prediction: ['<pad> The model should overfit this sentence sentence sentence sentence sentence sentence sentence sentence sentence sentence sentence sentence sentence']
step: 600 train_loss: 1.5775129795074463 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 650 train_loss: 1.561257243156433 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 700 train_loss: 1.5319150686264038 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 750 train_loss: 1.646193504333496 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 800 train_loss: 1.533736228942871 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 850 train_loss: 1.6203268766403198 prediction: ['<pad> The model should overfit this sentence should overfit this sentence should overfit this sentence should over']
step: 900 train_loss: 1.5132172107696533 prediction: ['<pad> The model should overfit this sentence sentence should overfit this sentence sentence should overfit this sentence']
step: 950 train_loss: 1.491452693939209 prediction: ['<pad> The model should overfit this sentence The model should overfit this sentence The model should overfit']

Expected behavior

I've been trying to fine-tune Pix2Struct starting from the base pretrained model, and have been unable to do so. The model collapses consistently and fails to overfit on that single training sample. I noticed a comment about this on the fine-tuning notebook: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb

Let's train the model! Simply run the cell below to train the model. We have observed that finding the best hyperparameters was quite challenging and required a lot of trial and error, as the model can easily enter "model collapse" (always predicting the same output, no matter the input) if the HPs are not chosen correctly. In this example, we found that using the AdamW optimizer with lr=1e-5 seemed to be the best approach.

To dig a little deeper, I trained on a single sample with a minimal training loop to see whether the model could learn it. It is unable to overfit that single training sample even after 1000 training steps. Unless I missed something in my training loop, that looks like odd behavior and might be a symptom of a bug somewhere?

younesbelkada commented 1 year ago

Hi, thanks for the detailed report; this does seem weird. I will have a look once I am back on Tuesday. cc also @NielsRogge and @nbroad1881 for visibility, as they have also been working on fine-tuning Pix2Struct.

arnaudstiegler commented 1 year ago

Thank you! Let me know if there's anything I can help with :)

NielsRogge commented 1 year ago

Yeah, I had a hard time fine-tuning Pix2Struct myself. However, looking at your code snippet, when you encode the target sequence:

from transformers import Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")

dummy_target = "The model should overfit this sentence"
encoded_text = processor(text=dummy_target, return_tensors='pt', max_length=20)

then when decoding back to text:

processor.decode(encoded_text.input_ids.squeeze())

prints:

'The model should overfit this sentence'

So this target sequence contains neither an EOS (end-of-sequence) token nor a BOS (beginning-of-sequence) token. Hence, when generating text with the generate() method, the model just continues predicting tokens, as this method only stops generating once the model predicts the EOS token. Since the model is trained never to produce the EOS token, it simply keeps generating text (hence you're getting 'The model should overfit this sentence should overfit this sentence' etc.). Also, the first token is <pad> because the model's BOS token equals the pad token, so you'll need to pass skip_special_tokens=True to the batch_decode method.
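A minimal sketch of the workaround (assuming the same processor and dummy_target as in the first post): append the EOS token to the labels before training, and pass skip_special_tokens=True when decoding.

import torch

# Sketch: make sure the labels end with EOS so generate() knows when to stop
# (processor and dummy_target as defined in the training script above).
encoded_text = processor(text=dummy_target, return_tensors="pt", max_length=20)
labels = encoded_text.input_ids
eos_id = processor.tokenizer.eos_token_id
if labels[0, -1].item() != eos_id:
    labels = torch.cat([labels, torch.tensor([[eos_id]])], dim=1)

# Decode predictions without the leading <pad> / trailing </s>:
# processor.batch_decode(prediction, skip_special_tokens=True)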

So cc @younesbelkada, we'll need to check that when a user sets the max length to 20, the tokenizer still appends the EOS token as the last token. It looks like the processor's tokenizer does have it set:

>>> processor.tokenizer.eos_token
'</s>'
arnaudstiegler commented 1 year ago

Oh yeah, you're right! I completely missed it, and it basically solves the generation issue after 50 steps.

step: 0 train_loss: 8.3875150680542 prediction: ['<pad> <img_alt=Tokyo is the cure for everything. img_src=']
step: 50 train_loss: 2.020235300064087 prediction: ['<pad> The model should overfit this sentence</s>']
step: 100 train_loss: 2.0110490322113037 prediction: ['<pad> The model should overfit this sentence</s>']
step: 150 train_loss: 1.728605031967163 prediction: ['<pad> The model should overfit this sentence</s>']
step: 200 train_loss: 1.678179144859314 prediction: ['<pad> The model should overfit this sentence</s>']
step: 250 train_loss: 1.6586235761642456 prediction: ['<pad> The model should overfit this sentence</s>']
step: 300 train_loss: 1.6816842555999756 prediction: ['<pad> The model should overfit this sentence</s>']
step: 350 train_loss: 1.6198171377182007 prediction: ['<pad> The model should overfit this sentence</s>']
step: 400 train_loss: 1.6187334060668945 prediction: ['<pad> The model should overfit this sentence</s>']
step: 450 train_loss: 1.6846977472305298 prediction: ['<pad> The model should overfit this sentence</s>']
step: 500 train_loss: 1.6047543287277222 prediction: ['<pad> The model should overfit this sentence</s>']
step: 550 train_loss: 1.585425853729248 prediction: ['<pad> The model should overfit this sentence</s>']
step: 600 train_loss: 1.5750995874404907 prediction: ['<pad> The model should overfit this sentence</s>']
step: 650 train_loss: 1.5516695976257324 prediction: ['<pad> The model should overfit this sentence</s>']
step: 700 train_loss: 1.5205081701278687 prediction: ['<pad> The model should overfit this sentence</s>']
step: 750 train_loss: 1.600045919418335 prediction: ['<pad> The model should overfit this sentence</s>']
step: 800 train_loss: 1.5451548099517822 prediction: ['<pad> The model should overfit this sentence</s>']
step: 850 train_loss: 1.602522373199463 prediction: ['<pad> The model should overfit this sentence</s>']

I think what remains weird is that the loss doesn't decrease below 1.5 even with that single training sample.

Anecdotally, I've been trying to fine-tune for some information extraction tasks, and I haven't been able to make it properly learn anything (I did check that there's an eos token in my labels when fine-tuning :) )

NielsRogge commented 1 year ago

Indeed, the loss should go down to 0. I notice two things here, one of which is that the loss computation applies label smoothing by default, which would keep the loss well above zero.

arnaudstiegler commented 1 year ago

Good catch, just tried without the label smoothing and the losses now look much more normal (a sketch of how I bypassed the built-in loss follows the log below):

step: 0 train_loss: 7.458827972412109 prediction: ['<pad> <img_alt=Towards a New Vision: A Vision for a New World Order']
step: 50 train_loss: 0.12852047383785248 prediction: ['<pad> The model should overfit this sentence</s>']
step: 100 train_loss: 0.010209576226770878 prediction: ['<pad> The Model should overfit this sentence</s>']
step: 150 train_loss: 0.0012781125260517001 prediction: ['<pad> The model should overfit this sentence</s>']
step: 200 train_loss: 0.014641670510172844 prediction: ['<pad> The model should overfit this sentence</s>']
step: 250 train_loss: 6.366522575262934e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 300 train_loss: 0.0005338654736988246 prediction: ['<pad> The model should overfit this sentence</s>']
step: 350 train_loss: 0.004032869823276997 prediction: ['<pad> The model should overfit this sentence</s>']
step: 400 train_loss: 3.196050602127798e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 450 train_loss: 1.0058114639832638e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 500 train_loss: 1.513927782070823e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 550 train_loss: 4.767631980939768e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 600 train_loss: 0.005966411903500557 prediction: ['<pad> The model should overfit this sentence</s>']
step: 650 train_loss: 9.983758673115517e-07 prediction: ['<pad> The model should overfit this sentence</s>']
step: 700 train_loss: 2.6761419576359913e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 750 train_loss: 0.03052591346204281 prediction: ['<pad> The model should overfit this sentence</s>']
step: 800 train_loss: 0.00021442778233904392 prediction: ['<pad> The model should overfit this sentence</s>']
step: 850 train_loss: 4.1449759009992704e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 900 train_loss: 0.0005854590563103557 prediction: ['<pad> The model should overfit this sentence</s>']
step: 950 train_loss: 6.643687083851546e-05 prediction: ['<pad> The model should overfit this sentence</s>']
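
For reference, here is roughly how the built-in loss can be bypassed: a minimal sketch that computes plain (unsmoothed) cross-entropy from the logits, with variable names following the training loop in the first post.

import torch.nn.functional as F

# Sketch: compute unsmoothed cross-entropy ourselves instead of using
# outputs.loss, whose computation applied label smoothing by default.
outputs = model(
    flattened_patches=flattened_patches,
    attention_mask=attention_mask,
    labels=labels,
)
logits = outputs.logits  # (batch, seq_len, vocab_size)
loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    labels.reshape(-1),
    ignore_index=-100,  # ignored positions in labels should be set to -100
)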
NielsRogge commented 1 year ago

Damn, not sure why I didn't check the code of the loss calculation before training a model myself 🙈 Hopefully this will also fix the fine-tuning runs on larger datasets.

arnaudstiegler commented 1 year ago

Trying it right now! Will keep you updated once I get the results back :)

arnaudstiegler commented 1 year ago

From my experiment, the training loss on larger datasets does get much lower (as expected), but that doesn't seem to solve the issue.

younesbelkada commented 1 year ago

Thanks everyone for digging into this. I feel we are close to solving the issue, so I propose we first address

https://github.com/huggingface/transformers/issues/22903#issuecomment-1518275840

in a PR, so that at least the loss behaves more "normally". @arnaudstiegler, how much lower does the loss get compared to previous runs? Any curves/stats you can share? Thinking out loud, I wonder whether your remaining issue is a hyperparameter issue.

arnaudstiegler commented 1 year ago

Losses overall look okay (with and without label smoothing), but there seems to be a disconnect between the loss values I'm getting (both training and validation) and the actual quality of the predicted string. A priori, that might indicate a bug somewhere in my training workflow, but I did check it thoroughly. I also ran a bunch of experiments on a single training batch, and as you reported in the notebook, the model can collapse with the wrong hyperparameters, especially if the target is a long string. Adding some warmup seems to help (see the sketch below), but the model still behaves in surprising ways even on a single training sample.
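
For context, the warmup setup I mean is along these lines, a sketch with illustrative values (the lr follows the notebook's suggestion; the step counts are arbitrary):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Sketch: AdamW plus linear warmup; step counts here are illustrative only.
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

# Inside the training loop, step the scheduler right after the optimizer:
# optimizer.step(); scheduler.step(); optimizer.zero_grad()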

I'm actually trying to swap out Donut for Pix2Struct, and Donut hasn't shown any of the brittleness I'm seeing with Pix2Struct. You're probably right that there might be a hyperparameter issue, but given the "limited" size of the model, I'm really surprised that it's so sensitive to HPs. Would love to hear other people's experience with fine-tuning Pix2Struct.

gbarello-uipath commented 1 year ago

I have also been trying to fine-tune Pix2Struct. I find that the losses go to zero very quickly, which made me suspect that the attention masks are not being set properly.

What I see is that in Pix2StructTextModel, self.config.is_decoder is set to False, causing this line to output a non-causal attention mask.

If I add self.config.is_decoder = True just above that line to force decoder behavior, things look more normal.
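
For anyone who wants to try this without editing library code, an equivalent config-level workaround, as a sketch (attribute names assume Pix2StructForConditionalGeneration, as in the first post):

from transformers import Pix2StructForConditionalGeneration

# Sketch: force the text decoder's config to report is_decoder=True so that
# the attention-mask logic builds a causal mask.
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base")
model.config.text_config.is_decoder = True
model.decoder.config.is_decoder = True  # may be the same object; setting both is harmless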

younesbelkada commented 1 year ago

Interesting! @arnaudstiegler, can you try this potential fix on your side and let us know how it goes?

arnaudstiegler commented 1 year ago

Yeah, the model seems to be learning well on a >3k-image dataset with the decoder config change. This seems to be the root cause. Really good catch @gbarello-uipath :)

gbarello-uipath commented 1 year ago

Glad it's working for you @arnaudstiegler!

I don't have a lot of experience in the guts of the transformers repo (hence my hacky fix inside the forward function :)), so could someone point me to the "right" place to make that fix? I looked into the configuration_pix2struct.py file, but haven't found the time yet to really dig in and fix it properly.

younesbelkada commented 1 year ago

This is really cool! @gbarello-uipath, I believe you would need to add the is_decoder=True keyword argument here: https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/pix2struct/configuration_pix2struct.py#L121 and also pass it through here (is_decoder=is_decoder) to fix the failing CI: https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/pix2struct/configuration_pix2struct.py#L147 Then get_extended_attention_mask should be called as expected. I would also advise double-checking that everything still works, just in case.
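
The shape of the change, sketched on a toy config rather than as the actual diff: default is_decoder to True and forward it to PretrainedConfig, which is what the mask logic consults.

from transformers import PretrainedConfig

# Toy sketch of the proposed fix (not the real Pix2StructTextConfig):
# default is_decoder to True and pass it up to PretrainedConfig.
class ToyPix2StructTextConfig(PretrainedConfig):
    model_type = "toy_pix2struct_text"

    def __init__(self, vocab_size=50244, is_decoder=True, **kwargs):
        self.vocab_size = vocab_size
        super().__init__(is_decoder=is_decoder, **kwargs)

config = ToyPix2StructTextConfig()
print(config.is_decoder)  # True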

younesbelkada commented 1 year ago

Let us know when you open a Pull Request for that! Otherwise, happy to do it as well.

gbarello-uipath commented 1 year ago

I would love to be an official contributor, even if it's just a one-line code change 😅 I will put together a PR shortly.

younesbelkada commented 1 year ago

Awesome! Thanks again for the fix

gbarello-uipath commented 1 year ago

Ok, so I am working on this PR. It works fine when instantiating a brand-new model, but when loading any of the pretrained models, the is_decoder=False flag is already saved in their configs, so the new default kwarg gets overwritten.

I suppose there isn't really a way for me to fix that directly. The only thing I can think of is to load the model, manually fix the config, and then push the updated model to the Hub. Is that the best way to fix the pretrained models?
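
Concretely, I am thinking of something like this sketch (pushing requires write access to the model repo, so only its owners could actually run it):

from transformers import Pix2StructForConditionalGeneration

# Sketch: load, patch the saved config, and push the fixed model back.
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base")
model.config.text_config.is_decoder = True
model.push_to_hub("google/pix2struct-base")  # requires write access to the repo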

younesbelkada commented 1 year ago

I see. The other solution would probably be to update the get_extended_attention_mask method to accept a new optional argument that forces decoder-like behavior, but I am not sure that is the right fix. If the only solution is to update the models on the Hub, I am happy to update them. cc @sgugger

sgugger commented 1 year ago

I think the pretrained model configs should be fixed directly.

gbarello-uipath commented 1 year ago

Ok @younesbelkada I created the PR: https://github.com/huggingface/transformers/pull/23051

Hopefully I have done everything correctly :)

If there is a way for me to also fix the pretrained model configs, let me know; otherwise, let me know when they are fixed!

younesbelkada commented 1 year ago

Let's close this issue, as we merged #23051! @NielsRogge has also made a nice tutorial at https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Pix2Struct Thanks everyone!

alejopaullier96 commented 1 year ago

@younesbelkada I shared a notebook on how to train a Matcha/Pix2Struct model for Kaggle's Benetech competition, in case anyone is interested. The model reached the silver zone and includes the fix from this issue.

younesbelkada commented 1 year ago

Thanks very much for sharing! It is really cool to see Matcha/Pix2Struct being used in winning notebooks in major Kaggle competitions 🔥