Closed arnaudstiegler closed 1 year ago
Hi thanks for the detailed report, indeed this seems weird. I will have a look at it once I am back on Tuesday. cc also @NielsRogge and @nbroad1881 for visibility as they have been also working on fine-tuning Pix2struct
Thank you! Let me know if there's anything I can help with :)
Yeah I had a hard time fine-tuning Pix2Struct myself. However looking at your code snippet, when you encode the target sequence:
from transformers import Pix2StructProcessor
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")
dummy_target = "The model should overfit this sentence"
encoded_text = processor(text=dummy_target, return_tensors='pt', max_length=20)
then when decoding back to text:
processor.decode(encoded_text.input_ids.squeeze())
prints:
'The model should overfit this sentence'
So this target sequence contains neither an EOS (end-of-sequence) token nor a BOS (beginning-of-sequence) token. Hence, when generating text using the generate() method, it will just continue predicting tokens, as this method only stops generating text when the model predicts the EOS token. As the model is trained to not produce the EOS token, it simply will keep on generating text. You're also getting '<pad>' at the start of the predictions since the model's BOS token is equal to the pad token, so you'll need to add skip_special_tokens=True to the batch_decode method.
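To make the stopping behavior concrete, here is a toy sketch of a greedy decoding loop. It is not the transformers generate() implementation; the token ids, toy_generate, and strip_special are all hypothetical names chosen for illustration. It only shows that a model which never emits EOS runs until the length cap, and that the leading pad/BOS id has to be filtered out when decoding.

```python
from typing import Callable, List

PAD_ID = 0  # assumed id; doubles as the decoder start (BOS) token in this toy
EOS_ID = 1  # assumed id standing in for '</s>'


def toy_generate(next_token: Callable[[List[int]], int], max_new_tokens: int) -> List[int]:
    """Greedy loop: start from PAD, stop on EOS or when the length cap is hit."""
    tokens = [PAD_ID]
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        tokens.append(token)
        if token == EOS_ID:
            break
    return tokens


def strip_special(tokens: List[int]) -> List[int]:
    """Drop PAD/EOS ids, similar in spirit to skip_special_tokens=True."""
    return [t for t in tokens if t not in (PAD_ID, EOS_ID)]


target = [5, 6, 7]  # hypothetical ids for the target sentence


def with_eos(prefix: List[int]) -> int:
    # Toy model trained on targets that end with EOS.
    step = len(prefix) - 1
    return target[step] if step < len(target) else EOS_ID


def without_eos(prefix: List[int]) -> int:
    # Toy model that never saw EOS: it keeps repeating the target.
    step = len(prefix) - 1
    return target[step % len(target)]


stopped = toy_generate(with_eos, max_new_tokens=10)
runaway = toy_generate(without_eos, max_new_tokens=10)
print(stopped, strip_special(stopped))
print(runaway)
```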
So cc @younesbelkada we'll need to check that, in case the user sets the max length to 20, then the tokenizer should set the EOS token as last token appropriately. It looks like the processor's tokenizer has this set:
>>> processor.tokenizer.eos_token
'</s>'
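Until the tokenizer handles this itself, one possible workaround is to make sure the label ids end with EOS before training. The sketch below works on plain Python lists; ensure_eos is a hypothetical helper (not part of transformers), and the ids are made up. In practice eos_id would presumably come from processor.tokenizer.eos_token_id.

```python
from typing import List


def ensure_eos(input_ids: List[int], eos_id: int, max_length: int) -> List[int]:
    """Truncate to max_length while guaranteeing that the last token is EOS."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    ids = list(input_ids)
    # Already well-formed: ends with EOS and fits within the budget.
    if ids and ids[-1] == eos_id and len(ids) <= max_length:
        return ids
    # Drop a trailing EOS so it is not counted twice after truncation.
    if ids and ids[-1] == eos_id:
        ids = ids[:-1]
    # Leave one slot for EOS, then append it.
    return ids[: max_length - 1] + [eos_id]


EOS = 1  # assumed id, for illustration only
print(ensure_eos([5, 6, 7], EOS, max_length=20))
print(ensure_eos([5, 6, 7, 8, 9], EOS, max_length=3))
```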
Oh yeah, you're right! Completely missed it, and it does solve the generation issue after 50 steps basically.
step: 0 train_loss: 8.3875150680542 prediction: ['<pad> <img_alt=Tokyo is the cure for everything. img_src=']
step: 50 train_loss: 2.020235300064087 prediction: ['<pad> The model should overfit this sentence</s>']
step: 100 train_loss: 2.0110490322113037 prediction: ['<pad> The model should overfit this sentence</s>']
step: 150 train_loss: 1.728605031967163 prediction: ['<pad> The model should overfit this sentence</s>']
step: 200 train_loss: 1.678179144859314 prediction: ['<pad> The model should overfit this sentence</s>']
step: 250 train_loss: 1.6586235761642456 prediction: ['<pad> The model should overfit this sentence</s>']
step: 300 train_loss: 1.6816842555999756 prediction: ['<pad> The model should overfit this sentence</s>']
step: 350 train_loss: 1.6198171377182007 prediction: ['<pad> The model should overfit this sentence</s>']
step: 400 train_loss: 1.6187334060668945 prediction: ['<pad> The model should overfit this sentence</s>']
step: 450 train_loss: 1.6846977472305298 prediction: ['<pad> The model should overfit this sentence</s>']
step: 500 train_loss: 1.6047543287277222 prediction: ['<pad> The model should overfit this sentence</s>']
step: 550 train_loss: 1.585425853729248 prediction: ['<pad> The model should overfit this sentence</s>']
step: 600 train_loss: 1.5750995874404907 prediction: ['<pad> The model should overfit this sentence</s>']
step: 650 train_loss: 1.5516695976257324 prediction: ['<pad> The model should overfit this sentence</s>']
step: 700 train_loss: 1.5205081701278687 prediction: ['<pad> The model should overfit this sentence</s>']
step: 750 train_loss: 1.600045919418335 prediction: ['<pad> The model should overfit this sentence</s>']
step: 800 train_loss: 1.5451548099517822 prediction: ['<pad> The model should overfit this sentence</s>']
step: 850 train_loss: 1.602522373199463 prediction: ['<pad> The model should overfit this sentence</s>']
I think what remains weird is that the loss doesn't decrease below 1.5 even with that single training sample.
Anecdotally, I've been trying to fine-tune for some information extraction tasks, and I haven't been able to make it properly learn anything (I did check that there's an eos token in my labels when fine-tuning :) )
Indeed, the loss should go down to 0. I notice 2 things here:
Good catch, just tried without the label smoothing and the losses now look much more normal:
step: 0 train_loss: 7.458827972412109 prediction: ['<pad> <img_alt=Towards a New Vision: A Vision for a New World Order']
step: 50 train_loss: 0.12852047383785248 prediction: ['<pad> The model should overfit this sentence</s>']
step: 100 train_loss: 0.010209576226770878 prediction: ['<pad> The Model should overfit this sentence</s>']
step: 150 train_loss: 0.0012781125260517001 prediction: ['<pad> The model should overfit this sentence</s>']
step: 200 train_loss: 0.014641670510172844 prediction: ['<pad> The model should overfit this sentence</s>']
step: 250 train_loss: 6.366522575262934e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 300 train_loss: 0.0005338654736988246 prediction: ['<pad> The model should overfit this sentence</s>']
step: 350 train_loss: 0.004032869823276997 prediction: ['<pad> The model should overfit this sentence</s>']
step: 400 train_loss: 3.196050602127798e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 450 train_loss: 1.0058114639832638e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 500 train_loss: 1.513927782070823e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 550 train_loss: 4.767631980939768e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 600 train_loss: 0.005966411903500557 prediction: ['<pad> The model should overfit this sentence</s>']
step: 650 train_loss: 9.983758673115517e-07 prediction: ['<pad> The model should overfit this sentence</s>']
step: 700 train_loss: 2.6761419576359913e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 750 train_loss: 0.03052591346204281 prediction: ['<pad> The model should overfit this sentence</s>']
step: 800 train_loss: 0.00021442778233904392 prediction: ['<pad> The model should overfit this sentence</s>']
step: 850 train_loss: 4.1449759009992704e-05 prediction: ['<pad> The model should overfit this sentence</s>']
step: 900 train_loss: 0.0005854590563103557 prediction: ['<pad> The model should overfit this sentence</s>']
step: 950 train_loss: 6.643687083851546e-05 prediction: ['<pad> The model should overfit this sentence</s>']
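For anyone wondering why label smoothing keeps the loss from reaching 0: with uniform label smoothing, the cross-entropy is minimised when the model reproduces the smoothed target distribution, and that minimum equals the entropy of the smoothed target. The sketch below computes this floor. The smoothing value 0.1 and the vocabulary size 50244 are assumptions, not values stated in this thread; under them the floor comes out at roughly 1.4, which would be in line with the plateau in the first log.

```python
import math


def smoothed_loss_floor(epsilon: float, vocab_size: int) -> float:
    """Smallest achievable cross-entropy against a uniformly label-smoothed target.

    The smoothed target puts (1 - epsilon) + epsilon / V on the true token and
    epsilon / V on every other token. The minimum of the cross-entropy is the
    entropy of that distribution.
    """
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    p_true = (1.0 - epsilon) + epsilon / vocab_size
    p_other = epsilon / vocab_size
    entropy = -p_true * math.log(p_true)
    if p_other > 0.0:
        entropy -= (vocab_size - 1) * p_other * math.log(p_other)
    return entropy


# Assumed values for illustration; check them against your own model config.
print(smoothed_loss_floor(epsilon=0.1, vocab_size=50244))
print(smoothed_loss_floor(epsilon=0.0, vocab_size=50244))
```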
Damn not sure why I didn't check the code of the loss calculation before training a model myself 🙈 hopefully this will also solve the fine-tuning runs on larger datasets
Trying it right now! Will keep you updated once I got the results back :)
From my experiment, the training loss on larger datasets is indeed getting much lower (expected) but it doesn't seem to be solving the issue.
Thanks everyone for digging into that. I feel we are close to solving the issue, so I propose we first turn
https://github.com/huggingface/transformers/issues/22903#issuecomment-1518275840
into a PR, so that at least the loss behaves more "normally". @arnaudstiegler, how much lower does the loss get compared to previous runs? Any curves/stats you can share? Thinking out loud, I was wondering whether your ultimate issue might be a hyperparameter issue.
Losses overall look okay (with and without the label smoothing), but there seems to be some disconnect between the loss (both training and validation) value I'm getting and the actual quality of the predicted string. A priori, that might indicate a bug somewhere in my training workflow, but I did check it thoroughly. I also did a bunch of experiments on a single training batch, and as you reported in the notebook, the model can collapse with the wrong hyperparameters, esp. if the target is a long string. Adding some warmup seems to help, but it still behaves in a surprising way even on a single training sample.
I'm actually trying to swap out Donut for Pix2Struct, and the Donut model hasn't shown any of the behavior or brittleness I'm seeing with Pix2Struct. You're probably right that there might be some hyperparameter issue, but given the "limited" size of the model, I'm really surprised that it's so sensitive to HPs. Would love to hear about other people's experiences with fine-tuning Pix2Struct.
I have also been trying to finetune pix2struct. I find that the losses go to zero very quickly which made me suspect that the attention masks are not being set properly.
What I see is that in the Pix2StructText module, self.config.is_decoder is set to False, causing this line to output a non-causal attention mask.
If I add the line self.config.is_decoder = True to the line above that to force it to be a decoder, things look more normal.
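As a rough illustration of why a non-causal mask would let the loss collapse so quickly: with teacher forcing, a decoder position that can attend to later positions can see the token it is supposed to predict. The sketch below builds both mask types with plain lists and counts how many "future" positions are visible. The function names are hypothetical and this is not the transformers mask code.

```python
from typing import List


def attention_mask(seq_len: int, causal: bool) -> List[List[int]]:
    """Build a mask where mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_future(mask: List[List[int]]) -> int:
    """Count visible (i, j) pairs with j > i, i.e. leaks of future tokens."""
    n = len(mask)
    return sum(mask[i][j] for i in range(n) for j in range(i + 1, n))


causal = attention_mask(4, causal=True)
full = attention_mask(4, causal=False)
for row in causal:
    print(row)
print(visible_future(causal), visible_future(full))
```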
Interesting! @arnaudstiegler can you try on your side this potential fix and let us know how it goes?
Yeah, the model seems to be learning well on >3k images dataset with the change on the decoder config. This seems to be the root cause. Really good catch @gbarello-uipath :)
Glad it's working for you @arnaudstiegler!
I don't have a lot of experience in the guts of the transformers repo (hence my hacky fix inside the forward function :) ). Could someone point me to the "right" place to make that fix? I looked into the configuration_pix2struct.py file, but haven't found the time yet to really dig down and actually fix it properly.
This is really cool!
@gbarello-uipath, I believe you would need to add the is_decoder=True keyword argument here: https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/pix2struct/configuration_pix2struct.py#L121
And also add it here as well (is_decoder=is_decoder) to fix the failing CI issues: https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/pix2struct/configuration_pix2struct.py#L147
Then get_attention_mask should be called properly as expected. I would also advise double-checking that everything works, just in case.
Let us know when you will open a Pull Request for that! Otherwise happy to do it as well
I would love to be an official contributor, even if it's just a one-line code change 😅 I will put together a PR shortly.
Awesome! Thanks again for the fix
Ok so I am working on this PR. It works fine when instantiating a brand new model, but when loading any of the pretrained models, the is_decoder=False flag is already saved in them, so the default kwarg gets overwritten.
I suppose there isn't really a way for me to fix that directly. Only thing I can think of is to load the model, manually fix the config, and then push that new model to the hub. Is that the best way to fix the pretrained models?
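To spell out why changing the default alone does not help already-published checkpoints, here is a toy sketch. ToyTextConfig is a hypothetical stand-in, not the transformers config class, and the hidden_size value is purely illustrative. The point is only that an explicitly saved value always wins over a constructor default, so the saved config itself has to be patched.

```python
from typing import Any, Dict


class ToyTextConfig:
    """Hypothetical stand-in for a text config; not the transformers class."""

    def __init__(self, is_decoder: bool = True, **kwargs: Any) -> None:
        # New default: behave as a decoder unless told otherwise.
        self.is_decoder = is_decoder
        self.extra = kwargs

    @classmethod
    def from_dict(cls, saved: Dict[str, Any]) -> "ToyTextConfig":
        # Values stored with a checkpoint are passed explicitly,
        # so they override any constructor default.
        return cls(**saved)


fresh = ToyTextConfig()

# What an older checkpoint might have stored (illustrative values).
saved_checkpoint_config = {"is_decoder": False, "hidden_size": 768}
loaded = ToyTextConfig.from_dict(saved_checkpoint_config)

# Patching the saved value is what actually changes the loaded behavior.
patched = ToyTextConfig.from_dict({**saved_checkpoint_config, "is_decoder": True})

print(fresh.is_decoder, loaded.is_decoder, patched.is_decoder)
```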
I see. The other solution would probably be to update the get_extended_mask method to accept a new optional argument that forces the decoder-like behavior, but I am not sure if this is the right fix. If the only solution is to update the models that are on the Hub, I am happy to update them. cc @sgugger
I think the pretrained model configs should be fixed directly.
Ok @younesbelkada I created the PR: https://github.com/huggingface/transformers/pull/23051
Hopefully I have done everything correctly :)
If there is a way for me to also fix the pre-trained model configs let me know, otherwise let me know when they are fixed!
Let's close this issue as we merged #23051 ! @NielsRogge has also made a nice tutorial in https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Pix2Struct Thanks everyone
@younesbelkada I shared a notebook on how to train Matcha/Pix2Struct model for Kaggle's Benetech competition, in case anyone is interested. This model achieved silver zone and includes the updates with the fix.
Thanks very much for sharing! It is really cool to see Matcha/Pix2Struct being used for winning notebooks in major Kaggle competitions 🔥
System Info
transformers version: 4.28.0
Who can help?
@younesbelkada
Reproduction
Here's the minimal training loop:
Here's the output I got:
Expected behavior
I've been trying to fine-tune Pix2Struct starting from the base pretrained model, and have been unable to do so. The model collapses consistently and fails to overfit on that single training sample. I noticed a comment about this on the fine-tuning notebook: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb
To dig a little deeper, I've been trying to train on a single training sample with a minimal training loop, and see whether the model was able to correctly learn that single training sample. It seems that it's not able to overfit on a single training sample after 1000 training steps. Unless I missed something in my training loop, that seems like a weird behavior and might be a symptom of a bug somewhere?