ivelin / donut_ui_refexp

Fine-tuning Donut transformers for the UI Referring Expressions task
Apache License 2.0

Tokenizer returns <UNK> token for certain coordinates #1

Open morganmcg1 opened 1 year ago

morganmcg1 commented 1 year ago

Hey! Enjoying playing around with your notebooks, thanks for sharing!

Have you noticed that the tokenizer seems to struggle with some bounding box values and returns an <unk> token?

inp_txt = "<s_ymax>0.1</s_ymax>"

input_ids = processor.tokenizer(
              inp_txt,
              add_special_tokens=False,
              max_length=30,
              padding="max_length",
              truncation=True,
              return_tensors="pt",
          )["input_ids"].squeeze(0)

will return:

tensor([57535, 50891, 39539,     3, 57536,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1])

Note the "unk" token 3

Checking values to 3 decimal places, these input strings all produce an <unk> token:

<s_ymax>0.1</s_ymax>
<s_ymax>0.121</s_ymax>
<s_ymax>0.131</s_ymax>
<s_ymax>0.161</s_ymax>
<s_ymax>0.171</s_ymax>
<s_ymax>0.181</s_ymax>
<s_ymax>0.191</s_ymax>

Pinpointing the exact problematic part of each string:

<s_ymax>0.1</s_ymax>
['<s_ymax>', '0', '.', '<unk>', '</s_ymax>', '<pad>', '<pad>']
tensor([57535, 50891, 39539,     3, 57536,     1,     1])

<s_ymax>0.121</s_ymax>
['<s_ymax>', '0', '.', '12', '<unk>', '</s_ymax>', '<pad>']
tensor([57535, 50891, 39539, 12569,     3, 57536,     1])

<s_ymax>0.131</s_ymax>
['<s_ymax>', '0', '.', '13', '<unk>', '</s_ymax>', '<pad>']
tensor([57535, 50891, 39539, 34532,     3, 57536,     1])

<s_ymax>0.161</s_ymax>
['<s_ymax>', '0', '.', '16', '<unk>', '</s_ymax>', '<pad>']
tensor([57535, 50891, 39539, 32557,     3, 57536,     1])

<s_ymax>0.171</s_ymax>
['<s_ymax>', '0', '.', '17', '<unk>', '</s_ymax>', '<pad>']
tensor([57535, 50891, 39539, 32827,     3, 57536,     1])

<s_ymax>0.181</s_ymax>
['<s_ymax>', '0', '.', '18', '<unk>', '</s_ymax>', '<pad>']
tensor([57535, 50891, 39539, 18590,     3, 57536,     1])

<s_ymax>0.191</s_ymax>
['<s_ymax>', '0', '.', '19', '<unk>', '</s_ymax>', '<pad>']
tensor([57535, 50891, 39539, 35593,     3, 57536,     1])

It struggles with: 0.1, 0.121, 0.131, 0.161, 0.171, 0.181, 0.191
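
A quick way to find every affected value is to scan the full 3-decimal range and flag any string whose encoding contains the <unk> id. A minimal sketch, assuming `processor` is the DonutProcessor loaded above:

unk_id = processor.tokenizer.unk_token_id
problematic = []
for i in range(1000):
    # format like the examples above, e.g. 0.1 rather than 0.100 (an assumption)
    val = f"{i / 1000:.3f}".rstrip("0").rstrip(".")
    ids = processor.tokenizer(f"<s_ymax>{val}</s_ymax>",
                              add_special_tokens=False)["input_ids"]
    if unk_id in ids:
        problematic.append(val)
print(problematic)  # expected to include 0.1, 0.121, 0.131, ...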

ivelin commented 1 year ago

Yes, there were more <unk> tokens in the early epochs of training. Over time the model seems to learn that they are the wrong choice.

I've trained this model for a few days on a Colab Standard GPU. I only had limited time and GPU resources for this experiment; I applied for a HF community grant but have not heard back.

At the time the latest checkpoint was pushed, the model was nowhere near convergence. Hopefully, if people are interested in it, someone will chip in with further training and bug fixes.

It can be argued that a dual vision-language encoder model is a more popular architecture for this task (open-vocabulary object detection). My rationale was that if we can get the model to learn bounding boxes, it can be extended to downstream tasks such as action classification plus bounding box, and even input values, in one inference pass instead of using multiple models for each task. Further, it could potentially learn the reciprocal task of producing an unambiguous component referring expression given a bounding box and a screenshot. All just theories for further research. :)

morganmcg1 commented 1 year ago

Love it! I'm spending a little time on this in my spare time. Right now I'm just verifying the data processing steps and pulling a bunch of them out of the PyTorch dataset class to simplify things a little. I'll share some work when I have it. For the <unk> tokens above, I'm adding them to the tokenizer to see if it makes a difference. I'm guessing it won't, but it nags me that there might be some coordinates that the model will ignore :D
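
A sketch of that workaround, assuming `processor` and the fine-tuning `model` (a VisionEncoderDecoderModel) are in scope; the token list is illustrative and would come from a scan like the one above:

# Register the digit pieces that currently map to <unk>, then grow
# the decoder embedding matrix to match the enlarged vocabulary.
num_added = processor.tokenizer.add_tokens(["1"])
if num_added > 0:
    model.decoder.resize_token_embeddings(len(processor.tokenizer))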

ivelin commented 1 year ago

Sounds great. Happy to collaborate. Feel free to submit a PR if you like. I should be able to review within a day or two.

BTW, <unk> is already a known token from the original Donut model.

In fact, this reminds me that the current training assumes only positive samples: every dataset sample has a bounding box. In reality a refexp may be nonsensical, in which case the model should respond with <unk>. We could add some contrastive learning via data augmentation on a subset of the training data on the fly: crop out a big enough area around the target bounding box and replace the label with <unk>, so that the refexp becomes meaningless.
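
A rough sketch of that augmentation (the function name and white-fill strategy are placeholders; bbox is in normalized coordinates):

from PIL import Image, ImageDraw

def make_negative(image: Image.Image, bbox, margin=0.05):
    # Blank out the target region plus a margin so the refexp no
    # longer matches anything on screen; the label becomes <unk>.
    img = image.copy()
    w, h = img.size
    ImageDraw.Draw(img).rectangle(
        [
            max(0.0, bbox[0] - margin) * w,
            max(0.0, bbox[1] - margin) * h,
            min(1.0, bbox[2] + margin) * w,
            min(1.0, bbox[3] + margin) * h,
        ],
        fill="white",
    )
    return img, "<unk>"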

Yet another related issue I just remembered: there are samples in the synthetic RICO SCA dataset (starting at around index 16,000) whose referring expressions point to text labels that occur multiple times in the screenshot. We could look at removing them so they don't confuse the model, or potentially allow the model to predict multiple bounding boxes.
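
A filter along these lines could work (the field names are guesses at the dataset schema, not the real ones):

def is_ambiguous(sample):
    # Hypothetical fields: "ui_elements" with per-element "text",
    # and "target_text" for the refexp's target label.
    texts = [el["text"] for el in sample["ui_elements"]]
    return texts.count(sample["target_text"]) > 1

dataset = [s for s in dataset if not is_ambiguous(s)]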

morganmcg1 commented 1 year ago

Oh yeah, negatives are a good idea.

I opened a PR on transformers to fix training with batch sizes > 1, which got merged: huggingface/transformers#21582

I also opened one on the HF Hub to correct the normalization in the image_processor, since from what I can see the original authors used ImageNet normalization.

I'm going to do all the image processing without the image_processor, as it was also messing up the bounding box positions when images were resized.
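
Roughly what I mean: a sketch that resizes and pads while keeping a normalized (xmin, ymin, xmax, ymax) box in sync with the pixels (the target size is an assumption):

from PIL import Image

def resize_with_bbox(image: Image.Image, bbox, target=(1280, 960)):
    # Resize preserving aspect ratio, pad to `target` with white,
    # and rescale the normalized box onto the padded canvas.
    w, h = image.size
    scale = min(target[0] / w, target[1] / h)
    nw, nh = int(w * scale), int(h * scale)
    canvas = Image.new("RGB", target, "white")
    canvas.paste(image.resize((nw, nh)), (0, 0))
    sx, sy = nw / target[0], nh / target[1]
    return canvas, (bbox[0] * sx, bbox[1] * sy, bbox[2] * sx, bbox[3] * sy)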

ivelin commented 1 year ago

> Fix for non-contiguous label tensors in VisonEncoderDecoder huggingface/transformers#21582

So cool. I ran into it but wasn't sure how to fix it correctly, so I used grad_accumulation_batches as a workaround.
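
For reference, the workaround pattern (PyTorch Lightning spelling shown; the notebook's exact argument name may differ):

import pytorch_lightning as pl

# Keep the per-step batch size at 1 and accumulate gradients across
# 8 steps for an effective batch of 8. Values are illustrative.
trainer = pl.Trainer(accumulate_grad_batches=8)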

ivelin commented 1 year ago

> I'm going to do all the image processing without the image_processor, as it was also messing up the bounding box positions when images were resized.

So glad you caught this. I suspected the resizing was leaving out portions of the screenshots in some cases.