huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Fine tune TrOCR for persian language #18163

Closed PersianSpock closed 2 years ago

PersianSpock commented 2 years ago

Model description

Hello! I'm a newbie and I'm trying to use TrOCR to recognize Persian digital text (e.g. from PDFs) in images. I don't know what the requirements are if I want to fine-tune a pre-trained TrOCR model but with a multilingual cased decoder. I've followed this post https://github.com/huggingface/transformers/issues/15823, but with the information given there it doesn't work out for Persian. I've also seen that there are some models at https://huggingface.co/models?language=fa&sort=downloads, but I can't figure out how to use them. Please guide me on how I should proceed.

Open source status

Provide useful links for the implementation

No response

NielsRogge commented 2 years ago

Hi,

I explain how to train TrOCR on a different language here: https://github.com/huggingface/transformers/issues/14195#issuecomment-1039204836
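
In short, the recipe there is to warm-start a VisionEncoderDecoderModel from a pretrained image encoder and a pretrained multilingual text decoder, and to pair it with a processor built from the matching feature extractor and tokenizer. A minimal sketch of that setup (the checkpoint names are just one possible choice; the special-token assignments follow the usual VisionEncoderDecoder fine-tuning notebooks):

from transformers import (
    AutoTokenizer,
    TrOCRProcessor,
    VisionEncoderDecoderModel,
    ViTFeatureExtractor,
)

# warm-start the encoder-decoder model from two pretrained checkpoints
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "xlm-roberta-base"
)

# the processor bundles the feature extractor (images) and the tokenizer (text)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
processor = TrOCRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# tell the decoder which special tokens to use during training and generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size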

PersianSpock commented 2 years ago

Hi Niels! Thank you for your response. The thing is that I use:

import torch
from transformers import VisionEncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "xlm-roberta-base"
)
model.to(device)

And I use your Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_native_PyTorch code, and at the end of the run I get this error:

ValueError: Input image size (384*384) doesn't match model (224*224).

What's wrong?

NielsRogge commented 2 years ago

It seems that the images you provide are of size 384x384, but the model (the ViT encoder) expects them to be of size 224x224.
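
One way to avoid the mismatch is to not resize the image files yourself and instead let the feature extractor of the encoder checkpoint prepare the pixel values, since it resizes to whatever resolution that checkpoint was trained with. A minimal sketch (the image path is just a placeholder):

from PIL import Image
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
image = Image.open("example.jpg").convert("RGB")  # placeholder path
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])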

PersianSpock commented 2 years ago

I changed the image size, but it still gives the same error.

PersianSpock commented 2 years ago

import os
from PIL import Image

path = '/content/drive/MyDrive/data_test/image/'
new_path = '/content/drive/MyDrive/data_test/newimage/'
dirs = os.listdir(path)

def resize():
    for item in dirs:
        source = path + item
        newsource = new_path + item
        im = Image.open(source)
        # Image.ANTIALIAS is called Image.LANCZOS in newer Pillow versions
        imResize = im.resize((224, 224), Image.ANTIALIAS)
        imResize.save(newsource)

resize()

PersianSpock commented 2 years ago

This part still gives 384, 384:

encoding = train_dataset[0]
for k,v in encoding.items():
  print(k, v.shape)
encoding = eval_dataset[0]
for k,v in encoding.items():
  print(k, v.shape)

PersianSpock commented 2 years ago

The problem in the error seems to come from the ViT encoder, and in your own code the training images are 384*384, as the last piece of code I posted shows. What's wrong?

NielsRogge commented 2 years ago

The model I'm fine-tuning in my notebook expects images to be of size 384, as seen here.
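
A quick way to check which sizes are in play (a sketch, assuming the model and processor variables from the snippets above; the exact attribute types can vary between versions):

print(model.config.encoder.image_size)   # resolution the ViT encoder was pretrained with, e.g. 224
print(processor.feature_extractor.size)  # resolution the feature extractor resizes images to

If these two don't match, you get exactly this kind of ValueError.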

PersianSpock commented 2 years ago

I used "google/vit-base-patch16-224-in21k" and "xlm-roberta-base". the first one you suggested in https://github.com/huggingface/transformers/issues/14195#issuecomment-1039204836 what is the issue that says the model has the picture of size 224*224?

NielsRogge commented 2 years ago

Yes, google/vit-base-patch16-224-in21k expects images to be of size 224, but you're resizing the images to 384.

PersianSpock commented 2 years ago

Thank you, it got solved! How low should my validation CER be at the end? What range is good enough?

PersianSpock commented 2 years ago

I'm fine-tuning TrOCR for the Farsi language. I did it once using your code and it was fine, but now, with another, larger dataset, I get different label sizes and it's a problem. After this part:

encoding = train_dataset[0]
for k, v in encoding.items():
    print(k, v.shape)
encoding = eval_dataset[0]
for k, v in encoding.items():
    print(k, v.shape)

I get:

pixel_values torch.Size([3, 224, 224])
labels torch.Size([261])
pixel_values torch.Size([3, 224, 224])
labels torch.Size([272])

The label tensor sizes are not the same, although I'm using https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR, and in that code it says the max_length for labels should be 128. How can I change the code so the labels have the same size for all of the data?

NielsRogge commented 2 years ago

How low should my validation CER be at the end?

CER (character error rate) is a number between 0 and 1, the closer to 0 the better.
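
For reference, the evaluate library exposes this metric (a minimal sketch, assuming evaluate and jiwer are installed; the strings are just placeholder examples):

import evaluate

cer_metric = evaluate.load("cer")
predictions = ["recognized text"]  # what the model predicted
references = ["recognised text"]   # the ground-truth labels
print(cer_metric.compute(predictions=predictions, references=references))

A CER of 0.05, for instance, means roughly 5 out of every 100 characters are wrong.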

Regarding the labels, you need to make sure each target sequence gets padded/truncated to the same length, to make batching possible.

PersianSpock commented 2 years ago

I'm using your own code. It has:

labels = self.processor.tokenizer(text, padding="max_length", max_length=self.max_target_length).input_ids

and self.max_target_length = 128.

How am I getting different numbers?

NielsRogge commented 2 years ago

Yes, it doesn't have truncation=True, which you need to add.
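
With that change, the line from the dataset class becomes (same names as in the snippet above):

labels = self.processor.tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=self.max_target_length,
).input_ids

Sequences longer than max_target_length are then cut off and shorter ones are padded, so every example ends up with labels of the same length.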

NielsRogge commented 2 years ago

Note that the sequence length of 128 was just a choice; you can set it to whatever you think is needed for the language you're training on. If you're training on very long sentences, you might need to increase it.

PersianSpock commented 2 years ago

Thank you so much it worked out.

jonas-da commented 2 years ago

@PersianSpock which processor do you use for training on another language?

Do you use a processor built up from the same encoder and decoder, or do you use the handwritten stage 1 processor, which is already pre-trained?

It would really help if you could post your model and processor initialization, and maybe also your config. Thank you!

PersianSpock commented 2 years ago

@jonas-da it's described here: https://huggingface.co/docs/transformers/main/model_doc/trocr#transformers.TrOCRProcessor

Since I am using xlm-roberta-large, I do it like this:

from transformers import AutoTokenizer, TrOCRProcessor, ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

processor = TrOCRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

jonas-da commented 2 years ago

Ah, thank you @PersianSpock!

And, as mentioned above, you use

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-224-in21k", "xlm-roberta-base")

as the model, right?

One more question: how much training data do you use, and what CER did you achieve?

Thank you very much!

PersianSpock commented 2 years ago

I used both the base and the large versions, and for the data I used 7000 samples; that still wasn't enough, and I think I should use more.

NielsRogge commented 2 years ago

Closing this issue as it seems resolved.

jasmine400 commented 1 year ago

@PersianSpock how did you prepare the dataset to train TrOCR on another language?

Ulduzpp commented 9 months ago

Thank you, it got solved! How low should my validation CER be at the end? What range is good enough?

Hi, may I ask how you solved it? I have the same problem, but I got stuck and don't know what to do.

CrasCris commented 8 months ago

How did you change the input size for the model? I got this error: "ValueError: Input image size (384*384) doesn't match model (224*224)."