Hi,
I explain how to train TrOCR on a different language here: https://github.com/huggingface/transformers/issues/14195#issuecomment-1039204836
Hi Niels! Thank you for your response. The thing is that I use:
import torch
from transformers import VisionEncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-224-in21k", "xlm-roberta-base")
model.to(device)
And I use your Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_native_PyTorch code, and at the end of the run my last result is this error:
ValueError: Input image size (384*384) doesn't match model (224*224).
What's wrong?
It seems that the images you provide are of size 384x384, but the model (the ViT encoder) expects them to be of size 224x224.
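A quick way to verify this (a minimal check, assuming the train_dataset and model objects from the notebook):

encoding = train_dataset[0]
print(encoding["pixel_values"].shape)   # e.g. torch.Size([3, 384, 384])
print(model.config.encoder.image_size)  # 224 for google/vit-base-patch16-224-in21k

If those two numbers don't match, the forward pass raises exactly that ValueError.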
I changed the image sizes but it still says that:
import os
from PIL import Image

path = '/content/drive/MyDrive/data_test/image/'
new_path = '/content/drive/MyDrive/data_test/newimage/'
dirs = os.listdir(path)

def resize():
    for item in dirs:
        source = path + item
        newsource = new_path + item
        im = Image.open(source)
        imResize = im.resize((224, 224), Image.ANTIALIAS)
        imResize.save(newsource)

resize()
This part still gives 384, 384:
encoding = train_dataset[0]
for k, v in encoding.items():
    print(k, v.shape)

encoding = eval_dataset[0]
for k, v in encoding.items():
    print(k, v.shape)
The problem seems to come from the ViT encoder, and in your own code the training set is 384*384, as the last piece of code I posted shows. What's wrong?
I used "google/vit-base-patch16-224-in21k" and "xlm-roberta-base". the first one you suggested in https://github.com/huggingface/transformers/issues/14195#issuecomment-1039204836 what is the issue that says the model has the picture of size 224*224?
Yes, google/vit-base-patch16-224-in21k expects images to be of size 224, but you're resizing the images to 384.
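If you don't want to resize the files on disk, one option (just a sketch, assuming you build the processor yourself) is to let the feature extractor do the resizing, since the feature extractor of that checkpoint already resizes to 224x224:

from transformers import ViTFeatureExtractor

# do_resize is True by default and size is 224 for this checkpoint,
# so pixel_values will come out as (3, 224, 224)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

The important thing is that whatever produces your pixel_values resizes to the size the encoder expects.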
Thank you, it got solved! How much should my validation CER be at the end? What range is good enough?
I'm fine-tuning TrOCR for the Farsi language. I did it once using your code and it was OK, but now with another, larger dataset I get different label sizes, and it's a problem.
After this part:
encoding = train_dataset[0]
for k, v in encoding.items():
    print(k, v.shape)

encoding = eval_dataset[0]
for k, v in encoding.items():
    print(k, v.shape)
I get:
pixel_values torch.Size([3, 224, 224])
labels torch.Size([261])
pixel_values torch.Size([3, 224, 224])
labels torch.Size([272])
The label tensor sizes are not the same, although I'm using https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR and in the code it says that the max_length for labels should be 128. How can I change the code so the labels are the same size for all of the data?
> How much should my validation CER be at the end?
CER (character error rate) is a number between 0 and 1, the closer to 0 the better.
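For reference, computing it during evaluation can be done with the cer metric (roughly what the tutorial notebook does; the strings below are just dummy examples):

from datasets import load_metric

cer_metric = load_metric("cer")
# one character missing out of 11 reference characters -> CER of about 0.09
cer = cer_metric.compute(predictions=["hello wrld"], references=["hello world"])
print(cer)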
Regarding the labels, you need to make sure each target sequence gets padded/truncated to the same length, to make batching possible.
I'm using your own code. It has:
labels = self.processor.tokenizer(text, padding="max_length", max_length=self.max_target_length).input_ids
and
self.max_target_length = 128
How am I getting different numbers?
Yes, it doesn't have truncation=True, which you need to add.
Note that the sequence length of 128 was just a choice, you can set it to whatever you think is needed for the language you're training on. If you're training on very long sentences, you might need to increase it.
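Concretely, the line in the dataset class would become something like (same snippet as above, just with truncation added):

labels = self.processor.tokenizer(text,
                                  padding="max_length",
                                  truncation=True,
                                  max_length=self.max_target_length).input_ids

That way every target sequence is padded or cut to max_target_length, so all label tensors have the same shape and batching works.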
Thank you so much, it worked out.
@PersianSpock which processor do you use for training on another language?
Do you use a processor which is built up of the same encoder and decoder, or do you use the handwritten stage 1 processor, which is pre-trained already?
It would really help if you could post your model and processor initialization, and maybe also your config. Thank you!
@jonas-da it says here: https://huggingface.co/docs/transformers/main/model_doc/trocr#transformers.TrOCRProcessor
Since I am using xlm-roberta-large, I do it like this:
from transformers import ViTFeatureExtractor, AutoTokenizer, TrOCRProcessor

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
processor = TrOCRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
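For the model config, something along these lines is also needed so that training and generation work, mirroring the tutorial (a sketch, assuming the processor defined above):

model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size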
Ah thank you! @PersianSpock
And, as mentioned above, you use
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-224-in21k", "xlm-roberta-base")
as the model, right?
One more question: how much training data did you use, and what CER did you achieve?
Thank you very much!
I used both base and large, and for the data I used 7000 samples; it still wasn't enough, so I think I should use more.
Closing this issue as it seems resolved.
@PersianSpock how did you prepare the dataset to train TrOCR on another language?
> Thank you, it got solved! How much should my validation CER be at the end? What range is good enough?

Hi, may I ask how you solved it? I have the same problem, but I got stuck and don't know what to do.
How did you change the input size for the model? I got this error: "ValueError: Input image size (384*384) doesn't match model (224*224)."
Model description
Hello! I'm a newbie and I am trying to use TrOCR to recognize Persian digital text (like PDFs) from images. I don't know what the requirements will be if I want to fine-tune a pre-trained TrOCR model but with a multilingual cased decoder. I've followed this post https://github.com/huggingface/transformers/issues/15823, but it doesn't work out for Persian with the info they gave. Please guide me on how I should proceed. I've seen that there are some models at https://huggingface.co/models?language=fa&sort=downloads, but I can't figure out how to use them. Please guide me.
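From the answers above, my understanding is that the setup would look roughly like this (the multilingual decoder checkpoint below is just an example; please correct me if this is the wrong idea):

from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer, TrOCRProcessor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
processor = TrOCRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "bert-base-multilingual-cased"
)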