Hi,
In case you want to train a TrOCR model on another language, you can warm-start (i.e. initialize the weights of) the encoder and decoder with pretrained weights from the hub, as follows:
from transformers import VisionEncoderDecoderModel
# initialize the encoder from a pretrained ViT and the decoder from a pretrained language model.
# Note that the cross-attention layers will be randomly initialized and need to be fine-tuned on a downstream dataset.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
"google/vit-base-patch16-224-in21k", "urduhack/roberta-urdu-small"
)
Here, I'm initializing the weights of the decoder from a RoBERTa language model trained on the Urdu language from the hub.
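(For completeness, a sketch of the special-token configuration the fine-tuning tutorial applies after warm-starting; the exact values here are illustrative:)

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

# tell the model how to build decoder_input_ids from the labels
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

# generation settings (illustrative values)
model.config.eos_token_id = tokenizer.sep_token_id
model.config.max_length = 64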
Thank you @NielsRogge. I was not adding the last two configurations. Luckily, it's working now. Thanks a lot again :)
Hey there, I want to know whether not using the processor will affect training accuracy. I've tried to replace the TrOCR processor with a ViT feature extractor and a RoBERTa tokenizer, as follows:
import torch
from torch.utils.data import Dataset
from PIL import Image

class IAMDataset(Dataset):
    def __init__(self, root_dir, df, feature_extractor, tokenizer, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.feature_extractor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.tokenizer(text, padding="max_length", max_length=self.max_target_length).input_ids
        # make sure PAD tokens are ignored by the loss
        labels = [label if label != self.tokenizer.pad_token_id else -100 for label in labels]
        encoding = {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
        return encoding
After training on 998 image-text pairs (IAM Handwriting), the model can't even recognize text from a training image. Is it related to the size of the training dataset, or is the processor important for the OCR use case?
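(For reference, a minimal sketch of instantiating such a dataset without the TrOCR processor; the checkpoint names, the path, and train_df are assumptions:)

from transformers import ViTFeatureExtractor, RobertaTokenizer

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

train_dataset = IAMDataset(root_dir="/content/images/", df=train_df,
                           feature_extractor=feature_extractor,
                           tokenizer=tokenizer)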
Hi,
Are you using the following tokenizer?

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

Because that tokenizer is required for the model to work on another language.
Yes, I'm using ViT + urduhack/roberta as encoder and decoder. For testing purposes, I've trained this model on 20 image-text pairs. When I try to recognize text from an image, the output text is composed of repeating words, as shown in the image. I know the training sample is below the requirement; please point out what I'm doing wrong when recognizing text from an image:
model = VisionEncoderDecoderModel.from_pretrained("./wo_processor")
image = Image.open('/content/10.png').convert("RGB")
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text= decoder_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
Which version of Transformers are you using? The eos_token_id must be properly set in the decoder; we recently fixed the generate() method to take the eos_token_id of the decoder into account (see #14905).
Can the model properly overfit the 20 image-text pairs?
Transformers version: 4.17.0. Currently, model.config.eos_token_id = decoder_tokenizer.sep_token_id is set.
Can the model properly overfit the 20 image-text pairs?
Is the image you're testing on included in the 20 pairs?
No, the model is not generating correct text even for an image from the training set.
OK, then I suggest first debugging that; see also this post.
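(A minimal overfitting sanity check along those lines; train_dataloader and the values below are assumptions, not code from the thread:)

import torch

# hypothetical sanity check: try to overfit a single batch
batch = next(iter(train_dataloader))  # assumes a DataLoader over the dataset above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for step in range(100):
    outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 10 == 0:
        print(step, loss.item())
# if the loss does not approach zero even on one batch, the data preparation
# or token configuration is likely off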
Hey @NielsRogge, after double-checking the image-text list and saving the model via trainer.save_model, the code is running and the output is as expected. Thanks for all the guidance. 👍
Hi @NielsRogge. Thank you for your wonderful tutorial on fine-tuning TrOCR. I am trying to tune TrOCR for the Arabic language. I have collected and arranged the data as explained in your tutorial. Which pre-trained model do I need to use? Like the one you mentioned above for Urdu, are any pretrained weights available for Arabic?
Hi,
You can filter on your language by clicking on the "models" tab, then selecting a language on the left: https://huggingface.co/models?language=ar&sort=downloads
So you can for instance initialize the weights of the decoder with those of https://huggingface.co/aubmindlab/bert-base-arabertv02
Hi @NielsRogge Thanks a lot for the tutorial. When using VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable?
Hi!
When using VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable?
All weights are updated! You initialize the weights of the encoder with those of a pre-trained vision encoder (like ViT), initialize the weights of the decoder with those of a pre-trained text model (like BERT, GPT-2) and randomly initialize the weights of the cross-attention layers in the decoder. Next, all weights are updated based on a labeled dataset of (image, text) pairs.
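(A quick sketch to verify that nothing is frozen; this assumes a model built as above:)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,}")  # equal by default: all weights train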
Thanks for your answer @NielsRogge
I have another question, about the configuration cell in the TrOCR tutorial, the cell in which we set special tokens etc.
Is it normal that, once we set model.config.decoder_start_token_id to processor.tokenizer.cls_token_id and go back to the model (still stage 1, no fine-tuning performed), the pretrained model's output changes (for the worse)?
Example:
I have the following image: I apply the pretrained model (TrOCR-base-stage1) to this image of printed text; it works fine, and the generated ids are:
tensor([[ 2, 417, 108, 879, 3213, 4400, 10768, 438, 1069, 21454,
579, 293, 4685, 4400, 40441, 20836, 5332, 2]])
I notice that it starts and ends with the ID 2 (does that mean that cls = eos in the pretraining phase?).
When these generated_ids are decoded, I get exactly what's on the image:
d’un accident et décrivant ses causes et circonstances
But once I run the configuration cell (specifically the line setting model.config.decoder_start_token_id), the generated_ids from the same image become:
tensor([[0, 4, 2]])
which is just a dot when decoded by the tokenizer. I want to know if this is normal/expected behavior.
Yeah that definitely will change behaviour. If you check
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")
print(model.config.decoder.decoder_start_token_id)
you'll see that it's set to 2.
However, if you set it to processor.tokenizer.cls_token_id, then you set it to 0. But the model was trained with ID=2 as the decoder start token ID.
Hi @NielsRogge, thank you for your detailed responses. I followed your reply to my query and used your tutorial to train the TrOCR model.
My objective is to recognize Arabic characters from license plate images. I have segmented the words using EAST. A few words are displayed below:
I have trained TrOCR on my custom data (Arabic alphabets). There are 29 alphabet characters in the dataset (only one alphabet per image). Combinations of basic alphabets lead to new letters; I need to include these letters in the dataset as well.
The dataset contains 2,900 images (100 images for each alphabet).
I have used the following models for encoder and decoder:
I changed max_length to 4 and n_gram_size to 1. I have not changed vocab_size.
I trained the model for 3 epochs. The CER is fluctuating: at step 1400 I got a CER of 0.54; after that it increased and fluctuated, but never dropped below 0.54. With the saved model, I tested images containing a single character, and it does reasonably well at predicting that character. But when I give it multiple characters in a single image, it fails miserably.
[Three single-character test images] -- each predicted correctly, printing the correct text
[Multi-character test image] -- not predicting anything, returning null
I am using the following code to predict the above images with the trained model:
Please let me know the mistake I am making. Is it because I am training the model on individual character images rather than word images? Do I need to make some modifications to the config settings, or do something with the tokenizer?
I have attached my training code link here: https://colab.research.google.com/drive/11ARSwRinMj4l8qwGhux074G6RL9hCdrW?usp=sharing
Hi,
If you're only training the model on individual character images, then I'm pretty sure it won't be able to produce a reasonable prediction when you give it a sequence of characters at inference time. You would have to train on sequences of characters as well.
Also, a max length of 4 is rather low; I would increase it during training.
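(A sketch of what raising the length cap could look like; 32 is an assumed value, and root_dir/train_df/feature_extractor/tokenizer are placeholders:)

# allow longer target sequences at generation time (assumed value)
model.config.max_length = 32
# and encode labels up to the same length in the dataset
train_dataset = IAMDataset(root_dir=root_dir, df=train_df,
                           feature_extractor=feature_extractor,
                           tokenizer=tokenizer,
                           max_target_length=32)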
Thank you. I will collect images of words and train again. Any suggestion on the minimum number of images we need for reasonable training?
I would start with at least 100 (image, text) pairs, and as usual, the more data you have, the better.
you'll see that it's set to 2. However, if you set it to processor.tokenizer.cls_token_id, then you set it to 0. But the model was trained with ID=2 as the decoder start token ID.
Can you explain why model.config.decoder_start_token_id (null) is set to processor.tokenizer.cls_token_id (0) and not to model.decoder.config.decoder_start_token_id (2)? The latter seems to me like the less confusing option (for the model).
@Samreenhabib hi, can I get your contact? I want to ask more about fine-tuning for the multilingual case.
Can you explain why model.config.decoder_start_token_id (null) is set to processor.tokenizer.cls_token_id (0) and not to model.decoder.config.decoder_start_token_id (2)? The latter seems to me like the less confusing option (for the model).
I think that's because the TrOCR authors initialized the decoder with the weights of RoBERTa, an encoder-only Transformer model. Hence, they used the CLS token as start token.
Hi @NielsRogge, I'm having a problem with fine-tuning base TrOCR. I launched the training two times, and each time it stopped at an intermediate training step, raising the error:
RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [152] at entry 3
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<command-2931901844376276> in <module>
11 )
12
---> 13 training_results = trainer.train()
/databricks/python/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1394
1395 step = -1
-> 1396 for step, inputs in enumerate(epoch_iterator):
1397
1398 # Skip past any already trained steps if resuming training
/databricks/python/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
/databricks/python/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
473 def _next_data(self):
474 index = self._next_index() # may raise StopIteration
--> 475 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
476 if self._pin_memory:
477 data = _utils.pin_memory.pin_memory(data)
/databricks/python/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)
/databricks/python/lib/python3.7/site-packages/transformers/data/data_collator.py in default_data_collator(features, return_tensors)
64
65 if return_tensors == "pt":
---> 66 return torch_default_data_collator(features)
67 elif return_tensors == "tf":
68 return tf_default_data_collator(features)
/databricks/python/lib/python3.7/site-packages/transformers/data/data_collator.py in torch_default_data_collator(features)
126 if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
127 if isinstance(v, torch.Tensor):
--> 128 batch[k] = torch.stack([f[k] for f in features])
129 else:
130 batch[k] = torch.tensor([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [152] at entry 3
I'm not sure if it's from the tokenizer or the feature extractor (both inside the TrOCR processor from the tutorial), or whether it's because we are calling the dataset's label output labels instead of label/label_ids (see the last if statement in the traceback). I don't want to use more compute to test all my hypotheses. Can you help me with this one? I hope you're more familiar with the internals of the seq2seq trainer. Thank you very much in advance.
I see we are using the default_data_collator, but our dataset returns labels instead of label_ids; could that be the problem? https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.default_data_collator
But it doesn't explain why one of the "features" would be of size 152.
Hi, the default_data_collator just stacks the pixel values and labels along the first (batch) dimension. However, in order to stack tensors, they all need to have the same shape. It seems like you didn't truncate some labels (which are the input_ids of the encoded text). Can you verify that you pad + truncate when creating the labels?
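(Concretely, in the dataset's __getitem__ above, the label encoding would need truncation enabled as well; a sketch:)

labels = self.tokenizer(text,
                        padding="max_length",
                        truncation=True,  # without this, texts longer than max_target_length yield longer label tensors
                        max_length=self.max_target_length).input_ids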
Hi @Samreenhabib, kindly share with me the model configuration for the Urdu image data. Thanks.
Hey @NielsRogge, I am stuck at one point and need your help. I trained a tokenizer on Urdu text lines using the following:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(files=paths, vocab_size=8192, min_frequency=2,
                show_progress=True,
                special_tokens=[
                    "<s>",
                    "<pad>",
                    "</s>",
                    "<unk>",
                    "<mask>",
                ])
tokenizer.save_model(tokenizer_folder)
Here are the encoder and decoder configurations:

from transformers import (RobertaConfig, RobertaModel, RobertaTokenizerFast,
                          ViTConfig, ViTModel, VisionEncoderDecoderModel)

config = RobertaConfig(
    vocab_size=8192,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
decoder = RobertaModel(config=config)
decoder_tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_folder, max_len=512)
decoder.resize_token_embeddings(len(decoder_tokenizer))

encoder_config = ViTConfig(image_size=384)
encoder = ViTModel(encoder_config)

model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder.is_decoder = True
model.config.decoder.add_cross_attention = True
Upon trainer.train(), this is what I'm receiving: 'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'logits'
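(The thread doesn't answer this, but a plausible cause is that RobertaModel has no language-modeling head, so its forward output carries no logits, while VisionEncoderDecoderModel expects a causal-LM decoder. A hedged sketch of a fix, with the decoder flags set on the config before the model is built:)

from transformers import RobertaConfig, RobertaForCausalLM

decoder_config = RobertaConfig(
    vocab_size=8192,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
    is_decoder=True,           # causal self-attention
    add_cross_attention=True,  # cross-attention to the image encoder
)
decoder = RobertaForCausalLM(config=decoder_config)  # has an LM head, so it returns logits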
Hi @Samreenhabib, kindly share with me the model configuration for the Urdu image data. Thanks.
Hey, apologies for the late reply; I don't know if you're still looking for the configuration. The code was exactly the same as in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb
However, to use a trainer trained on your own dataset, make sure you save it: trainer.save_model('./urdu_trainer'). Then simply call:
model = VisionEncoderDecoderModel.from_pretrained("./urdu_trainer")
image = Image.open('/content/40.png').convert("RGB")
image
pixel_values = processor.feature_extractor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
For more questions, please use the forum, as we'd like to keep Github issues for bugs/feature requests.
I am trying to use TrOCR for recognizing Urdu text from images. For the feature extractor I am using DeiT, and bert-base-multilingual-cased as the decoder. I can't figure out the requirements if I want to fine-tune a pre-trained TrOCR model but with a multilingual-cased decoder. I've followed the https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR tutorial, but it can't understand Urdu text as expected, I guess. Please guide me on how I should proceed. Should I create and train a new tokenizer built for Urdu? If yes, how can I integrate it with ViT?
Hi @Samreenhabib, can you please share your code with me? I am trying to fine-tune TrOCR with a Bangla dataset and, being a beginner, I am facing lots of problems. It would be very helpful if you could share your code. I will be grateful to you. Thanks!
Hi @SamithaShetty, as @NielsRogge notes in the comment above, questions like these are best placed in our forums. We try to reserve the github issues for feature requests and bug reports.
Hi, I have been working on TrOCR recently, and I am very new to these things. I am trying to extend TrOCR to all 22 scheduled Indian languages. I have used the AutoImageProcessor and AutoTokenizer classes, and for the encoder and decoder I have used BEiT and IndicBERTv2 respectively, as IndicBERTv2 supports all 22 languages.
But I have been facing some issues. I am using a synthetically generated dataset with almost the same format as the IAM dataset. I have been training the model with 2M examples for Bengali, and separately with 20M examples of Hindi+Bengali (10M each). For my training on Bengali only (2M): upon running inference after 10 epochs, I am facing the same error as mentioned by @Samreenhabib; the generated text is a repetition of the first word only.
For my training on Hindi+Bengali (20M): upon running inference after 3 epochs, I am facing the same issue as mentioned by @IlyasMoutawwakil, where the generated texts are just dots and commas. I am using the same code as in @NielsRogge's tutorial with PyTorch; I have just added Accelerate to train on multiple GPUs. Any kind of help or suggestions would help a lot, as my internship is over within a week, so I have to figure out the error as soon as possible.
Thank you so much.
I will attach the initialisation cell below:

from transformers import AutoImageProcessor, AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel
import torch

image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")
processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)

train_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/NewNOISE/bn/images/', df=train_df, processor=processor)
eval_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/NewNOISE/bn/images/', df=test_df, processor=processor)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = "cpu"  # overridden to CPU in the original cell
enc = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
dec = 'ai4bharat/IndicBERTv2-MLM-only'
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(enc, dec)
model.to(device)
Thank you again
@AnustupOCR, it seems like you are not saving the processor according to your requirements. Please take a look at the code here: https://github.com/Samreenhabib/Urdu-OCR/blob/main/Custom%20Transformer%20OCR/Custom%20TrOCR.ipynb
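(As a sketch, saving and reloading the custom processor alongside the model keeps preprocessing and tokenization consistent at inference time; the path is illustrative:)

from transformers import TrOCRProcessor

# save the processor next to the model checkpoint
processor.save_pretrained('./urdu_trainer')
# reload it later for inference
processor = TrOCRProcessor.from_pretrained('./urdu_trainer')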
How do I pick an encoder and decoder to fine-tune TrOCR on a specific language?
@NielsRogge Hello sir, I am trying TrOCR on Devanagari handwritten text. I would like to know which decoder would be best for this.