roatienza / deep-text-recognition-benchmark

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)
Apache License 2.0

why don't you normalize the images? #13

Open cuongdxk57 opened 3 years ago

cuongdxk57 commented 3 years ago

Thanks for your work. I noticed that you don't normalize the images before training. Does the transformer perform better this way? I look forward to your reply!

roatienza commented 3 years ago

Thanks. Normalization was not part of the CLOVA AI training/eval protocol that we used, so we did not try it. We reproduced their results and followed the same protocol for ViTSTR for a fair comparison.
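For readers unsure what "normalization" refers to here, the following is a minimal, dependency-free sketch of the arithmetic, assuming the question is about the usual mean/std normalization of pixel intensities. The `normalize` helper and the `mean=0.5`, `std=0.5` defaults are illustrative assumptions and are not taken from this repository; in PyTorch the same operation is typically applied with `torchvision.transforms.Normalize`.

```python
def normalize(pixels, mean=0.5, std=0.5):
    """Shift and scale pixel intensities: (p - mean) / std.

    mean and std are illustrative defaults (an assumption), not values
    used by this repository.
    """
    return [[(p - mean) / std for p in row] for row in pixels]


# A tiny 2x3 grayscale "image" with intensities already scaled to [0, 1].
image = [[0.0, 0.5, 1.0],
         [0.25, 0.75, 1.0]]

# With mean=0.5 and std=0.5, values in [0, 1] are mapped to [-1, 1].
normalized = normalize(image)
```

Whether this helps ViTSTR is not something the thread answers; it was simply left out to match the protocol being reproduced.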

cuongdxk57 commented 3 years ago

Thanks for your reply. Is your model able to recognize long text? I trained on my own dataset with an image size of (32, 448); however, after 300k iterations, the model accuracy is still quite low. Here are some images from my dataset: 学べそうもないところなどがあり 11月8日(木) 買い物へスーパー

roatienza commented 3 years ago

You might want to train from scratch (instead of starting from a pre-trained ViT) if you have access to a large training dataset. In that case, you can train without resizing the input images to the unconventional size of 224x224, as is done in ViTSTR. The closer the image sizes in the target test dataset are to those in the training dataset, the better.
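To make the resizing concern concrete, here is a rough arithmetic sketch of what happens when a (32, 448) image is squeezed into 224x224. It assumes (32, 448) means (height, width) and a ViT patch size of 16 pixels; neither is stated in the thread, and `resize_stats` is a hypothetical helper written only for this illustration.

```python
def resize_stats(height, width, target=224, patch=16):
    """Summarize the distortion from resizing to a target x target square.

    patch=16 is an assumed ViT patch size, not a value from the thread.
    """
    return {
        # Width-to-height ratio of the original image.
        "aspect_ratio": width / height,
        # Per-axis scale factors applied by the resize.
        "scale_h": target / height,
        "scale_w": target / width,
        # Patch count after resizing to the square input.
        "patches_resized": (target // patch) ** 2,
        # Patch count if the image were fed at its native size instead.
        "patches_native": (height // patch) * (width // patch),
    }


# Assumption: (32, 448) from the question is (height, width).
stats = resize_stats(32, 448)
```

Under these assumptions the image is stretched 7x vertically and compressed to half its width, so a 14:1 text line becomes a square. That change of geometry is one plausible reason to prefer training at a size close to the test images when enough data is available.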