roatienza / deep-text-recognition-benchmark

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)
Apache License 2.0

Poor performance on some images #20

Closed dudeperf3ct closed 2 years ago

dudeperf3ct commented 2 years ago

Thank you for the awesome research!

I ran the code on the demo images and it worked perfectly. But when I run it on a few sample images of my own, the model's predictions are incoherent.

It would be great if you could answer a few of my questions:

  1. Does the model perform end-to-end STR, or does it require a cropped image (for example, from a text detector such as EAST or TextFuseNet)? Example: the 1st and 2nd images below, where the 1st image is a cropped version of the 2nd. The same applies to the 5th and 6th images.
  2. Does the model perform multi-line text recognition?
  3. The paper "Why You Should Try the Real Data for the Scene Text Recognition" mentions in Section 4.7 that there is scope for improving this research using the OpenImage v5 dataset. Have you tried this?

Examples:

I used vitstr_base_patch16_224_aug.pth model for prediction.

Image      Prediction
test6      middleborough
test6_1    midleerooogg
test4      qatm
img_11     aoe
test2      castlecampbell
test1      coaeeea

roatienza commented 2 years ago

Thanks for the feedback:

  1. ViTSTR can only process cropped text images. It performs recognition only and does not support text spotting (detection plus recognition).
  2. ViTSTR does not support multiline text. Multiline text has to be cropped into several images, one for each line of text.
  3. We have unpublished follow-on work that uses a much larger real dataset for training. We hope to publish it in the near future.
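
For readers wondering what the cropping step in the reply above might look like in practice, here is a minimal sketch. It is not code from this repository. It assumes that a separate text detector (for example EAST) has already produced axis-aligned boxes, and that the recognizer is available as a plain callable. The names `crop`, `recognize_regions`, and `recognize` are hypothetical and used only for illustration, and the image is modelled as a nested list of pixel values so the example has no dependencies.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

# Assumed conventions (not taken from the repository):
# a box is (left, top, right, bottom) with right/bottom exclusive,
# and an image is a list of rows of grayscale pixel values.
Box = Tuple[int, int, int, int]
Image = List[List[int]]
T = TypeVar("T")


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def recognize_regions(
    image: Image,
    boxes: Sequence[Box],
    recognize: Callable[[Image], T],
) -> List[T]:
    """Crop every detected box and run the recognizer on each crop.

    `recognize` is a hypothetical stand-in for a cropped-text recognizer.
    Boxes are visited by (top, left), which is only a naive approximation
    of reading order.
    """
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [recognize(crop(image, box)) for box in ordered]
```

With a real model, `recognize` would wrap whatever preprocessing and inference the recognizer needs. The point of the sketch is only that each word or line becomes its own cropped input.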