Poor performance on some images

roatienza / deep-text-recognition-benchmark

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Apache License 2.0


Thank you for the awesome research!

I ran the code on the demo images and it worked perfectly. But when I run it on a few sample images, the model's predictions seem incoherent.

It would be great if you could answer a few of my questions:

1. Does the model perform end-to-end STR, or does it require a cropped image (for example, from a text detector such as EAST or TextFuseNet)? Example: the 1st and 2nd images below (the 1st is a cropped version of the 2nd); the same applies to the 5th and 6th images.
2. Does the model perform multi-line text recognition?
3. The paper "Why You Should Try the Real Data for the Scene Text Recognition" mentions, in Section 4.7, room for improvement using the OpenImage v5 dataset. Have you tried this for this research?

Examples:

I used the vitstr_base_patch16_224_aug.pth model for prediction.
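
To make the first two questions above concrete, here is a minimal sketch of the detect-then-crop preprocessing I have in mind. It assumes the recognizer expects one cropped word per image, which is my assumption and not something the repository confirms. The box coordinates and the helpers `pad_and_clamp` and `reading_order` are hypothetical and are not part of this codebase or of any detector's API.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def pad_and_clamp(box: Box, image_size: Tuple[int, int], margin: int = 2) -> Box:
    """Grow a detector box by `margin` pixels and keep it inside the image."""
    width, height = image_size
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


def reading_order(boxes: List[Box]) -> List[List[Box]]:
    """Group boxes into text lines (top to bottom), each sorted left to right.

    A box joins the current line when its vertical centre lies above the
    bottom edge of that line's first box; otherwise it starts a new line.
    """
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        centre_y = (box[1] + box[3]) / 2
        if lines and centre_y < lines[-1][0][3]:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [sorted(line, key=lambda b: b[0]) for line in lines]


# Hypothetical detector output for a two-line sign in a 200x100 image.
boxes = [(110, 12, 180, 40), (10, 10, 100, 40), (10, 55, 90, 85)]
lines = reading_order(boxes)
crops = [[pad_and_clamp(b, (200, 100)) for b in line] for line in lines]
```

Each tuple in `crops` uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so every word could be cut out and recognized separately, line by line. Whether this step is actually needed is exactly what I am asking.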
