roatienza / deep-text-recognition-benchmark

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)
Apache License 2.0

Is the network suitable for long-text recognition? #6

Open WudiJoey opened 3 years ago

WudiJoey commented 3 years ago

Thanks for your work! I read your paper and noticed that input images are resized to [224, 224]. For long text lines, does this affect the accuracy? Looking forward to your reply!

WudiJoey commented 3 years ago

Adding: the width of a text image is often much greater than its height. Can the image information be preserved as much as possible if the image is resized to a square? Looking forward to your reply~

roatienza commented 3 years ago

Hi, the resized images (224x224) are still human-readable. The attention maps on the square images also appear to give proper weight to each character region. Beyond these observations, there is no empirical evidence of how the resizing affects accuracy. An alternative is to resize to (100, 32) and then pad up to 224x224.
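The resize-then-pad alternative mentioned above could be sketched roughly as follows. This is a minimal NumPy-only illustration, not the repository's actual preprocessing: the helper names `resize_nearest`, `pad_to_square`, and `resize_then_pad` are hypothetical, nearest-neighbour sampling stands in for whatever interpolation a real pipeline would use, and centring the text on a zero-filled canvas is an assumption.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output column
    return img[rows[:, None], cols]


def pad_to_square(img, size=224, fill=0):
    """Centre img on a size x size canvas filled with `fill` (assumed layout)."""
    h, w = img.shape[:2]
    if h > size or w > size:
        raise ValueError("image is larger than the target canvas")
    canvas = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas


def resize_then_pad(img, width=100, height=32, size=224):
    """Resize to width x height, then pad to size x size."""
    return pad_to_square(resize_nearest(img, height, width), size)


if __name__ == "__main__":
    line = np.full((64, 400), 255, dtype=np.uint8)  # dummy all-white text line
    print(resize_then_pad(line).shape)
```

With these defaults the text occupies only a 100x32 region of the 224x224 input, so most of the canvas is padding.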

WudiJoey commented 3 years ago

Thanks for your reply~ I will try your work.

luvwinnie commented 2 years ago

I'm trying to handle a very long sentence. I resized the image to a height of 32 while keeping the aspect ratio, then padded it to 224x224; the result looks like the attached image. @WudiJoey, have you ever tried training on long, wide images? Does it affect the accuracy even when the image is squeezed like this? (Image: Screen Shot 2021-11-10 at 12 00 15)
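A rough sketch of this aspect-preserving variant is given below, under a few assumptions that are not stated in the thread: the function name `fit_and_pad` is hypothetical, the text is left-aligned and vertically centred on a zero-filled canvas, and when a line at height 32 would be wider than 224 it is shrunk further until it fits. Nearest-neighbour sampling in plain NumPy is used only to keep the example self-contained.

```python
import numpy as np


def fit_and_pad(img, target_h=32, size=224, fill=0):
    """Scale img to at most target_h x size (aspect ratio kept), then pad.

    Returns the padded canvas and the (height, width) the text occupies.
    """
    in_h, in_w = img.shape[:2]
    # Use the smaller scale so the line fits both the height and the width limit.
    scale = min(target_h / in_h, size / in_w)
    new_h = max(1, round(in_h * scale))
    new_w = max(1, min(size, round(in_w * scale)))

    # Nearest-neighbour resize.
    rows = np.arange(new_h) * in_h // new_h
    cols = np.arange(new_w) * in_w // new_w
    resized = img[rows[:, None], cols]

    canvas = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    top = (size - new_h) // 2
    canvas[top:top + new_h, :new_w] = resized
    return canvas, (new_h, new_w)


if __name__ == "__main__":
    long_line = np.ones((64, 1280), dtype=np.uint8)  # 20:1 aspect ratio
    _, occupied = fit_and_pad(long_line)
    print(occupied)
```

For a 64x1280 input the width limit dominates, so the text ends up about 11 pixels tall, which is the squeezing effect asked about here.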

WudiJoey commented 2 years ago

I haven't tried your resize method because I think the large blank area may introduce useless information. I just resize my images directly to a square, and it works. But I think there is a better way to process such long, wide images, such as cutting the image into pieces and arranging them in rows.
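The cut-and-stack idea could look something like the sketch below. Nobody in the thread reports having tested it, so treat it purely as an illustration: the function name `wrap_text_line` is hypothetical, the strip height of 32 and the 224x224 canvas are borrowed from the earlier comments, and cuts are made at fixed pixel offsets, which may split a character in two.

```python
import numpy as np


def wrap_text_line(img, strip_h=32, size=224, fill=0):
    """Resize a text line to height strip_h, cut it into size-wide chunks,
    and stack the chunks top to bottom on a size x size canvas.

    Returns the canvas and the number of rows used.
    """
    in_h, in_w = img.shape[:2]
    max_rows = size // strip_h   # 7 rows of height 32 fit in 224
    max_w = max_rows * size      # widest strip the canvas can hold
    # Keep the aspect ratio unless the strip would overflow the canvas.
    new_w = max(1, min(max_w, round(in_w * strip_h / in_h)))

    # Nearest-neighbour resize to (strip_h, new_w).
    rows = np.arange(strip_h) * in_h // strip_h
    cols = np.arange(new_w) * in_w // new_w
    strip = img[rows[:, None], cols]

    canvas = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    n_rows = -(-new_w // size)   # ceiling division
    for r in range(n_rows):
        chunk = strip[:, r * size:(r + 1) * size]
        canvas[r * strip_h:(r + 1) * strip_h, :chunk.shape[1]] = chunk
    return canvas, n_rows


if __name__ == "__main__":
    long_line = np.ones((64, 1000), dtype=np.uint8)
    canvas, used = wrap_text_line(long_line)
    print(canvas.shape, used)
```

Compared with padding a single strip, this layout keeps the text at full height and leaves far less blank area, at the cost of the model having to learn the row-wrapped reading order.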

luvwinnie commented 2 years ago

Thank you for your reply! Cutting the image and arranging the pieces in rows seems like a very good approach; I would like to give it a try.

Hmm... however, the input size currently seems to be fixed by the base Vision Transformer. Maybe we should find a way to handle variable-size images, as convolution does. Perhaps the base Vision Transformer could be replaced with a more recent vision-transformer-based architecture.
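One way to see why the input size is tied to the model: a standard Vision Transformer splits the image into non-overlapping patches and learns one position embedding per patch, so the number of patches is fixed at training time. The small sketch below only does that arithmetic; the patch size of 16 is an assumption here (check the model configuration), and `num_patches` is a hypothetical helper, not a function from the repository.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch x patch tiles in a height x width image."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)


if __name__ == "__main__":
    print(num_patches(224, 224))  # square input discussed in this thread
    print(num_patches(32, 224))   # a single 32-pixel-high strip
```

Under that assumption a 224x224 input yields 196 patches while a 32x224 strip yields only 28, so feeding a different shape would require resizing or retraining the position embeddings.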