roatienza / deep-text-recognition-benchmark

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)
Apache License 2.0

a question about ViTSTR #5

Open Danee-wawawa opened 3 years ago

Danee-wawawa commented 3 years ago

Hi, thank you for your work; it is very meaningful. I have a question about the algorithm design. You first split the input image into patches. If some characters are cut off at a patch boundary, or a patch contains multiple characters, does that affect recognition? Looking forward to your reply.

roatienza commented 3 years ago

The image is divided into non-overlapping patches. A patch may contain zero or more characters, or only parts of characters. With position embedding, the transformer is able to figure out how the parts make up a whole, so it has no impact. Something I have not tried and that could be experimented on: overlapping patches and smaller patches, as done in DINO.
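
For readers who want to see the patch split concretely, here is a minimal sketch in plain Python. It is not the ViTSTR code: the function name `extract_patches`, the toy 4x8 image, and the patch and stride values are all assumptions chosen for illustration. Setting the stride equal to the patch size gives the non-overlapping split described above, and a smaller stride gives the overlapping variant mentioned as an untried experiment. The position index paired with each patch stands in for the learned position embedding a real model would add.

```python
def extract_patches(image, patch, stride=None):
    """Split a 2D image (list of rows) into flattened square patches.

    stride == patch -> non-overlapping patches
    stride <  patch -> overlapping patches
    (Hypothetical helper for illustration, not part of the repository.)
    """
    stride = patch if stride is None else stride
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            # Flatten the patch row by row into a single vector.
            flat = [image[top + r][left + c]
                    for r in range(patch)
                    for c in range(patch)]
            patches.append(flat)
    return patches


# Toy 4x8 "image" whose pixel value encodes its own location.
image = [[row * 8 + col for col in range(8)] for row in range(4)]

non_overlapping = extract_patches(image, patch=4)        # stride defaults to 4
overlapping = extract_patches(image, patch=4, stride=2)  # neighbours share 2 columns

# Pair each patch with its sequence position. In a real model this index
# would select a learned position embedding added to the patch embedding.
tokens = list(enumerate(non_overlapping))

print(len(non_overlapping), len(overlapping))
```

A character that straddles the boundary between the two non-overlapping patches ends up split across tokens 0 and 1; the position index is what lets the model relate the two halves. In the overlapping case, the middle patch covers that boundary directly.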

Danee-wawawa commented 3 years ago

OK, thank you.