lukas-blecher / LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.
https://lukas-blecher.github.io/LaTeX-OCR/
MIT License
11.93k stars 974 forks

Effect of resolution #223

Open with-him777 opened 1 year ago

with-him777 commented 1 year ago

Sorry to bother you again. I found that the resolution of an image has a big impact on recognition quality. For example, if some input images are scaled down to 80% or up to 120% of their original resolution, the recognition result changes significantly, which makes the output too unpredictable.

lukas-blecher commented 1 year ago

I've noticed that too, which is why I trained a small classification model to determine what resolution the input image should have. I've noted it in the README, and I also include the train_resizer.py script for completeness.

Did you use the CLI or the GUI for your experiment? By default, the images should be resized there.

uniartisan commented 1 year ago

I'm quite interested in this problem. I'm wondering whether there is enough image augmentation in training. To be honest, I'm still fairly unfamiliar with this project, but from reading the source code, I think the images could be resized randomly or periodically during training. I've seen something similar in quite a few image tasks, where the resolution varies from 224x224 all the way up to 1024x1024. In my opinion, LaTeX-OCR could try the same approach. I will try out a new idea sometime in the weeks after the new year. By the way, what kind of graphics card do I need to train this model? My desktop may not be able to handle it.

https://github.com/lukas-blecher/LaTeX-OCR/blob/44d70ebc6676b27fd18c17547eefd74a43bd8490/pix2tex/dataset/transforms.py#L4

lukas-blecher commented 1 year ago

When creating the dataset, I already varied the resolution of the formulas to some extent. The model only supports images whose dimensions are multiples of the patch size, so I tried to create a diverse dataset from the beginning and to augment it at training time with the suggestions you mentioned.
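To make the patch-size constraint concrete, here is a minimal sketch of the arithmetic for rounding an image size up to the next multiple of the patch size. The function names and the default patch size of 16 are assumptions for illustration, not taken from the pix2tex code; the real value is whatever the model config specifies.

```python
def round_up_to_patch(width, height, patch_size=16):
    """Smallest (width, height) >= the input where both are multiples of patch_size.

    patch_size=16 is an assumed default for illustration only.
    """
    # Ceiling division using integers only: -(-a // b) == ceil(a / b)
    new_w = -(-width // patch_size) * patch_size
    new_h = -(-height // patch_size) * patch_size
    return new_w, new_h


def padding_needed(width, height, patch_size=16):
    """Extra pixels (horizontal, vertical) required to reach the padded size."""
    new_w, new_h = round_up_to_patch(width, height, patch_size)
    return new_w - width, new_h - height


if __name__ == "__main__":
    print(round_up_to_patch(100, 33))  # (112, 48)
    print(padding_needed(100, 33))     # (12, 15)
```

Padding keeps the formula at its original scale, whereas resizing to the nearest multiple would change the apparent resolution slightly, which is the effect this issue is about.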

uniartisan commented 1 year ago

> When creating the dataset I already varied the resolution of the formulas to some extent. The model supports only images with dimensions that are multiple of the patch size, so I tried to create a diverse dataset from the beginning and enhance it during training time with the suggestions you mentioned.

Sorry, my English may not be very good. If I understand correctly (also from reading the code), you diversify the data when creating the dataset, but you do not change the image resolution periodically and/or randomly during training. What I mean is changing the image resolution again during training: first make small changes to the aspect ratio of the original image, then resize the whole image, say to 2-3 times larger or down to half the original size.

albumentations.augmentations.geometric.resize.RandomScale
albumentations.augmentations.geometric.resize.SmallestMaxSize

https://albumentations.ai/docs/api_reference/augmentations/geometric/resize/#resizing-transforms-augmentationsgeometricresize
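As a rough illustration of the idea above, the following sketch samples a random training size for one image: a scale factor between 0.5x and 3x, a small aspect-ratio jitter, and a snap to patch-size multiples. It is plain Python, not the albumentations API and not the pix2tex training pipeline. The function name `sample_train_size`, the patch size of 16, the 10% jitter, and the maximum size of 672x192 are all assumptions chosen for the example.

```python
import random


def sample_train_size(width, height, rng, scale_range=(0.5, 3.0),
                      aspect_jitter=0.1, patch_size=16, max_size=(672, 192)):
    """Pick a randomly rescaled (width, height) for one training sample.

    All defaults are illustrative assumptions. max_size must be a multiple
    of patch_size in both dimensions for the upper bound to hold.
    """
    scale = rng.uniform(*scale_range)                           # overall zoom
    stretch = 1.0 + rng.uniform(-aspect_jitter, aspect_jitter)  # width-only jitter
    new_w = width * scale * stretch
    new_h = height * scale

    # Shrink uniformly if the sample would exceed the maximum size.
    shrink = min(1.0, max_size[0] / new_w, max_size[1] / new_h)
    new_w *= shrink
    new_h *= shrink

    def snap(value):
        # Nearest multiple of patch_size, but never smaller than one patch.
        return max(patch_size, int(round(value / patch_size)) * patch_size)

    return snap(new_w), snap(new_h)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(sample_train_size(160, 48, rng))
```

The returned size would then be passed to whatever resize routine the training code uses. Drawing a fresh size each time a sample is loaded is what makes the resolution vary during training rather than only at dataset creation.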

At the same time, I saw that when PaddleOCR detects a text box, a reference size is set for character recognition. I wonder whether a reference size could also be set for formula recognition, so that different image resolutions are multiples of the reference size.

https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/configs/rec/PP-OCRv3/ch_PP-OCRv3_rec.yml#L95
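A minimal sketch of the reference-size idea, under stated assumptions: every image is scaled so that its height matches a fixed reference height, and the width follows the aspect ratio and is snapped to a multiple of the patch size. The reference height of 48, the patch size of 16, the maximum width of 672, and the function name are assumptions for illustration. They are not confirmed values from the linked PaddleOCR config or from pix2tex.

```python
def resize_to_reference_height(width, height, ref_height=48,
                               patch_size=16, max_width=672):
    """Return (new_width, new_height, scale) for a fixed reference height.

    ref_height, patch_size and max_width are illustrative assumptions;
    ref_height should itself be a multiple of patch_size.
    """
    scale = ref_height / height
    scaled_w = width * scale

    # Snap the width to the nearest patch multiple, at least one patch wide.
    snapped_w = max(patch_size, round(scaled_w / patch_size) * patch_size)

    # Cap the width at the largest patch multiple not exceeding max_width.
    width_cap = (max_width // patch_size) * patch_size
    return min(snapped_w, width_cap), ref_height, scale


if __name__ == "__main__":
    print(resize_to_reference_height(200, 96))   # (96, 48, 0.5)
    print(resize_to_reference_height(1000, 24))  # (672, 48, 2.0)
```

One caveat for formulas: the image height depends on the content (a fraction is taller than a single symbol), so a fixed image height does not fix the glyph size the way it roughly does for a single line of text. A reference based on estimated character height might be a better fit, but that is speculation rather than something tested here.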