Add evaluation benchmarks

rth commented 8 months ago

Thanks for creating this package!

As discussed in https://github.com/robertknight/ocrs/issues/14 it would be nice to add some evaluation benchmarks. And maybe optionally compare with tesseract or some other reference open source OCR.

What datasets were you considering?

There is, for instance, the SROIE dataset of scanned receipts. The dataset can be found here (I couldn't find a more official source). In particular, there are two tasks described in their paper:

Task 1 - Scanned Receipt Text Localisation. I didn't fully understand how the evaluation works from skimming their paper, though.
Task 2 - Scanned Receipt OCR. Computing precision, recall and F1 score for all words (space tokenized) extracted from the document, as far as I understand.
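
To make the Task 2 metric concrete, here is a minimal sketch of word-level precision, recall and F1 over space-tokenized text. It reflects my reading of the description above and is not the official SROIE scoring script; the function name `word_prf` and the multiset matching of repeated words are assumptions.

```python
from collections import Counter


def word_prf(predicted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over space-tokenized words.

    Assumption: words are compared as multisets, so a word that occurs
    twice in the reference must be predicted twice to count fully.
    """
    pred_words = Counter(predicted.split())
    ref_words = Counter(reference.split())

    # Multiset intersection keeps the minimum count of each shared word.
    matched = sum((pred_words & ref_words).values())
    n_pred = sum(pred_words.values())
    n_ref = sum(ref_words.values())

    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical OCR output compared against a hypothetical ground truth.
    print(word_prf("TOTAL 12.50 CASH", "TOTAL 12.50 CASH CHANGE"))
```

The same function could be run on the output of any engine (Ocrs, Tesseract, etc.) against the same ground truth to get comparable numbers.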

robertknight commented 8 months ago

> What datasets were you considering?

I'm currently working on a synthetic data generator. This has the advantage that it can provide coverage of many languages and domains, as long as a suitable source of text samples (e.g. Wikipedia) is available.

Suggestions for additional datasets are welcome in the ocrs-models repo. The main requirement is that they be openly licensed for any use (requiring attribution, à la CC-BY-SA, is OK).

> As discussed in https://github.com/robertknight/ocrs/issues/14 it would be nice to add some evaluation benchmarks. And maybe optionally compare with tesseract or some other reference open source OCR.

I agree. I plan to publish metrics for the HierText dataset, which is the main dataset on which the models are trained. Additional benchmarks (for whatever datasets people are interested in) are an area where contributions are welcome.

rth commented 8 months ago

> I'm currently working on a synthetic data generator.

https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.

robertknight commented 8 months ago

> https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.

I started with this project. I found that a recognition model trained on output from an unmodified version of it achieves a very low error rate in training, but doesn't generalize well when used with Ocrs. So I'm exploring changes to improve this (preprocessing adjustments, more varied backgrounds, more varied fonts etc.).

One reason is that Ocrs's preprocessing cuts lines out of the surrounding image, to avoid ambiguity over which text should be recognized, as a simple rectangular cut-out might include other text. Example (produced by ocrs image.jpeg --text-line-images):

[Image: line-10]

Synthetic generators need to be modified to apply similar masking.
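
As a rough illustration of the kind of masking meant here, the sketch below crops a text line to its bounding box and blanks every pixel outside the line's polygon. It is not Ocrs's actual preprocessing code: the function names, the list-of-rows grayscale image, the polygon representation of a line and the fill value are all assumptions made for the example.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) strictly inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cut_out_line(image, polygon, fill=0):
    """Crop `image` to the polygon's bounding box, masking outside pixels.

    `image` is a list of rows of grayscale values; `polygon` is a list of
    (x, y) vertices outlining one text line. Pixels whose centre lies
    outside the polygon are replaced by `fill` (an assumed background value).
    """
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    left = max(0, math.floor(min(xs)))
    right = min(len(image[0]), math.ceil(max(xs)))
    top = max(0, math.floor(min(ys)))
    bottom = min(len(image), math.ceil(max(ys)))
    return [
        [
            image[r][c] if point_in_polygon(c + 0.5, r + 0.5, polygon) else fill
            for c in range(left, right)
        ]
        for r in range(top, bottom)
    ]


if __name__ == "__main__":
    page = [[9] * 5 for _ in range(4)]          # tiny uniform "page"
    slanted_line = [(0, 0), (4, 0), (4, 1), (0, 3)]
    for row in cut_out_line(page, slanted_line):
        print(row)
```

A synthetic generator could apply a step like this to each rendered line so that training inputs resemble the masked line images the recognition model sees at inference time.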

louis030195 commented 3 months ago

also curious about performance of Ocrs compared to things like:

apple native OCR
windows native OCR
trocr (microsoft) https://github.com/huggingface/candle/tree/main/candle-examples/examples/trocr
multimodal LLM

i am using apple & windows native OCR in https://github.com/louis030195/screen-pipe

but looking for solution for linux to replace tesseract which is complete garbage

robertknight commented 3 months ago

There are a few dimensions to consider:

Model size: Larger models can store more knowledge/patterns but are slower to execute and use more memory.
Functionality: Some models can both detect and read text in an image, others only recognize text in a line image.
Linguistic and world knowledge: If the model is multimodal, it might be able to use that knowledge to disambiguate (e.g. to understand text by looking at the context in a photo).
Training data: Is the model trained to recognize printed text, handwritten text etc.?

For model size, I consider OCR models "small" if they have a few million parameters, and "large" if they have hundreds of millions or more.

On those axes:

Apple's native OCR: Small model. Does detection + recognition. Very good accuracy in my experience.
Windows native OCR: I'm not familiar with it, but I would expect it to be in a similar class to Apple's solution.
Tesseract: Small model. Does detection + recognition, albeit the detection is crude. The accuracy can be good for clean document images (dark text, light background, straight, low background clutter), but it can produce poor results if the image has inverted colors, complex backgrounds etc.
Ocrs: Small model. Does detection + recognition. Accuracy is not as good as Apple's OCR. More tolerant of variation in background, colors etc. than Tesseract. Accuracy vs Tesseract varies depending on the image. Layout analysis (understanding of reading order) is quite dumb.
TrOCR: Large model (330M params for base), there is a medium-sized (66M params) "small" one as well. Does recognition only. Comes in printed and handwritten fine-tunes. Better accuracy than Tesseract and Ocrs but more expensive to run.
Multimodal LLM: Large models (parameter counts usually measured in billions). Can take advantage of their linguistic and world knowledge to understand text, but are also much more expensive to run.

robertknight / ocrs

Add evaluation benchmarks #43