robertknight / ocrs

Rust library and CLI tool for OCR (extracting text from images)
Apache License 2.0

Add evaluation benchmarks #43

Open rth opened 5 months ago

rth commented 5 months ago

Thanks for creating this package!

As discussed in https://github.com/robertknight/ocrs/issues/14, it would be nice to add some evaluation benchmarks, and maybe optionally compare with Tesseract or some other reference open-source OCR engine.

What datasets were you considering?

There is, for instance, the SROIE dataset of scanned receipts. The dataset can be found here (couldn't find a more official source). In particular, there are two tasks described in their paper,

robertknight commented 5 months ago

What datasets were you considering?

I'm currently working on a synthetic data generator. This has the advantage that it can provide coverage of many languages and domains, as long as a suitable source of text samples (e.g. Wikipedia) is available.

Suggestions for additional datasets are welcome in the ocrs-models repo. The main requirement is that they be openly licensed for any use (requiring attribution à la CC-BY-SA is OK).

As discussed in https://github.com/robertknight/ocrs/issues/14, it would be nice to add some evaluation benchmarks, and maybe optionally compare with Tesseract or some other reference open-source OCR engine.

I agree. I plan to publish metrics for the HierText dataset, which is the main dataset on which the models are trained. Additional benchmarks (for whatever datasets people are interested in) are an area where contributions are welcome.
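For anyone wanting to contribute a benchmark, here is a minimal sketch of one commonly used recognition metric, character error rate (CER), computed as Levenshtein edit distance divided by the length of the reference text. This is an illustration only: the thread does not say which metrics will be published, and the function names `edit_distance` and `char_error_rate` are hypothetical, not part of Ocrs.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference string and an OCR hypothesis."""
    # prev[j] holds the distance between the processed prefix of `ref`
    # and the first j characters of `hyp`.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of `ref` missing from `hyp`
                curr[j - 1] + 1,     # extra character in `hyp`
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (hypothetical helper)."""
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is also empty.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Running the same ground-truth lines through Ocrs and a reference engine, then comparing the resulting scores, would give a like-for-like comparison, provided both outputs are normalized the same way (whitespace, case) before scoring.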

rth commented 5 months ago

I'm currently working on a synthetic data generator.

https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.

robertknight commented 5 months ago

https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.

I started with this project. I found that a recognition model trained on output from an unmodified version of it achieves a very low error rate in training, but doesn't generalize well when used with Ocrs. So I'm exploring changes to improve this (preprocessing adjustments, more varied backgrounds, more varied fonts, etc.).

One reason is that Ocrs's preprocessing cuts lines out of the surrounding image, to avoid ambiguity over which text should be recognized, as a simple rectangular cut-out might include other text. Example (produced by ocrs image.jpeg --text-line-images):

[Image: line-10]

Synthetic generators need to be modified to apply similar masking.
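As a rough illustration of what such masking could look like in a generator, the sketch below crops the bounding box of a text-line polygon and blanks every pixel whose center falls outside the polygon. It is not Ocrs's actual preprocessing code: the polygon representation, the fill value of 0, and the names `point_in_polygon` and `mask_line` are all assumptions made for this example.

```python
import math


def point_in_polygon(x: float, y: float, polygon: list) -> bool:
    """Even-odd ray casting test for a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def mask_line(image: list, polygon: list, fill: int = 0) -> list:
    """Crop the polygon's bounding box from a grayscale image (list of rows)
    and replace pixels outside the polygon with `fill` (assumed value)."""
    height, width = len(image), len(image[0])
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x0, x1 = max(0, math.floor(min(xs))), min(width, math.ceil(max(xs)))
    y0, y1 = max(0, math.floor(min(ys))), min(height, math.ceil(max(ys)))
    out = []
    for y in range(y0, y1):
        row = []
        for x in range(x0, x1):
            # Test the pixel center so edge pixels are handled consistently.
            keep = point_in_polygon(x + 0.5, y + 0.5, polygon)
            row.append(image[y][x] if keep else fill)
        out.append(row)
    return out
```

A generator could render a line together with neighbouring distractor text, then apply a mask like this so the training crops resemble what the recognition model sees at inference time.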

louis030195 commented 3 weeks ago

also curious about performance of OCRs compared to things like:

i am using apple & windows native OCR in https://github.com/louis030195/screen-pipe

but looking for solution for linux to replace tesseract which is complete garbage

robertknight commented 3 weeks ago

There are a few dimensions to consider:

For model size, I consider OCR models "small" if they have a few million parameters, and "large" if they have hundreds of millions or more.

On those axes: