Open rth opened 8 months ago
What datasets were you considering?
I'm currently working on a synthetic data generator. This has the advantage that it can provide coverage of many languages and domains, as long as a suitable source of text samples (e.g. Wikipedia) is available.
Suggestions for additional datasets are welcome in the ocrs-models repo. The main requirement is that they be openly licensed for any use (requiring attribution à la CC BY-SA is OK).
As discussed in https://github.com/robertknight/ocrs/issues/14 it would be nice to add some evaluation benchmarks, and maybe optionally compare with Tesseract or some other reference open source OCR engine.
I agree. I plan to publish metrics for the HierText dataset, which is the main dataset on which the models are trained. Additional benchmarks (for whatever datasets people are interested in) are an area where contributions are welcome.
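For anyone thinking about contributing a benchmark, here is a minimal sketch of one commonly used OCR metric, character error rate (edit distance divided by reference length). This is an assumption for illustration only: the thread does not say which metrics will be published, and the function names `edit_distance` and `char_error_rate` are hypothetical, not part of Ocrs.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the ground-truth text."""
    if not reference:
        # Convention chosen for this sketch: empty reference is only
        # "correct" if the hypothesis is empty too.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

The same function could be run over the output of any two engines on the same ground truth, which would make a side-by-side comparison straightforward.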
I'm currently working on a synthetic data generator.
https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.
I started with this project. I found that a recognition model trained on output from an unmodified version of it achieves a very low error rate in training, but doesn't generalize well when used with Ocrs. So I'm exploring changes to improve this (preprocessing adjustments, more varied backgrounds, more varied fonts, etc.).
One reason is that Ocrs's preprocessing cuts lines out of the surrounding image, to avoid ambiguity over which text should be recognized, as a simple rectangular cut-out might include other text. Example (produced by ocrs image.jpeg --text-line-images):
Synthetic generators need to be modified to apply similar masking.
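To make the idea concrete, here is a rough sketch of what "similar masking" could look like in a generator: crop to the bounding box of a text line's polygon, then blank out every pixel that falls outside the polygon. This is not Ocrs's actual preprocessing code. The function names, the list-of-rows image representation, and the fill value are all assumptions made for illustration.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal ray going right from (x, y)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def mask_line(image, polygon, fill=0):
    """Crop `image` (a list of pixel rows) to the polygon's bounding box
    and replace pixels outside the polygon with `fill`."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x0 = max(0, math.floor(min(xs)))
    x1 = min(len(image[0]), math.ceil(max(xs)))
    y0 = max(0, math.floor(min(ys)))
    y1 = min(len(image), math.ceil(max(ys)))
    out = []
    for y in range(y0, y1):
        row = []
        for x in range(x0, x1):
            # Test the pixel centre against the line polygon.
            keep = point_in_polygon(x + 0.5, y + 0.5, polygon)
            row.append(image[y][x] if keep else fill)
        out.append(row)
    return out


# Hypothetical example: a slanted line region on a tiny 4x3 "image".
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 9, 9, 9]]
polygon = [(0, 0), (4, 0), (4, 1), (0, 2)]
masked = mask_line(image, polygon)
```

A plain rectangular crop of the same region would keep the 7 and 8 in the second row. The masked version drops them, which is the kind of difference a recognition model would see between generator output and Ocrs's line images.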
Also curious about the performance of Ocrs compared to things like:
I am using Apple and Windows native OCR in https://github.com/louis030195/screen-pipe
but I'm looking for a solution for Linux to replace Tesseract, which is complete garbage.
There are a few dimensions to consider:
For model size, I consider OCR models "small" if they have a few million parameters, and "large" if they have hundreds of millions or more.
On those axes:
Thanks for creating this package!
As discussed in https://github.com/robertknight/ocrs/issues/14 it would be nice to add some evaluation benchmarks, and maybe optionally compare with Tesseract or some other reference open source OCR engine.
What datasets were you considering?
There is, for instance, the SROIE dataset of scanned receipts. The dataset can be found here (couldn't find a more official source). In particular, there are two tasks described in their paper,