robertknight / ocrs

Rust library and CLI tool for OCR (extracting text from images)
Apache License 2.0

Add script to evaluate performance on SROIE dataset #44

Open rth opened 5 months ago

rth commented 5 months ago

Related to the discussion in https://github.com/robertknight/ocrs/issues/43, this adds a script to evaluate on the SROIE 2019 dataset (scanned receipts). I wanted to do an end-to-end eval and needed the executable, so it seemed easier to put it here rather than in https://github.com/robertknight/ocrs-models.

But feel free to close, I was mostly curious about the results.

To run this script:

  1. Install dependencies:
    • pip install scikit-learn datasets tqdm (I saw there are some metrics in ocrs-models, but for text vectorization it seemed easier to use scikit-learn)
  2. Optionally install pytesseract + tesseract
  3. Run:
    python tools/evaluate-sroie.py

which produces (on the first 100 of ~230 images):

Evaluating on SROIE 2019 dataset...
 - ORCS: 1.45 s / image, precision 0.96, recall 0.84, F1 0.90
 - Tesseract: 0.84 s / image, precision 0.36, recall 0.34, F1 0.35

The precision and recall scores are computed globally over the text extracted from all images, after tokenizing with scikit-learn's vectorizer.
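
To make the metric concrete, here is a minimal sketch of one way such global scores could be computed. It is an illustration under assumptions, not the code in tools/evaluate-sroie.py: it uses only the standard library, with a regex that approximates the default tokenization of scikit-learn's CountVectorizer (lowercase, tokens of two or more word characters), and the names `tokenize` and `global_scores` are hypothetical.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenization (assumption):
# lowercase the text, then keep runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Return a bag of tokens (token -> count) for one document."""
    return Counter(TOKEN_RE.findall(text.lower()))


def global_scores(pairs):
    """Compute precision, recall and F1 pooled over all images.

    pairs: iterable of (reference_text, ocr_text), one pair per image.
    Counts are summed across images before dividing, so the scores are
    "global" rather than an average of per-image scores.
    """
    true_pos = n_pred = n_ref = 0
    for reference, predicted in pairs:
        ref, pred = tokenize(reference), tokenize(predicted)
        # Multiset intersection: a token counts as matched at most as many
        # times as it occurs in both the reference and the OCR output.
        true_pos += sum((ref & pred).values())
        n_pred += sum(pred.values())
        n_ref += sum(ref.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because tokens are matched as a bag of words, this kind of score ignores word order and layout; it only measures how much of the reference vocabulary the OCR output recovers.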

So overall the scores look quite good! I'm not sure I'm using tesseract right, though: its performance looks pretty bad on this dataset. Or maybe it needs some pre-processing.

Run time is a bit slower than tesseract, but I imagine that could always be improved somewhat.
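
For context on how a "seconds per image" figure can be obtained end-to-end, here is a rough sketch that times an external command per image. It assumes the `ocrs` executable is on the PATH and prints the extracted text to stdout when given an image path; `run_timed` and `mean_seconds_per_image` are hypothetical helper names, not functions from the script.

```python
import subprocess
import time


def run_timed(cmd):
    """Run a command and return (stdout text, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout, time.perf_counter() - start


def mean_seconds_per_image(image_paths, base_cmd=("ocrs",)):
    """Average wall-clock time of running base_cmd on each image path.

    base_cmd defaults to the ocrs CLI (assumed to be installed); any
    command that takes an image path as its last argument would work.
    """
    times = [run_timed([*base_cmd, str(path)])[1] for path in image_paths]
    return sum(times) / len(times) if times else 0.0
```

Timing a fresh process per image includes process startup and whatever initialization the tool does on each invocation, so a number measured this way is probably an upper bound on the pure recognition time.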

rth commented 5 months ago

Thanks for the review. Addressed review comments.