qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

try cutting corners to become faster #44

Closed: bertsky closed this issue 3 years ago

bertsky commented 3 years ago

I know the task of aligning concatenated pages is much harder than just aligning text lines from the same segmentation (as in ocrd-cor-asv-ann-evaluate). But here it's all the more pressing to get decent performance IMO.

Therefore I would suggest the following:

  1. Replace the matrix calculation with a C library backend (there are several existing Python packages that already do this). The algorithm will still be O(n²), but the constant factor should be an order of magnitude smaller IIRC (see the first sketch below).

  2. Instead of aligning the full page, try to cut corners by making the problem size n smaller: first compare pairs of regions and take the best fit, then compare pairs of lines and take the best fit. The comparison above the line level can be merely approximate: you only need a proportional score, not an actual alignment, and you can work with bounds instead of an exhaustive search. For example, difflib.SequenceMatcher.quick_ratio can do this (see the second sketch below).
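
To illustrate point 1, here is a minimal sketch that delegates the distance computation to rapidfuzz, picked purely as an example of a C++-backed package; the choice of library, the sample strings and the error-rate formula are illustrative assumptions, not anything dinglehopper actually uses:

```python
# Sketch of point 1: let a C/C++-backed library fill the edit-distance
# matrix instead of computing it in pure Python. rapidfuzz is only one
# example of such a backend (python-Levenshtein would work similarly).
from rapidfuzz.distance import Levenshtein

gt_text = "Die Würde des Menschen ist unantastbar."    # illustrative sample
ocr_text = "Die Wurde des Mensclien ist unantastbar."  # illustrative sample

# Plain edit distance between ground truth and OCR text.
dist = Levenshtein.distance(gt_text, ocr_text)

# Character error rate relative to the ground-truth length.
cer = dist / len(gt_text)
print(f"distance={dist}, CER={cer:.3f}")

# If the actual alignment is needed for a visual diff, editops yields the
# insert/delete/replace operations without any pure-Python matrix code.
print(Levenshtein.editops(gt_text, ocr_text))
```

The point is only that the O(n²) work happens in native code; the Python side stays exactly as readable as before.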

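And a sketch of point 2, using difflib.SequenceMatcher.quick_ratio as the cheap, approximate score to pair up regions before any exact alignment runs. The greedy pairing strategy and all names below are made up for illustration; quick_ratio only gives an upper bound on the real similarity, which is exactly why it is fast:

```python
# Sketch of point 2: pair ground-truth regions with OCR regions via a cheap
# approximate score, so the expensive exact alignment only has to run on the
# matched pairs (and therefore on much shorter texts).
from difflib import SequenceMatcher

def quick_score(a, b):
    """Cheap upper bound on the similarity ratio, no full alignment."""
    return SequenceMatcher(None, a, b).quick_ratio()

def pair_regions(gt_regions, ocr_regions):
    """Greedily assign each ground-truth region its best-scoring OCR region."""
    unmatched = set(range(len(ocr_regions)))
    pairs = []
    for i, gt in enumerate(gt_regions):
        if not unmatched:
            break
        best = max(unmatched, key=lambda j: quick_score(gt, ocr_regions[j]))
        pairs.append((i, best))
        unmatched.remove(best)
    return pairs

gt = ["First region of the page.", "Second region, some more text."]
ocr = ["Second regiom, sorne more text.", "First region of the page,"]
print(pair_regions(gt, ocr))  # expected: [(0, 1), (1, 0)]
```

The same idea repeats one level down for the lines inside each matched region pair; the exact character-level alignment then only runs where its result is actually reported.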
mikegerber commented 3 years ago
> 2. Instead of aligning the full page, try to cut corners by making the problem size n smaller: first compare pairs of regions and take the best fit, then compare pairs of lines and take the best fit. The comparison above the line level can be merely approximate: you only need a proportional score, not an actual alignment, and you can work with bounds instead of an exhaustive search. For example, difflib.SequenceMatcher.quick_ratio can do this.

The one thing that made me start this project is that the other evaluation tools were terrible to audit. I will not "cut corners" if that makes the evaluation or the evaluation code hard to understand or even incorrect.

bertsky commented 3 years ago

> The one thing that made me start this project is that the other evaluation tools were terrible to audit. I will not "cut corners" if that makes the evaluation or the evaluation code hard to understand or even incorrect.

Frankly, I don't comprehend that response. What does choosing an efficient backend or making better algorithmic choices have to do with code quality?