qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Add flexible character accuracy #47

Closed b2m closed 3 years ago

b2m commented 3 years ago

This is a first draft for adding the flexible character accuracy as suggested by @cneud in #32.

C. Clausner, S. Pletschacher, A. Antonacopoulos, "Flexible character accuracy measure for reading-order-independent evaluation", Pattern Recognition Letters, Volume 131, March 2020, Pages 390-397

There are still some open topics, so I opened this pull request as a draft to allow early comments on the following issues.

Handling of coefficients

The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).

| Coefficient | Min | Max | Step |
| ----------- | --- | --- | ---- |
| minDist     | 15  | 30  | 5    |
| lengthDiff  | 0   | 23  | 3    |
| offset      | 0   | 3   | 1    |
| length      | 0   | 5   | 1    |

Should we make the coefficients configurable? If so, we might need a configuration file because handling 12 additional parameters on the command line is quite messy.
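For illustration, a minimal sketch of what such a configuration file plus loader could look like (the `CoefficientRange` name and the JSON layout are hypothetical, not dinglehopper's actual API):

```python
import itertools
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CoefficientRange:
    """One row of Table 1: the values to try for a single coefficient."""
    min: int
    max: int
    step: int

    def values(self) -> range:
        return range(self.min, self.max + 1, self.step)


def load_coefficient_ranges(path: str) -> dict:
    """Read e.g. {"minDist": {"min": 15, "max": 30, "step": 5}, ...}."""
    with open(path) as f:
        return {name: CoefficientRange(**r) for name, r in json.load(f).items()}


def coefficient_combinations(ranges: dict):
    """Yield one coefficient dict per run, covering all combinations."""
    names = list(ranges)
    for combo in itertools.product(*(ranges[n].values() for n in names)):
        yield dict(zip(names, combo))
```

That would keep the CLI down to a single `--config` option instead of 12 individual parameters.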

The runs for each set of coefficients are also a good place for parallelization. Should we include parallelization at this point, or is this a non-issue because the processors in typical OCR-D workflows are already busy doing other things?

Penalty and distance functions

The algorithm depends heavily on a penalty and a distance function. From a library point of view I would like them to be exchangeable with other functions, but from an ocrd-processor/CLI perspective this is not really useful.

So should we make the distance and penalty functions configurable?
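From the library side this could simply be function parameters with defaults; a minimal sketch, with hypothetical signatures that do not match dinglehopper's actual API:

```python
from typing import Callable

# Hypothetical type aliases for illustration only.
DistanceFn = Callable[[str, str], int]
PenaltyFn = Callable[[int, int, int, int], float]


def default_penalty(min_dist: int, length_diff: int, offset: int, length: int) -> float:
    # Simplified stand-in; the actual formula is given in the paper.
    return min_dist + length_diff + offset - length


def flexible_character_accuracy(
    gt: str,
    ocr: str,
    distance: DistanceFn,
    penalty: PenaltyFn = default_penalty,
) -> float:
    # The algorithm body would call distance() and penalty() wherever it
    # compares line pairs or scores candidate matches.
    ...
```

The ocrd-processor/CLI would then just call this with the defaults, so the flexibility stays library-only.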

Use of caching

Because lines are regularly split and runs are repeated with different coefficients, the algorithm needs a lot of caching to speed up processing. The easiest way to do this with Python < 3.9 is to use the @lru_cache decorator.
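A minimal sketch of that approach (the plain Levenshtein function is only a stand-in for dinglehopper's actual distance code):

```python
from functools import lru_cache


@lru_cache(maxsize=100_000)
def cached_distance(a: str, b: str) -> int:
    """Memoized edit distance: repeated line splits and coefficient runs
    compare the same string pairs over and over, so cache hits are frequent."""
    # Plain dynamic-programming Levenshtein as a stand-in.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

(`functools.cache`, the unbounded variant, only arrived in Python 3.9, hence `lru_cache` for older versions.)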

Performance could benefit from a custom-tailored cache, but that would also add more code we have to maintain.

Performance

~~At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.~~

mikegerber commented 3 years ago

Awesome! I'll try to review it in the next few days.

b2m commented 3 years ago

Note that the flexible character accuracy in the current implementation will take minutes on real-world data (e.g. the ground truth provided in the test data folder).

I will try to investigate, but I suspect that it is a mix of:

Update: all confirmed and now the performance is "reasonable".

b2m commented 3 years ago

Based on my local benchmarking, using multiprocessing on 10 CPUs is estimated to run in about half the time while using roughly 4 times more CPU cycles.

The reason is that each process has its own cache, and the flexible character accuracy relies heavily on caching.

So we can decide to continue with multiprocessing and maybe a shared cache, or keep the code slower but simple (and maybe optimize in the future by running multiple documents in parallel).
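A rough sketch of the per-coefficient-set multiprocessing variant, with hypothetical names (`run_fca` stands in for one full run with a fixed coefficient set):

```python
from concurrent.futures import ProcessPoolExecutor


def run_fca(gt: str, ocr: str, coefficients: dict) -> float:
    # Placeholder for one flexible-character-accuracy run; in the real
    # code this is where the cached string comparisons happen.
    ...


def run_all(gt: str, ocr: str, coefficient_sets: list) -> list:
    # One task per coefficient set. Every worker process builds its own
    # lru_cache, so comparisons cached in one worker are recomputed in
    # the others -- which is where the ~4x extra CPU cycles go.
    with ProcessPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(run_fca, gt, ocr, c) for c in coefficient_sets]
        return [f.result() for f in futures]
```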

mikegerber commented 3 years ago

This is on the agenda for 2021-01, sorry for not doing it earlier :)

jbarth-ubhd commented 3 years ago

If there were enough CPU power, one could perhaps calculate editDist() for t2.length() ± n characters.

b2m commented 3 years ago

> If there were enough CPU power, one could perhaps calculate editDist() for t2.length() ± n characters.

FCA is using a sliding window to compare a smaller string with a longer one, taking inserts and deletes into consideration. For example, `a` is compared three times with `cba`:

```
a__   _a_   __a
cba   cba   cba
```

So there is no need to extend the window by n characters.
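A minimal illustration of the sliding-window idea (substitution-only counting stands in for the real edit distance):

```python
def sliding_window_distance(short: str, long: str) -> int:
    """Best match of `short` against all same-length windows of `long`."""
    n = len(short)

    def mismatches(window: str) -> int:
        # Substitution-only counting is enough for equal-length windows
        # in this illustration; the real measure uses edit distance.
        return sum(a != b for a, b in zip(short, window))

    return min(mismatches(long[i : i + n]) for i in range(len(long) - n + 1))


# sliding_window_distance("a", "cba") compares "a" against "c", "b", "a"
# -- the three alignments shown above -- and returns 0.
```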

b2m commented 3 years ago

Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).

mikegerber commented 3 years ago

> Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).

Or, expressed in a more positive manner: I am looking forward to getting the necessary features into RapidFuzz and then getting this PR merged too 😄