Awesome! I'll try to review it in the next few days.
Note that the flexible character accuracy in the current implementation will take minutes on real-world data (e.g. the ground truth provided in the test data folder).
I will try to investigate, but I suspect that it is a mix of:
Update: all confirmed and now the performance is "reasonable".
So using multiprocessing is estimated by my local benchmarking to run in about half the time on 10 CPUs, while using ~4 times more CPU cycles.
The reason is that each process has its own cache, and flexible character accuracy relies heavily on caching.
So we can decide to continue with multiprocessing and maybe a shared cache, or keep the code slower but simple... (and maybe optimize in the future by running multiple documents in parallel).
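For reference, a minimal sketch of what a shared cache across worker processes could look like, using a `multiprocessing.Manager` dict. Everything here (the `distance` stand-in, the tuple-based call convention) is made up for illustration, and note that every access to the proxied dict goes through IPC, which may well eat the saved cycles:

```python
from multiprocessing import Manager, Pool

def distance(a, b):
    # Illustrative stand-in for the expensive distance computation:
    # a plain dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cached_distance(args):
    cache, a, b = args
    if (a, b) not in cache:             # check-then-set is racy, but a race only
        cache[(a, b)] = distance(a, b)  # duplicates work, it never corrupts results
    return cache[(a, b)]

if __name__ == "__main__":
    with Manager() as manager:
        cache = manager.dict()  # one dict proxy shared by all worker processes
        pairs = [("abc", "abd"), ("abc", "abd"), ("foo", "bar")]
        with Pool(4) as pool:
            print(pool.map(cached_distance, [(cache,) + p for p in pairs]))
        # -> [1, 1, 3]
```

Whether this actually beats independent per-process caches would need benchmarking on real data.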
This is on the agenda for 2021-01, sorry for not doing it earlier :)
If there were enough CPU power, one could perhaps calculate editDist() of t2.length() ± n characters.
FCA uses a sliding window to compare a smaller string with a longer one, taking insertions and deletions into consideration.
For example, `a` is compared three times with `cba`:

```
a__ _a_ __a
cba cba cba
```
So there is no need to extend the window by n characters.
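As an illustration, here is a minimal sketch of that window logic; pad characters simply count as mismatches here, whereas the real implementation of course uses proper edit distances and penalties:

```python
def window_alignments(short, long, pad="_"):
    """Slide `short` across `long`, padding the remaining positions."""
    shift = len(long) - len(short)
    for offset in range(shift + 1):
        padded = pad * offset + short + pad * (shift - offset)
        mismatches = sum(a != b for a, b in zip(padded, long))
        yield padded, mismatches

for padded, m in window_alignments("a", "cba"):
    print(padded, m)
# a__ 3
# _a_ 3
# __a 2
```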
Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).
Or expressed in a more positive manner: I am looking forward to getting the necessary features into RapidFuzz and then getting this PR merged too 😄
This is a first draft for adding the flexible character accuracy as suggested by @cneud in #32.
There are still some open topics, so I opened this pull request as a draft so that you can already comment on some issues.
Handling of coefficients
The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).
Should we make the coefficients configurable? If so, we might need a configuration file because handling 12 additional parameters on the command line is quite messy.
The runs for each set of coefficients are also a good place for parallelization. Should we include parallelization at this point, or is this a non-issue because the processors in OCR-D workflows are typically already busy doing other things?
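To make the discussion concrete, here is a sketch of how configurable coefficients plus one parallel run per coefficient set could look. The field names, ranges, and the `run_fca` stub are invented; the real coefficients are the ones from Table 1 of the paper:

```python
from dataclasses import dataclass
from itertools import product
from multiprocessing import Pool
from typing import Dict, List

@dataclass(frozen=True)
class Coefficients:
    # Placeholder names for the actual coefficients from Table 1.
    min_dist: float
    length_diff: float
    offset: float
    length: float

def run_fca(coeffs):
    # Stub standing in for one complete FCA run with a fixed coefficient
    # set; it would return the resulting accuracy for that set.
    return 0.0

def best_accuracy(ranges: Dict[str, List[float]]) -> float:
    combos = [Coefficients(**dict(zip(ranges, values)))
              for values in product(*ranges.values())]
    with Pool() as pool:  # one FCA run per coefficient set, in parallel
        return max(pool.map(run_fca, combos))
```

A configuration file could then simply map coefficient names to value ranges.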
Penalty and distance functions
The algorithm depends a lot on a penalty and a distance function. From a library point of view, I would like them to be exchangeable with other functions. But from an ocrd-processor/CLI perspective this is not really useful.
So should we make the distance and penalty functions configurable?
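One way to keep them exchangeable at the library level, while the CLI sticks to fixed defaults, is plain dependency injection. The names and the naive distance below are purely illustrative:

```python
from typing import Callable

def naive_distance(a: str, b: str) -> int:
    # Crude stand-in: positional mismatches plus the length difference.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def relative_penalty(dist: int, length: int) -> float:
    return dist / max(length, 1)

def flexible_accuracy(
    gt: str,
    ocr: str,
    distance: Callable[[str, str], int] = naive_distance,
    penalty: Callable[[int, int], float] = relative_penalty,
) -> float:
    # Both strategies are injected, so library users can swap them out,
    # while a CLI would just never expose these parameters.
    return 1.0 - penalty(distance(gt, ocr), len(gt))

print(flexible_accuracy("flexible", "flexibel"))  # 0.75 with the stand-ins
```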
Use of caching
Because the algorithm regularly splits lines and repeats runs with different coefficients, it needs a lot of caching to speed things up. The easiest way to do this with Python < 3.9 is to use the @lru_cache decorator.
The performance could benefit from a custom-tailored cache, but it would also add more code we have to maintain.
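For illustration, this is roughly what the `@lru_cache` approach looks like; `line_distance` is a made-up stand-in for the actual cached functions:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # functools.cache is the Python >= 3.9 equivalent
def line_distance(gt_line: str, ocr_line: str) -> int:
    # Made-up stand-in for the expensive per-line distance computation.
    # FCA splits lines and retries with different coefficients, so the
    # same (gt, ocr) pairs recur and cache hits are frequent.
    return sum(a != b for a, b in zip(gt_line, ocr_line)) + abs(
        len(gt_line) - len(ocr_line)
    )

line_distance("flexible", "flexibel")  # computed
line_distance("flexible", "flexibel")  # served from the cache
print(line_distance.cache_info())      # CacheInfo(hits=1, misses=1, ...)
```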
Performance
~At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.~