ADD lookup table for levenshtein matrix and tempcaching - Githubissues

qurator-spk / dinglehopper

An OCR evaluation tool

Apache License 2.0

63 stars 15 forks source link

ADD lookup table for levenshtein matrix and tempcaching #7

Closed JKamlah closed 4 years ago

JKamlah commented 4 years ago

Hey,

I've added a lookup table for the levenshtein matrix calculation, which reduces the calculation time between 10%-50%. The second part caches the matrix results temporarily, which skips two out of five levenshtein matrix calculations (with default settings). Hope you like it.

mikegerber commented 4 years ago

Thanks!

I have to review the look-up table next week (also the tests break) :)

Let's discuss the tempcache/memoization first: I am thinking about the possible use case for it? Multiple runs of the tool for the same input data? Why? Also, if there is a need for it, I'd prefer a decorator approach to it rather than weaving memoization into the function itself (similar to this just file-backed perhaps). I'd also like to use os.path.join for most path joining.

(I much prefer multiple PRs for multiple independent changes, too, because it makes reviewing and merging a lot easier. I know that's more work but ... in the end its easier to work with 🙂)

mikegerber commented 4 years ago

Ah maybe I'm mistaken about the caching – I need to review if I calculate the same matrix more than once in one run. If that is the case, it is absolutely a good idea to cache it!

JKamlah commented 4 years ago

Fixed the problem :). The hashname for the tempfiles was not unique.

mikegerber commented 4 years ago

I've implemented caching the Levenshtein matrix in another way I found more elegant: I use the standard library function functools.lru_cache() for it and handle "the hash problem" by converting sequences to tuples as we need the whole sequence anyway and tuples are hashable (in the Python sense). This caches in memory, not in files.

Do you see any reason to cache in files? If so, we should discuss it, but implementation should be in a decorator (like functools.lru_cache() does it).

See https://github.com/qurator-spk/dinglehopper/commit/58ff140bc013702901eca037b1358dc574dc88e1 for the implementation.

JKamlah commented 4 years ago

I really like your decorator functools.lru_cache() version. It is indeed really elegant. I thought if the code would compare the content line or paragraphwise, a filecache could come handy. But i guess the lru_cache fulfils the main purpose in a more efficent way. What do you think about the Levenshtein lookup implementation?

mikegerber commented 4 years ago

I still have to review that change :) Just too busy with other things.

mikegerber commented 4 years ago

~~Reminder to myself: Check memory usage of the cache, also empty the cache between input files when running through OCR-D interface.~~

mikegerber commented 4 years ago

@JKamlah did you close this by accident? I still had to review your algorithm changes.