Closed JKamlah closed 4 years ago
Thanks!
I have to review the look-up table next week (also the tests break) :)
Let's discuss the tempcache/memoization first: what is the intended use case for it? Multiple runs of the tool on the same input data? Why? Also, if there is a need for it, I'd prefer a decorator approach rather than weaving memoization into the function itself (similar to this, just file-backed perhaps). I'd also like to use os.path.join for most path joining.
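A file-backed decorator along the lines suggested here might look like the following. This is only a hypothetical sketch, not dinglehopper's actual code; the decorator name `file_cache`, the cache directory, and the example function are all made up for illustration:

```python
import hashlib
import os
import pickle
import tempfile
from functools import wraps


def file_cache(func):
    """Hypothetical file-backed memoization decorator (illustration only).

    Results are pickled into a temp directory, keyed by a hash of the
    function name and its arguments, so repeated runs of the tool on the
    same input could reuse earlier results.
    """
    cache_dir = os.path.join(tempfile.gettempdir(), "example_cache")
    os.makedirs(cache_dir, exist_ok=True)

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Derive a stable cache key from the call signature.
        key = hashlib.sha256(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result

    return wrapper


@file_cache
def slow_square(n):
    # Stand-in for an expensive computation such as a Levenshtein matrix.
    return n * n
```

Because the cache lives on disk, it survives across process runs, which is exactly the property in-memory memoization lacks.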
(I much prefer multiple PRs for multiple independent changes, too, because it makes reviewing and merging a lot easier. I know that's more work but ... in the end it's easier to work with 🙂)
Ah maybe I'm mistaken about the caching – I need to review if I calculate the same matrix more than once in one run. If that is the case, it is absolutely a good idea to cache it!
Fixed the problem :). The hashname for the tempfiles was not unique.
I've implemented caching of the Levenshtein matrix in another way I found more elegant: I use the standard library function functools.lru_cache() and handle "the hash problem" by converting sequences to tuples, as we need the whole sequence anyway and tuples are hashable (in the Python sense). This caches in memory, not in files.
Do you see any reason to cache in files? If so, we should discuss it, but the implementation should be in a decorator (like functools.lru_cache() does it).
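The tuples-plus-lru_cache approach described here can be sketched as follows. This is a minimal illustration, not the actual dinglehopper implementation (see the linked commit for that); the function names are assumptions:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein_matrix(seq1: tuple, seq2: tuple):
    """Compute the full Levenshtein DP matrix for two tuples.

    The arguments must be tuples (hashable) so lru_cache can key on them.
    """
    m, n = len(seq1), len(seq2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + cost,  # substitution
            )
    return D


def distance(s1, s2):
    # Convert arbitrary sequences to tuples so the cached function
    # always receives hashable arguments.
    return levenshtein_matrix(tuple(s1), tuple(s2))[-1][-1]
```

A repeated call with the same pair then returns the memoized matrix instead of recomputing it, which is visible in `levenshtein_matrix.cache_info()`.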
See https://github.com/qurator-spk/dinglehopper/commit/58ff140bc013702901eca037b1358dc574dc88e1 for the implementation.
I really like your decorator version with functools.lru_cache(). It is indeed really elegant. I thought that if the code compared the content line- or paragraph-wise, a file cache could come in handy, but I guess lru_cache fulfils the main purpose in a more efficient way. What do you think about the Levenshtein lookup implementation?
I still have to review that change :) Just too busy with other things.
Reminder to myself: Check memory usage of the cache, also empty the cache between input files when running through OCR-D interface.
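Both checks in that reminder are supported directly by lru_cache. A minimal sketch, assuming the cached function is called `levenshtein_matrix` (the real name may differ):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein_matrix(seq1, seq2):
    # Stand-in for the expensive matrix computation.
    return (seq1, seq2)


levenshtein_matrix(("a",), ("b",))

# Check memory usage of the cache: cache_info() reports hits, misses,
# and the current number of cached entries.
info = levenshtein_matrix.cache_info()

# Empty the cache between input files, e.g. when running through the
# OCR-D interface, so entries from one document pair do not accumulate
# while the next pair is processed.
levenshtein_matrix.cache_clear()
```

`cache_clear()` drops all entries at once; there is no per-entry eviction API, so clearing between files is the natural granularity.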
@JKamlah did you close this by accident? I still had to review your algorithm changes.
Hey,
I've added a lookup table for the Levenshtein matrix calculation, which reduces the calculation time by 10–50%. The second part caches the matrix results temporarily, which skips two out of five Levenshtein matrix calculations (with default settings). Hope you like it.