Closed: b2m closed this pull request 3 years ago.
Thanks for the contribution! To be honest this would be a lot easier to review if you hadn't refactored all of the relevant tests.
Yes... I refactored the tests (1st commit) to be able to test the original implementation and the new implementation in parallel (2nd commit) and then replace the original implementation (3rd commit).
I then messed it up during the second commit because I overlooked some tests.
So you can still go back to the second commit (fa617e8) and see both implementations work in parallel.
I had a Gitter chat with @b2m: I'll 1. cherry-pick the test refactoring and 2. see if I can come up with a Cython implementation of the Wagner-Fischer algorithm, because I am not so happy with the proposed transform_lists. If that does not work out, I'll try to be happy with the encode-list-elements-as-strings hack ;-)
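The encode-list-elements-as-strings hack could look roughly like this (a hypothetical sketch with illustrative names, not the actual PR code): every unique sequence element gets mapped to a unique Unicode character, so a string-only Levenshtein implementation can be reused for arbitrary hashable sequences.

```python
def encode_as_strings(seq_a, seq_b):
    """Encode two sequences of hashable elements as two strings,
    sharing one element-to-character mapping between them."""
    mapping = {}

    def encode(seq):
        chars = []
        for element in seq:
            if element not in mapping:
                # chr() caps this at 0x10FFFF (1,114,111) distinct elements
                mapping[element] = chr(len(mapping))
            chars.append(mapping[element])
        return "".join(chars)

    return encode(seq_a), encode(seq_b)


a, b = encode_as_strings(["the", "quick", "fox"], ["the", "lazy", "fox"])
# shared words map to the same character, so any string-based edit
# distance on a and b now reflects the word-level edit distance
```

The obvious downside, discussed further below, is the hard limit on the number of unique elements imposed by the Unicode code point range.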
(Best solution: Add a generic solution to python-Levenshtein)
There is another remark I should make: There are other sequence matching algorithms and I am not sure yet if I want them in here because they do not necessarily output the optimal solution.
> (Best solution: Add a generic solution to python-Levenshtein)
I agree that the best solution would be a contribution to the python-Levenshtein package. But I lack the experience in C to do this at a production-grade level 😞.
This is on the agenda for 2021-01, sorry for not doing it earlier :) Merry Christmas!
(I actually have a naive Cython implementation which is faster than the pure Python implementation. More on that in January!)
One thing to keep in mind is that python-Levenshtein is GPL licensed. When using an external Levenshtein distance implementation, it might be better to use a different library with a more fitting license. I did a performance comparison in the documentation of my own library (RapidFuzz): https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein and AFAIK there are multiple implementations that are more permissively licensed and faster than python-Levenshtein.
@maxbachmann Thanks, I will definitely check it out! Does your implementation allow for arbitrary sequences or is it limited to str? If you know any other implementations, please feel free to tell me :) (There are a lot of sequence alignment algorithms, but I don't know exactly how they behave, so I opted for the basic "optimal" algorithm, i.e. shortest alignment.)
@maxbachmann To elaborate a bit:
In dinglehopper, we need to align both sequences of Unicode characters and also sequences of words. (We need to align those, too, not only compute the edit distance.)
When aligning strings (sequences of Unicode characters) we also need to address a problem that comes up especially when dealing with OCR of historic prints: characters that cannot be represented with a single code point. For example, in "Schlym̃" there is a code point sequence/grapheme cluster LATIN SMALL LETTER M + COMBINING TILDE that cannot be normalized to a single code point. This needs to be treated as one "character" when aligning/calculating, one way or another.
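A rough way to group combining marks with their base character in plain Python might look like this. This is a simplification, not dinglehopper's actual implementation; full UAX #29 grapheme cluster segmentation needs e.g. the `uniseg` package or the `regex` module's `\X`.

```python
import unicodedata

def grapheme_clusters(text):
    """Group each combining mark with the preceding base character.

    Only an approximation of grapheme cluster segmentation, but it
    handles cases like LATIN SMALL LETTER M + COMBINING TILDE.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char  # attach combining mark to previous cluster
        else:
            clusters.append(char)
    return clusters

print(grapheme_clusters("Schlym\u0303"))  # ['S', 'c', 'h', 'l', 'y', 'm̃']
```

The 7-code-point string "Schlym̃" then becomes a sequence of 6 "characters" that can be fed into a sequence-based alignment.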
1. At least right now, RapidFuzz only calculates the edit distance and does not have a feature to return the editops. It mostly does not exist because I did not have the need for it yet. An implementation in RapidFuzz would probably perform pretty much exactly as well as the one in python-Levenshtein, since it simply extracts one possible path through the Levenshtein matrix.
2. Both python-Levenshtein and RapidFuzz currently only work with strings. E.g. https://github.com/roy-ht/editdistance will always calculate the edit distance between two sequences where each element has to be hashable, which would work with a list of words as well. I opened https://github.com/maxbachmann/RapidFuzz/issues/100 in RapidFuzz to discuss a potential implementation.
3. Similar to python-Levenshtein, RapidFuzz works with single code points. So e.g. "Schlym̃", which is represented as 7 characters in Python, will be interpreted as 7 characters by RapidFuzz as well. I guess when adding the extension from 2) you could pass the clustered version into the algorithms, which would work as expected (especially when you use a string multiple times, it would make sense to pass an array.array('Q', ...) so the hashes do not have to be calculated multiple times).
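Extracting one possible path through the Levenshtein matrix, as described above, can be sketched in pure Python like this: a Wagner-Fischer DP matrix plus a backtrace that emits an edit script in the style of python-Levenshtein's editops. This is illustrative only, not RapidFuzz's or python-Levenshtein's actual code, but note that it works on arbitrary sequences, not just strings.

```python
def editops(a, b):
    """Return one optimal edit script as (op, i, j) tuples,
    with op in {'replace', 'insert', 'delete'}."""
    m, n = len(a), len(b)
    # full DP matrix (O(m*n) memory) so we can backtrace afterwards
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution

    # backtrace from the bottom-right corner of the matrix
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                    # match, no edit needed
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("insert", i, j - 1))
            j = j - 1
        else:
            ops.append(("delete", i - 1, j))
            i = i - 1
    ops.reverse()
    return ops
```

Since each operation has unit cost, the number of returned operations equals the edit distance, and because the code only uses indexing and equality it also aligns lists of words or grapheme clusters.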
Alright, I decided something:
We still have the GPL-Problem @maxbachmann mentioned.
Dinglehopper is licensed as Apache 2 and therefore cannot use a GPL-licensed Python module (AFAIK).
I would consider it a showstopper for this PR and have not found a suitable replacement that has the same functionality (edit operations + edit distance).
If @maxbachmann's RapidFuzz provides (in the future) editops and the ability to process arbitrary sequences/array('Q'), we'll switch to that.
I am currently working on this. Will let you know when it is added.
@b2m I'll investigate this. I hate licensing problems :-) (My last review comment is weird, it's old and I don't know how to delete it :) )
@maxbachmann Awesome!
While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL, too. So the python-Levenshtein library cannot be used, unfortunately. (Using GPL tools in e.g. Makefiles seems to be fine, though, as these tools are used as separate programs.)
I'll make my team aware of this problem. @cneud
@kba This might be of interest for the OCR-D community, too.
Side note about GPLv3: As I understand this page, we also can't use GPLv3-licensed libraries in our Apache-licensed software. (It would work the other way around, but AFAICT only with GPLv3, not GPLv2.)
> While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library and we are not allowed to do that without making dinglehopper GPL
GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the matter. Generally, GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as, so it is unclear how the GPL has to be applied in Python. However, it is definitely simpler to use a library with a more permissive license, since that completely avoids those issues. As a note: FuzzyWuzzy, for example, was originally MIT licensed but became GPL licensed because they use python-Levenshtein and were unsure about this as well.
> GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the matter. Generally, GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as.
I am almost convinced that importing a Python library is essentially dynamic linking, but I agree it is not 100% clear. However, I have no doubt in this case because of the shared object file site-packages/Levenshtein/_levenshtein.cpython-38-x86_64-linux-gnu.so ;-)
I just checked all other (first-level) dependencies and all are licensed under less strict licenses (MIT, Apache, BSD). So I think we will go down this route for this aspect of the software as well. (Not discussed with the team yet, but I doubt there is anyone on my team who wants to fist-fight the Free Software Foundation over this.)
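A minimal version of such a dependency license check can be done with the standard library alone. This is a sketch using `importlib.metadata` (available since Python 3.8); a real CI job would more likely use a dedicated tool such as pip-licenses.

```python
from importlib.metadata import distributions

def dependency_licenses():
    """Map each installed distribution to its declared License metadata."""
    licenses = {}
    for dist in distributions():
        name = dist.metadata["Name"] or "unknown"
        # the License field is optional metadata, so fall back to a marker
        licenses[name] = dist.metadata.get("License") or "UNKNOWN"
    return licenses

# print an overview, e.g. to eyeball GPL-licensed dependencies
for name, license_text in sorted(dependency_licenses().items()):
    print(f"{name}: {license_text}")
```

Note this only reads the declared `License` metadata field, which packages are not required to fill in; a thorough check still needs a look at the actual license files.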
@mikegerber License checking could also become part of the CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to Circle CI or GitHub Actions =)
Should I then close this pull request and extract the necessary information from this thread into a separate ticket?
> @mikegerber License checking could also become part of the CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to Circle CI or GitHub Actions
I am fairly certain that we're not tainted by license incompatibilities anywhere but it would be nice to be more than "fairly certain". So if you do set up something like this, let me know, it might be useful to do this as part of the CI/CD of ocrd_all.
> @mikegerber License checking could also become part of the CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to Circle CI or GitHub Actions =)
I've reopened #54 for this!
> Should I then close this pull request and extract the necessary information from this thread into a separate ticket?
Yes, that would be nice!
Closing this pull request; I opened #56 to keep track of a more appropriate solution.
As the `distance` and `editops` calculation is a performance bottleneck in this application, I substituted the custom Levenshtein implementation with the C implementation in the python-Levenshtein package. We now also have separate entry points for texts with Unicode normalization and without. For example, when calculating the flexible character accuracy in #47, the normalization can be done more efficiently once during preprocessing.
Here are some benchmarks for the behaviour before and after this change using data already available in the repository:
Note that with these changes we can only compare Unicode documents with a total maximum of 1,114,111 unique characters (or words). The limit comes from Python's chr() function. But I suspect this will not be an issue now or in the near future.
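The limit mentioned above can be checked directly (assuming CPython; 0x10FFFF == 1114111 is the highest valid Unicode code point):

```python
# chr() accepts code points 0..0x10FFFF, so an encoding scheme that maps
# each unique element to chr(n) runs out of characters after that point
print(0x10FFFF)            # 1114111

highest = chr(0x10FFFF)    # works: the highest valid code point

try:
    chr(0x110000)          # one past the limit
except ValueError as exc:
    print("rejected:", exc)
```

So word-level comparison via the character-encoding trick is bounded by the number of *unique* words, not the document length, which is why this is unlikely to bite in practice.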