Closed: b2m closed this pull request 3 years ago.
Thanks for the contribution! To be honest this would be a lot easier to review if you hadn't refactored all of the relevant tests.
Yes... I refactored the tests (1st commit) to be able to test the original implementation and the new implementation in parallel (2nd commit) and then replace the original implementation (3rd commit).
I then messed it up during the second commit because I overlooked some tests.
So you can still go back to the second commit (fa617e8) and see both implementations work in parallel.
I had a Gitter chat with @b2m: I'll 1. cherry-pick the test refactoring and 2. see if I can come up with a Cython implementation of the Wagner-Fischer algorithm, because I am not so happy with the proposed transform_lists. If that does not work out, I'll try to be happy with the encode-list-elements-as-strings hack ;-)
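The encode-list-elements-as-strings hack could look roughly like this (a hypothetical sketch with illustrative names, not the actual PR code): every unique sequence element gets mapped to a unique Unicode character, so a string-only Levenshtein implementation can be reused for arbitrary hashable sequences.

```python
def encode_as_strings(seq_a, seq_b):
    """Encode two sequences of hashable elements as two strings,
    sharing one element-to-character mapping between them."""
    mapping = {}

    def encode(seq):
        chars = []
        for element in seq:
            if element not in mapping:
                # chr() caps this at 0x10FFFF (1,114,111) distinct elements
                mapping[element] = chr(len(mapping))
            chars.append(mapping[element])
        return "".join(chars)

    return encode(seq_a), encode(seq_b)


a, b = encode_as_strings(["the", "quick", "fox"], ["the", "lazy", "fox"])
# shared words map to the same character, so any string-based edit
# distance on a and b now reflects the word-level edit distance
```

The obvious downside, discussed further below, is the hard limit on the number of unique elements imposed by the Unicode code point range.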
(Best solution: Add a generic solution to python-Levenshtein)
There is another remark I should make: There are other sequence matching algorithms and I am not sure yet if I want them in here because they do not necessarily output the optimal solution.
> (Best solution: Add a generic solution to python-Levenshtein)
I agree that the best solution would be a contribution to the python-Levenshtein package. But I lack the experience in C to do this at a production-grade level 😞.
This is on the agenda for 2021-01, sorry for not doing it earlier :) Merry Christmas!
(I actually have a naive Cython implementation which is faster than the pure Python implementation. More on that in January!)
One thing to keep in mind is that python-Levenshtein is GPL licensed. When using an external Levenshtein distance implementation, it might be better to use a different library with a more fitting license. I did a performance comparison in the documentation of my own library (RapidFuzz): https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein and AFAIK there are multiple implementations that are more permissively licensed and faster than python-Levenshtein.
@maxbachmann Thanks, I will definitely check it out! Does your implementation allow for arbitrary sequences or is it limited to str? If you know any other implementations, please feel free to tell me :) (There are a lot of sequence alignment algorithms, but I don't know exactly how they behave, so I opted for the basic "optimal" algorithm, i.e. shortest alignment.)
@maxbachmann To elaborate a bit:
In dinglehopper, we need to align both sequences of Unicode characters and also sequences of words. (We need to align those, too, not only compute the edit distance.)
When aligning strings (sequences of Unicode characters) we also need to address a problem that comes up especially when dealing with OCR of historic prints: characters that cannot be represented with a single code point. For example, in "Schlym̃" there is a code point sequence/grapheme cluster LATIN SMALL LETTER M + COMBINING TILDE that cannot be normalized to a single code point. This needs to be treated as one "character" when aligning/calculating, one way or another.
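A rough way to group combining marks with their base character in plain Python might look like this. This is a simplification, not dinglehopper's actual implementation; full UAX #29 grapheme cluster segmentation needs e.g. the `uniseg` package or the `regex` module's `\X`.

```python
import unicodedata

def grapheme_clusters(text):
    """Group each combining mark with the preceding base character.

    Only an approximation of grapheme cluster segmentation, but it
    handles cases like LATIN SMALL LETTER M + COMBINING TILDE.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char  # attach combining mark to previous cluster
        else:
            clusters.append(char)
    return clusters

print(grapheme_clusters("Schlym\u0303"))  # ['S', 'c', 'h', 'l', 'y', 'm̃']
```

The 7-code-point string "Schlym̃" then becomes a sequence of 6 "characters" that can be fed into a sequence-based alignment.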
1. At least right now, RapidFuzz only calculates the edit distance and does not have a feature to return the editops. It mostly does not exist because I did not have the need for it yet. An implementation in RapidFuzz would probably perform pretty much exactly as well as the one in python-Levenshtein, since it simply extracts one possible path through the Levenshtein matrix.
2. Both python-Levenshtein and RapidFuzz currently only work with strings. E.g. https://github.com/roy-ht/editdistance will always calculate the edit distance between two sequences where each element has to be hashable, which would work with a list of words as well. I opened https://github.com/maxbachmann/RapidFuzz/issues/100 in RapidFuzz to discuss a potential implementation.
3. Similar to python-Levenshtein, RapidFuzz works with single code points. So e.g. "Schlym̃", which is represented as 7 characters in Python, will be interpreted as 7 characters by RapidFuzz as well. I guess when adding the extension from 2) you could pass the clustered version into the algorithms, which would work as expected (especially when you use a string multiple times, it would make sense to pass an array.array('Q', ...) so the hashes do not have to be calculated multiple times).
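Extracting one possible path through the Levenshtein matrix, as described above, can be sketched in pure Python like this: a Wagner-Fischer DP matrix plus a backtrace that emits an edit script in the style of python-Levenshtein's editops. This is illustrative only, not RapidFuzz's or python-Levenshtein's actual code, but note that it works on arbitrary sequences, not just strings.

```python
def editops(a, b):
    """Return one optimal edit script as (op, i, j) tuples,
    with op in {'replace', 'insert', 'delete'}."""
    m, n = len(a), len(b)
    # full DP matrix (O(m*n) memory) so we can backtrace afterwards
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution

    # backtrace from the bottom-right corner of the matrix
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                    # match, no edit needed
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("insert", i, j - 1))
            j = j - 1
        else:
            ops.append(("delete", i - 1, j))
            i = i - 1
    ops.reverse()
    return ops
```

Since each operation has unit cost, the number of returned operations equals the edit distance, and because the code only uses indexing and equality it also aligns lists of words or grapheme clusters.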
Alright, I decided something:
We still have the GPL-Problem @maxbachmann mentioned.
Dinglehopper is licensed as Apache 2 and therefore cannot use a GPL-licensed Python module (AFAIK).
I would consider it a showstopper for this PR and have not found a suitable replacement that has the same functionality (edit operations + edit distance).
If @maxbachmann's RapidFuzz provides (in the future) editops and the ability to process arbitrary sequences/array('Q'), we'll switch to that.
I am currently working on this. Will let you know when it is added.
@b2m I'll investigate this. I hate licensing problems :-) (My last review comment is weird, it's old and I don't know how to delete it :) )
@maxbachmann Awesome!
While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL, too. So the python-Levenshtein library cannot be used, unfortunately. (Using GPL tools in e.g. Makefiles seems to be fine, though, as these tools are used as separate programs.)
I'll make my team aware of this problem. @cneud
@kba This might be of interest for the OCR-D community, too.
Side note about GPLv3: As I understand this page, we also can't use GPLv3-licensed libraries in our Apache-licensed software. (It would work the other way around, but AFAICT only with GPLv3, not GPLv2.)
> While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library and we are not allowed to do that without making dinglehopper GPL
GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the matter. Generally, GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as, so it is unclear how the GPL has to be applied in Python. However, it is definitely simpler to use a library with a more permissive license, since that completely avoids those issues. As a note: FuzzyWuzzy, for example, was originally MIT licensed but became GPL licensed because they use python-Levenshtein and were unsure about this as well.
> GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the matter. Generally, GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as.
I am almost convinced that importing a Python library is essentially dynamic linking, but I agree it is not 100% clear. However, I have no doubt in this case because of the shared object file site-packages/Levenshtein/_levenshtein.cpython-38-x86_64-linux-gnu.so ;-)
I just checked all other (first-level) dependencies and all are licensed under less strict licenses (MIT, Apache, BSD). So I think we will go down this route for this aspect of the software as well. (Not discussed with the team yet, but I doubt there is anyone on my team who wants to fist-fight the Free Software Foundation over this.)
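A minimal version of such a dependency license check can be done with the standard library alone. This is a sketch using `importlib.metadata` (available since Python 3.8); a real CI job would more likely use a dedicated tool such as pip-licenses.

```python
from importlib.metadata import distributions

def dependency_licenses():
    """Map each installed distribution to its declared License metadata."""
    licenses = {}
    for dist in distributions():
        name = dist.metadata["Name"] or "unknown"
        # the License field is optional metadata, so fall back to a marker
        licenses[name] = dist.metadata.get("License") or "UNKNOWN"
    return licenses

# print an overview, e.g. to eyeball GPL-licensed dependencies
for name, license_text in sorted(dependency_licenses().items()):
    print(f"{name}: {license_text}")
```

Note this only reads the declared `License` metadata field, which packages are not required to fill in; a thorough check still needs a look at the actual license files.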
@mikegerber License checking could also become part of the CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to Circle CI or GitHub Actions =)
Should I then close this pull request and extract the necessary information from this thread into a separate ticket?
> @mikegerber License checking could also become part of the CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to Circle CI or GitHub Actions
I am fairly certain that we're not tainted by license incompatibilities anywhere but it would be nice to be more than "fairly certain". So if you do set up something like this, let me know, it might be useful to do this as part of the CI/CD of ocrd_all.
> @mikegerber License checking could also become part of the CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to Circle CI or GitHub Actions =)
I've reopened #54 for this!
> Should I then close this pull request and extract the necessary information from this thread into a separate ticket?
Yes, that would be nice!
Closing this pull request; I opened #56 to keep track of a more appropriate solution.
As the `distance` and `editops` calculation is a performance bottleneck in this application, I substituted the custom Levenshtein implementation with the C implementation in the python-Levenshtein package. We now also have separate entry points for texts with Unicode normalization and without. For example, when calculating the flexible character accuracy in #47, the normalization can be done more efficiently once during preprocessing.
Here are some benchmarks for the behaviour before and after this change using data already available in the repository:
Note that with these changes we can only compare Unicode documents with a total maximum of 1,114,111 unique characters (or words). The limit comes from Python's chr() function. But I suspect this will not be an issue now or in the near future.
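The limit mentioned above can be checked directly (assuming CPython; 0x10FFFF == 1114111 is the highest valid Unicode code point):

```python
# chr() accepts code points 0..0x10FFFF, so an encoding scheme that maps
# each unique element to chr(n) runs out of characters after that point
print(0x10FFFF)            # 1114111

highest = chr(0x10FFFF)    # works: the highest valid code point

try:
    chr(0x110000)          # one past the limit
except ValueError as exc:
    print("rejected:", exc)
```

So word-level comparison via the character-encoding trick is bounded by the number of *unique* words, not the document length, which is why this is unlikely to bite in practice.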