impactcentre / ocrevalUAtion

OCR evaluation brought to you by University of Alicante
Apache License 2.0

Additional string comparison methods #3

Closed cneud closed 10 years ago

cneud commented 10 years ago

Some common additional string comparison methods may be useful to implement, e.g.

rccarrasco commented 10 years ago

Some text files use the Unicode line separator (U+2028) or paragraph separator (U+2029) instead of the traditional CR/LF, while others do not. Both can be made equivalent to U+0020 (a white space, for the purpose of comparison between files) in the replacements.txt file. The current version of this file also contains equivalences for the IMPACT code points in the Private Use Area and for some rarely supported codes. Apparently the regular expression "\p{Space}" does not match U+2028.
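As a rough illustration of the equivalence described above, here is a minimal Python sketch that maps both separators to a plain space before two texts are compared. The table `SEPARATOR_EQUIVALENCES` and the function `normalize_separators` are hypothetical names chosen for this example; they do not reflect the actual format of replacements.txt or the tool's own code.

```python
# Hypothetical equivalence table (NOT the replacements.txt format):
# maps Unicode separator code points to an ordinary space, U+0020.
SEPARATOR_EQUIVALENCES = {
    0x2028: " ",  # LINE SEPARATOR
    0x2029: " ",  # PARAGRAPH SEPARATOR
}


def normalize_separators(text: str) -> str:
    """Replace U+2028 and U+2029 with U+0020 so they compare as white space."""
    return text.translate(SEPARATOR_EQUIVALENCES)


# Two versions of the same text that differ only in the separator used.
with_separator = "first line\u2028second line"
with_space = "first line second line"

print(normalize_separators(with_separator) == normalize_separators(with_space))
```

Applying the same normalization to both files before comparison keeps the separator choice from being counted as a difference.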

rccarrasco commented 10 years ago

For two sentences of lengths l1 and l2 respectively (in number of words), the indel distance d and the number delta of differences are related by: d = abs(l1 - l2) + 2 * delta. On the other hand, the number of errors is n = abs(l1 - l2) + delta, and therefore n = 0.5 * (abs(l1 - l2) + d).
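The closing relation, n = 0.5 * (abs(l1 - l2) + d), can be checked on a small example. The sketch below is only an illustration, not the tool's implementation: it assumes the indel distance is computed at word level from the longest common subsequence (d = l1 + l2 - 2 * LCS), and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Insertions plus deletions needed to turn a into b (no substitutions)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def error_count(a, b):
    """n = 0.5 * (abs(l1 - l2) + d); the sum is always even, so // is exact."""
    d = indel_distance(a, b)
    return (abs(len(a) - len(b)) + d) // 2


reference = "the quick brown fox".split()
candidate = "the quick brown dog jumps".split()

print(indel_distance(reference, candidate))  # d
print(error_count(reference, candidate))     # n
```

Here l1 = 4 and l2 = 5; the sentences share three words in order, so d = 3 and n = 2, which corresponds to one replaced word and one extra word.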