Closed by cneud 11 years ago
Some text files use the Unicode line separator (U+2028) or paragraph separator (U+2029) instead of the traditional CR/LF, while others do not. For the purpose of comparing files, both can be made equivalent to U+0020 (a space) in the replacements.txt file. The current version of this file also contains equivalences for the Impact Unicode code points in the private use area and for some rarely supported codes. Apparently the regular expression "\p{Space}" does not match U+2028.
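As a rough illustration of the idea, the sketch below maps both separators to a plain space before two texts are compared. It is written in Python for brevity and does not read replacements.txt. The `SEPARATOR_EQUIVALENCES` table and the `normalize_separators` helper are hypothetical names used only for this example.

```python
# Hypothetical equivalence table: both Unicode separators become U+0020.
SEPARATOR_EQUIVALENCES = {
    "\u2028": " ",  # LINE SEPARATOR
    "\u2029": " ",  # PARAGRAPH SEPARATOR
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(SEPARATOR_EQUIVALENCES)


def normalize_separators(text: str) -> str:
    """Return text with U+2028 and U+2029 replaced by a plain space."""
    return text.translate(_TABLE)


with_separator = "first line\u2028second line\u2029third line"
with_spaces = "first line second line third line"

# After normalization the two texts compare as equal.
same = normalize_separators(with_separator) == with_spaces
```

Doing the replacement explicitly avoids relying on how a particular regular expression engine defines its whitespace class.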
For two sentences of lengths l1 and l2 (in words), the indel distance d and the number delta of differences are related by: d = abs(l1 - l2) + 2 * delta, since each difference counts as one deletion plus one insertion. On the other hand, the number of errors is n = abs(l1 - l2) + delta, and therefore n = 0.5 * (abs(l1 - l2) + d).
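A small worked check of this relation, as a sketch only: it assumes the indel distance is computed from the longest common subsequence of words (d = l1 + l2 - 2 * LCS) and that a replaced word counts as one deletion plus one insertion. The function names are illustrative and are not taken from the project code. The relation is only checked on one example whose alignment consists of substitutions plus the abs(l1 - l2) unmatched words.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Insertions plus deletions needed to turn a into b (no substitutions)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def errors_from_indel(a, b):
    """n = 0.5 * (abs(l1 - l2) + d); the sum is always even, so // is exact."""
    return (abs(len(a) - len(b)) + indel_distance(a, b)) // 2


def levenshtein(a, b):
    """Word-level edit distance with unit-cost substitutions, for comparison."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


s1 = "the quick brown fox jumps".split()  # l1 = 5
s2 = "the quick red fox".split()          # l2 = 4

d = indel_distance(s1, s2)     # one substitution (2) + one deletion (1)
n = errors_from_indel(s1, s2)  # 0.5 * (1 + d)
```

Here one word is replaced ("brown" by "red") and one is dropped ("jumps"), so delta = 1 and abs(l1 - l2) = 1, which gives d = 3 and n = 2, matching the word-level edit distance for this pair.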
Some common additional string comparison methods may be useful to implement, e.g.