Sure, an option for choosing between fuzziness algorithms would be fine by me.
In case you're interested in writing a PR that implements the msgmerge fuzziness algorithm in Python: the relevant C source seems to be this fragment in msgmerge, which historically (before https://github.com/autotools-mirror/gettext/commit/101de35be7c8e73329d8801f9eca938690130cb1 in 2008) used this fuzzy-search implementation but uses this hash-based implementation these days. (I have no idea whether the old and new implementations differ in their results – sorry!)
The "meat" of Gettext's fuzziness function, fstrcmp_bounded, can be found here in gnulib. I believe it's an implementation of the Levenshtein edit-distance algorithm, while difflib.get_close_matches(), which Babel uses, is something quite different.
Hope that helps.
> Sure, an option for choosing between fuzziness algorithms would be fine by me.
To be completely honest I don't have the fuel to do that myself right now. I hope an enthusiastic volunteer with a lot of spare time will discover this discussion and do some magic :-)
The algorithm msgmerge (gettext) uses to find and mark fuzzy translations will, for instance, recognise a message whose trailing exclamation point was replaced with a dot as a fuzzy match.
pybabel update uses a different algorithm, which leads to significantly different outcomes.
I will revert to using msgmerge because I prefer its fuzzy-matching algorithm. It would be good if this significant difference were documented in the help of pybabel update, or better yet, if the user could choose between the two algorithms.
What do you think?
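For what it's worth, such an option would probably only need to thread a pluggable similarity function through the matching step. A minimal sketch of the idea (all names below are hypothetical illustrations, not Babel's actual API):

```python
from difflib import SequenceMatcher

def difflib_score(a: str, b: str) -> float:
    """Default scorer: the metric behind difflib.get_close_matches()."""
    return SequenceMatcher(None, a, b).ratio()

def best_fuzzy_match(msgid, old_msgids, similarity=difflib_score, threshold=0.85):
    """Return the old msgid most similar to *msgid*, or None if nothing
    clears *threshold*.  A Levenshtein-based scorer could be passed as
    *similarity* to approximate msgmerge's behaviour instead."""
    scored = ((similarity(msgid, old), old) for old in old_msgids)
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None
```

A command-line option would then only need to select which scorer gets passed in.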