Sure, an option for choosing between fuzziness algorithms would be fine by me.
In case you're interested in writing a PR that implements the msgmerge fuzziness algorithm in Python: the relevant C source seems to be this fragment in msgmerge, which historically (before https://github.com/autotools-mirror/gettext/commit/101de35be7c8e73329d8801f9eca938690130cb1 in 2008) used this fuzzy-search implementation but uses this hash-based implementation these days. (I have no idea whether the old and new implementations differ in their results – sorry!)
The "meat" of Gettext's fuzziness function, fstrcmp_bounded, can be found here in gnulib. I believe it's an implementation of the Levenshtein edit-distance algorithm, while difflib.get_close_matches(), which Babel uses, is something quite different.
Hope that helps.
> Sure, an option for choosing between fuzziness algorithms would be fine by me.
To be completely honest I don't have the fuel to do that myself right now. I hope an enthusiastic volunteer with a lot of spare time will discover this discussion and do some magic :-)
The algorithm msgmerge (gettext) uses to find and mark fuzzy translations will, for instance, recognise a message whose trailing exclamation point was replaced with a dot as a fuzzy match.
pybabel update uses a different algorithm, which leads to significantly different outcomes.
I will revert to using msgmerge because I prefer its fuzzy-matching algorithm. It would be good if this significant difference were documented in the help of pybabel update, or better yet, if the user could choose between the two algorithms.
What do you think?
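For what it's worth, such an option would probably only need to thread a pluggable similarity function through the matching step. A minimal sketch of the idea (all names below are hypothetical illustrations, not Babel's actual API):

```python
from difflib import SequenceMatcher

def difflib_score(a: str, b: str) -> float:
    """Default scorer: the metric behind difflib.get_close_matches()."""
    return SequenceMatcher(None, a, b).ratio()

def best_fuzzy_match(msgid, old_msgids, similarity=difflib_score, threshold=0.85):
    """Return the old msgid most similar to *msgid*, or None if nothing
    clears *threshold*.  A Levenshtein-based scorer could be passed as
    *similarity* to approximate msgmerge's behaviour instead."""
    scored = ((similarity(msgid, old), old) for old in old_msgids)
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None
```

A command-line option would then only need to select which scorer gets passed in.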