Closed by durka 9 years ago
I found the bug. Right here the comment text is automatically word-wrapped to 75 columns. When a line has no obvious place to break (like a space), the wrapper blindly inserts a \n at position 75, counting bytes rather than characters (0x0A is the character code for \n).
Sure enough:
>>> u"ェント/頭のいい/さかしい/利口ということを意味し".encode('utf8')[72:74]
'\xe3\x81'
So inserting a \n at position 75 produces the invalid sequence 0xE3 0x81 0x0A, which is what Perl complains about. This also explains why some apparently similar strings cause no problems (by luck, the break lands on a character boundary) and why the problem does not appear in the definition or notes fields (those are not word-wrapped).
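To make the failure mode concrete, here is a minimal Python 3 sketch of a byte-oriented hard wrap. It is not the jbovlaste code (that is Perl); `naive_byte_wrap` and the sample string are made up for illustration. The leading `/` shifts the 3-byte characters so that byte 75 falls in the middle of one.

```python
def naive_byte_wrap(data: bytes, width: int = 75) -> bytes:
    """Insert b'\\n' after every `width` bytes, ignoring character boundaries.

    Hypothetical stand-in for a wrapper that counts bytes, not characters.
    """
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return b"\n".join(chunks)


# One ASCII byte followed by 30 characters of 3 bytes each (E3 81 95).
text = "/" + "さ" * 30
wrapped = naive_byte_wrap(text.encode("utf-8"))

# The newline lands after the first two bytes of a three-byte character.
print(wrapped[73:76])

try:
    wrapped.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)
```

Without the leading `/`, the break would land exactly on a character boundary (75 is a multiple of 3) and nothing would go wrong, which matches the "some similar strings work" observation.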
I believe the fix is to drop in Text::WrapI18N instead of Text::Wrap.
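The idea behind that fix, sketched in Python 3 rather than Perl: wrap on decoded characters and only encode afterwards, so a break can never fall inside a multibyte sequence. This is an illustration of the principle, not how Text::WrapI18N is implemented; `wrap_by_characters` is a hypothetical helper, and it counts code points rather than display columns, so it is cruder than a width-aware wrapper.

```python
def wrap_by_characters(text: str, width: int = 75) -> str:
    """Hard-wrap `text` every `width` code points (not bytes).

    Hypothetical helper; it ignores word boundaries and display width.
    """
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))


comment = "/" + "さ" * 100          # 101 characters, no spaces to break on
wrapped_text = wrap_by_characters(comment)

# Encoding after wrapping always yields valid UTF-8.
encoded = wrapped_text.encode("utf-8")
assert encoded.decode("utf-8") == wrapped_text

print([len(line) for line in wrapped_text.split("\n")])
```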
Fixing the word wrap exposed a bug in the encoding: comments are mis-encoded before being stored in the database.
Attempting to put certain strings in the comment field causes a strange error. The same strings seem to work fine in the definition and notes fields.
See http://jbovlaste.lojban.org/dict/test. If you copy the notes and attempt to post that as a comment, this error results:
Vexingly, the erroneous sequence does not appear to exist in the posted string:
However, Python can give a somewhat more specific description of the error:
This was discovered by @Ilmen-vodhr.