lojban / jbovlaste

http://jbovlaste.lojban.org

comment word-wrapping broken for UTF-8 #160

Closed durka closed 9 years ago

durka commented 9 years ago

Attempting to post certain strings in the comment field causes a strange error. The same strings are accepted in the definition and notes fields, it seems.

See http://jbovlaste.lojban.org/dict/test. If you copy the notes from that entry and attempt to post them as a comment, this error results:

DBD::Pg::db do failed: ERROR: invalid byte sequence for encoding "UTF8": 0xe3 0x81 0x0a at /srv/jbovlaste/current/post.html line 121.

Vexingly, the erroneous sequence does not appear to exist in the posted string:

>>> u"ェント/頭のいい/さかしい/利口ということを意味し".encode('utf8').index('\x0a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found

However, Python can give a somewhat more specific description of the error:

>>> "\xe3\x81\x0a".decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
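
The "invalid continuation byte" message follows from the UTF-8 layout: a lead byte of 0xE3 announces a three-byte character, so the next two bytes must both have the bit pattern 10xxxxxx. The following is a minimal Python 3 sketch of that check; the helper names `sequence_length_from_lead` and `is_continuation` are made up for illustration and are not part of jbovlaste or of Python's codec.

```python
def sequence_length_from_lead(lead: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its lead byte."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                      # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte: 0x%02x" % lead)


def is_continuation(byte: int) -> bool:
    """A continuation byte always looks like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


# The sequence from the error message.
seq = bytes([0xE3, 0x81, 0x0A])

needed = sequence_length_from_lead(seq[0])           # 0xE3 asks for 3 bytes
checks = [is_continuation(b) for b in seq[1:needed]]
print(needed, checks)
```

Here 0x81 passes the continuation check and 0x0A does not, which is why the decoder stops at that point.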

This was discovered by @Ilmen-vodhr.

durka commented 9 years ago

I found the bug. Before a comment is stored, its text is automatically word-wrapped to 75 columns. When there is no obvious place to break the line (such as a space), the wrapper blindly inserts a \n at position 75 (0x0A is the character code for \n).

Sure enough:

>>> u"ェント/頭のいい/さかしい/利口ということを意味し".encode('utf8')[72:74]
'\xe3\x81'

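Those two bytes are the start of a three-byte character, so inserting a \n at position 75 splits that character and produces the invalid sequence 0xE3 0x81 0x0A. PostgreSQL rejects it, and the error surfaces through Perl's DBD::Pg. This also explains why some apparently similar strings cause no problems (by luck, their break points fall on character boundaries) and why the problem does not appear in the definition or notes fields (those are not word-wrapped).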
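
To make the failure mode concrete, here is a minimal Python 3 sketch that contrasts breaking on byte offsets with breaking on character offsets. It is not the jbovlaste code (which is Perl); `naive_wrap_bytes` and `wrap_chars` are hypothetical helpers, the test string is invented, and the width of 74 is an assumption chosen so that the newline becomes the 75th byte, as described above.

```python
def naive_wrap_bytes(data: bytes, width: int = 74) -> bytes:
    """Insert a newline after every `width` bytes, ignoring character boundaries."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return b"\n".join(chunks)


def wrap_chars(text: str, width: int = 74) -> str:
    """Insert a newline after every `width` code points, so no character is split."""
    chunks = [text[i:i + width] for i in range(0, len(text), width)]
    return "\n".join(chunks)


# Invented sample: 30 copies of HIRAGANA LETTER A (E3 81 82), no spaces to break on.
text = "\u3042" * 30
data = text.encode("utf-8")           # 90 bytes

broken = naive_wrap_bytes(data)       # newline lands inside the 25th character
try:
    broken.decode("utf-8")
    byte_wrap_ok = True
except UnicodeDecodeError:
    byte_wrap_ok = False

safe = wrap_chars("\u3042" * 100)     # 74 characters, newline, 26 characters
char_wrap_ok = safe.encode("utf-8").decode("utf-8") == safe

print(byte_wrap_ok, char_wrap_ok)
```

Since 74 is not a multiple of 3, the byte-oriented version cuts a character after its first two bytes and leaves 0xE3 0x81 0x0A in the output. A real fix would also need to account for display width, which this sketch does not attempt.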

I believe the fix is to drop in Text::WrapI18N instead of Text::Wrap.

teleological commented 9 years ago

Fixing the word wrap exposed a bug in the encoding: comments are mis-encoded before being stored in the database.