WebUI: Entering non-ascii into ban notes causes unicode error

cdunklau commented 7 years ago

Clearly a bot

14:35 < cdunklau> don't be afraid people, this is just a test https://gist.github.com/habnabit/3805075
14:35 < neuTrynaFreak> Title: one expression chatserver · GitHub (at gist.github.com)

Pasting this into the ban detail and submitting the form causes a crash.

cdunklau commented 7 years ago

I tried the following tests to reproduce this

import os.path
import unittest

from twisted.web.http_headers import Headers
from genshi.template import TemplateLoader

#from infobat import http

here = os.path.dirname(os.path.abspath(__file__))
infobat_dir = os.path.dirname(here)
source_tree_root = os.path.dirname(infobat_dir)
templates_dir = os.path.join(source_tree_root, 'web')

class IsolatedRenderGenshiTemplateTestCase(unittest.TestCase):
    def assertTemplateRenders(self, template_filename, **template_params):
        """
        Ensure that the template renders without an exception and
        produces a result at least 10 long.

        """
        template = TemplateLoader(templates_dir).load(template_filename)
        result = template.generate(**template_params).render(
            'html', doctype='html5', encoding='utf-8')
        self.assertGreaterEqual(len(result), 10)

#def renderTemplate(request, tmpl, **kwargs):
#    request.setHeader('Content-type', 'text/html; charset=utf-8')
#    request.write(tmpl
#                  .generate(**kwargs)
#                  .render('html', doctype='html5', encoding='utf-8'))
#    request.finish()

    def test_edit_ban_template_renders(self):
        channel = mask = mode = set_at = set_by = \
            expire_at = reason = unset_at = unset_by = u''
        unset_at = u'2016-09-14 12:08:32+00:00'
        reason = u'this person was bad'
        ban = (
            channel, mask, mode, set_at, set_by,
            expire_at, reason, unset_at, unset_by
        )
        self.assertTemplateRenders('edit_ban.html', ban=ban, message=None)

    def test_edit_ban_template_renders_with_nonascii_reason(self):
        channel = mask = mode = set_at = set_by = \
            expire_at = reason = unset_at = unset_by = u''
        unset_at = u'2016-09-14 12:08:32+00:00'
        reason = u'this person was bad and used a dot: \u2022'
        ban = (
            channel, mask, mode, set_at, set_by,
            expire_at, reason, unset_at, unset_by
        )
        self.assertTemplateRenders('edit_ban.html', ban=ban, message=None)

I expect the second test to cause a similar UnicodeDecodeError as seen live on the site, but both tests pass. However, if I change reason = u'this person was bad and used a dot: \u2022' to reason = u'this person was bad and used a dot: \u2022'.encode('utf-8') instead, the following trace occurs, which is nearly the same as what we saw live on the site:

[ERROR]
Traceback (most recent call last):
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File "/Users/cdunklau/Development/infobob/infobat/tests/test_http.py", line 55, in test_edit_ban_template_renders_with_nonascii_reason
    self.assertTemplateRenders('edit_ban.html', ban=ban, message=None)
  File "/Users/cdunklau/Development/infobob/infobat/tests/test_http.py", line 25, in assertTemplateRenders
    'html', doctype='html5', encoding='utf-8')
  File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/core.py", line 184, in render
    return encode(generator, method=method, encoding=encoding, out=out)
  File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 57, in encode
    return _encode(''.join(list(iterator)))
  File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 475, in __call__
    for kind, data, _ in stream:
  File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 829, in __call__
    for kind, data, pos in stream:
  File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 669, in __call__
    for kind, data, pos in stream:
  File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 786, in __call__
    text = escape(pop_text(), quotes=False)
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)

@habnabit removed the offending character from the database and the pages started working again.

This will require some deeper investigation. Since I can't reproduce this with unicode, I suspect that at some point between retrieval and rendering, the reason (which should be unicode, from sqlite) gets encoded to bytes. Perhaps adbapi is doing some strangeness?

cdunklau commented 7 years ago

And I just spotted it. https://github.com/pound-python/infobob/blob/3abbab4c2290ea070eb16f6cf41144cfc558b558/infobat/database.py#L48-L49 sets the sqlite text factory to str, so the template does indeed get called with bytes.

cdunklau commented 7 years ago

Fixing this will be tricky, mainly because we can't be sure that the bytes we get from IRC are in any particular encoding. Most of the time, it's probably UTF-8, and the encoding of certain things we get from the server (nicks, hosts, etc) are likely even more predictable, but the uncertainty is there and should be somehow dealt with.

The first step needs to be writing as many tests as possible, at least for the web side.

An interim solution for this specific issue could just be to decode strings from the db with UTF-8 before handing them to the template... but that's a stopgap measure at best.

cdunklau commented 5 years ago

Trying to handle non-UTF-8 stuff from IRC is almost certainly not worth the effort. The fix for this should just be to assume utf-8 for all bytestrings we want to put into the db, and use one of the str.decode (py2) error handlers. 'replace' would probably be fine, but some kind of escaping would be better so we wouldn't actually lose data. Unfortunately, py2 doesn't offer decoding with 'backslashreplace' (only encoding). Perhaps using a custom codecs.register_error implementation would be reasonable.

pound-python / infobob

WebUI: Entering non-ascii into ban notes causes unicode error #2