Open cdunklau opened 7 years ago
I tried the following tests to reproduce this
import os.path
import unittest
from twisted.web.http_headers import Headers
from genshi.template import TemplateLoader
#from infobat import http
here = os.path.dirname(os.path.abspath(__file__))
infobat_dir = os.path.dirname(here)
source_tree_root = os.path.dirname(infobat_dir)
templates_dir = os.path.join(source_tree_root, 'web')
class IsolatedRenderGenshiTemplateTestCase(unittest.TestCase):
def assertTemplateRenders(self, template_filename, **template_params):
"""
Ensure that the template renders without an exception and
produces a result at least 10 long.
"""
template = TemplateLoader(templates_dir).load(template_filename)
result = template.generate(**template_params).render(
'html', doctype='html5', encoding='utf-8')
self.assertGreaterEqual(len(result), 10)
#def renderTemplate(request, tmpl, **kwargs):
# request.setHeader('Content-type', 'text/html; charset=utf-8')
# request.write(tmpl
# .generate(**kwargs)
# .render('html', doctype='html5', encoding='utf-8'))
# request.finish()
def test_edit_ban_template_renders(self):
channel = mask = mode = set_at = set_by = \
expire_at = reason = unset_at = unset_by = u''
unset_at = u'2016-09-14 12:08:32+00:00'
reason = u'this person was bad'
ban = (
channel, mask, mode, set_at, set_by,
expire_at, reason, unset_at, unset_by
)
self.assertTemplateRenders('edit_ban.html', ban=ban, message=None)
def test_edit_ban_template_renders_with_nonascii_reason(self):
channel = mask = mode = set_at = set_by = \
expire_at = reason = unset_at = unset_by = u''
unset_at = u'2016-09-14 12:08:32+00:00'
reason = u'this person was bad and used a dot: \u2022'
ban = (
channel, mask, mode, set_at, set_by,
expire_at, reason, unset_at, unset_by
)
self.assertTemplateRenders('edit_ban.html', ban=ban, message=None)
I expect the second test to cause a similar UnicodeDecodeError as seen live on the site, but both tests pass. However, if I change reason = u'this person was bad and used a dot: \u2022'
to reason = u'this person was bad and used a dot: \u2022'.encode('utf-8')
instead, the following trace occurs, which is nearly the same as what we saw live on the site:
[ERROR]
Traceback (most recent call last):
File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/unittest/case.py", line 329, in run
testMethod()
File "/Users/cdunklau/Development/infobob/infobat/tests/test_http.py", line 55, in test_edit_ban_template_renders_with_nonascii_reason
self.assertTemplateRenders('edit_ban.html', ban=ban, message=None)
File "/Users/cdunklau/Development/infobob/infobat/tests/test_http.py", line 25, in assertTemplateRenders
'html', doctype='html5', encoding='utf-8')
File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/core.py", line 184, in render
return encode(generator, method=method, encoding=encoding, out=out)
File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 57, in encode
return _encode(''.join(list(iterator)))
File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 475, in __call__
for kind, data, _ in stream:
File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 829, in __call__
for kind, data, pos in stream:
File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 669, in __call__
for kind, data, pos in stream:
File "/Users/cdunklau/Development/infobob/venv/lib/python2.7/site-packages/genshi/output.py", line 786, in __call__
text = escape(pop_text(), quotes=False)
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
@habnabit removed the offending character from the database and the pages started working again.
This will require some deeper investigation. Since I can't reproduce this with unicode, I suspect that at some point between retrieval and rendering, the reason
(which should be unicode, from sqlite) gets encoded to bytes. Perhaps adbapi is doing some strangeness?
And I just spotted it. https://github.com/pound-python/infobob/blob/3abbab4c2290ea070eb16f6cf41144cfc558b558/infobat/database.py#L48-L49 sets the sqlite text factory to str, so the template does indeed get called with bytes.
Fixing this will be tricky, mainly because we can't be sure that the bytes we get from IRC are in any particular encoding. Most of the time, it's probably UTF-8, and the encoding of certain things we get from the server (nicks, hosts, etc) are likely even more predictable, but the uncertainty is there and should be somehow dealt with.
The first step needs to be writing as many tests as possible, at least for the web side.
An interim solution for this specific issue could just be to decode strings from the db with UTF-8 before handing them to the template... but that's a stopgap measure at best.
Trying to handle non-UTF-8 stuff from IRC is almost certainly not worth the effort. The fix for this should just be to assume utf-8 for all bytestrings we want to put into the db, and use one of the str.decode
(py2) error handlers. 'replace'
would probably be fine, but some kind of escaping would be better so we wouldn't actually lose data. Unfortunately, py2 doesn't offer decoding with 'backslashreplace'
(only encoding). Perhaps using a custom codecs.register_error
implementation would be reasonable.
Pasting this into the ban detail and submitting the form causes a crash.