py-bson / bson

Independent BSON codec for Python that doesn't depend on MongoDB.
Other
443 stars 82 forks source link

encode_cstring raises UnicodeEncodeError on Python 2 with non-ASCII Unicode values #97

Closed cjwatson closed 4 years ago

cjwatson commented 4 years ago

I ran into this test failure while trying to upgrade Launchpad's version of bson from 0.3.3 to 0.5.8:

Error in test lp.services.webapp.tests.test_errorlog.TestErrorReportingUtility.test_raising_with_request
Traceback (most recent call last):
_StringException: librarian-log: {{{
2020-02-07 18:34:49+0530 [-] Loading /home/cjwatson/src/canonical/launchpad/git/launchpad/daemons/librarian.tac...
2020-02-07 18:35:01+0530 [-] Loaded.
2020-02-07 18:35:01+0530 [-] twistd 19.2.1 (/home/cjwatson/src/canonical/launchpad/git/launchpad/env/bin/python2.7 2.7.12) starting up.
2020-02-07 18:35:01+0530 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2020-02-07 18:35:01+0530 [-] FileUploadFactory starting on 34971
2020-02-07 18:35:01+0530 [-] Starting factory <lp.services.librarianserver.libraryprotocol.FileUploadFactory instance at 0xf0db438c>
2020-02-07 18:35:01+0530 [-] Site starting on 41285
2020-02-07 18:35:01+0530 [-] Starting factory <twisted.web.server.Site instance at 0xf0db492c>
2020-02-07 18:35:01+0530 [-] FileUploadFactory starting on 34569
2020-02-07 18:35:01+0530 [-] Starting factory <lp.services.librarianserver.libraryprotocol.FileUploadFactory instance at 0xf0db458c>
2020-02-07 18:35:01+0530 [-] Site starting on 33495
2020-02-07 18:35:01+0530 [-] Starting factory <twisted.web.server.Site instance at 0xf0db46ec>
2020-02-07 18:35:01+0530 [-] Not using upstream librarian
2020-02-07 18:35:01+0530 [-] daemon ready!
}}}

Traceback (most recent call last):
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/lib/lp/services/webapp/tests/test_errorlog.py", line 147, in test_raising_with_request
    report = utility.raising(sys.exc_info(), request)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/lib/lp/services/webapp/errorlog.py", line 379, in raising
    if self._oops_config.publish(report) is None:
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/oops/config.py", line 218, in publish
    ret = publisher(report)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/oops/publishers.py", line 113, in result
    ret.extend(publisher(report))
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/oops/publishers.py", line 113, in result
    ret.extend(publisher(report))
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/oops/publishers.py", line 85, in result
    ret.extend(publisher(report))
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/oops_amqp/publisher.py", line 86, in __call__
    oops_id = "OOPS-%s" % md5(dumps(report)).hexdigest()
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/__init__.py", line 40, in dumps
    generator_func=generator, on_unknown=on_unknown)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 237, in encode_document
    generator_func, on_unknown)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 210, in encode_value
    generator_func, on_unknown))
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 360, in encode_document_element
    generator_func=generator_func, on_unknown=on_unknown)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 237, in encode_document
    generator_func, on_unknown)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 199, in encode_value
    buf.write(encode_string_element(name, value))
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 168, in encode_string_element
    return b"\x02" + encode_cstring(name) + encode_string(value)
  File "/home/cjwatson/src/canonical/launchpad/git/launchpad/env/local/lib/python2.7/site-packages/bson/codec.py", line 130, in encode_cstring
    value = str(value).encode("utf-8")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u25a0' in position 0: ordinal not in range(128)

The document that's being encoded is a test document that happens to include a Unicode key. And the code quoted in the traceback is clearly wrong: str should be text_type here in order to have the right behaviour on both Python 2 and 3.

After adding a test and fixing this, I ran into another problem on the decoding path:

ERROR: test_utf8_binary (bson.tests.test_binary.TestBinary)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cjwatson/src/python/bson/bson/tests/test_binary.py", line 58, in test_utf8_binary
    self.test_binary()
  File "/home/cjwatson/src/python/bson/bson/tests/test_binary.py", line 53, in test_binary
    decoded = loads(dump)
  File "/home/cjwatson/src/python/bson/bson/__init__.py", line 47, in loads
    return decode_document(data, 0)[1]
  File "/home/cjwatson/src/python/bson/bson/codec.py", line 313, in decode_document
    value = unicode(value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Looking at decode_document, there are a couple of places where it does .decode("utf-8") on Python 3 and unicode() on Python 2, even though the Python 3 approach would work just fine in both cases.

I'm preparing a PR to fix all this, which I'll post shortly.

cjwatson commented 4 years ago

(Yes, I know we need to get off Python 2. But upgrading bson is one of the necessary steps towards doing so, which is why I ran into this ...)