matrix-org / python-canonicaljson

Canonical JSON
Apache License 2.0

encoding fails with non-BMP characters on narrow python builds #12

Closed richvdh closed 6 years ago

richvdh commented 6 years ago

TIL: some Python builds ("narrow" builds) don't support non-BMP characters (code points above U+FFFF) in their strings: http://wordaligned.org/articles/narrow-python

The symptoms of this are things like:

  File "/var/synapse/.synapse/lib/python2.7/site-packages/canonicaljson.py", line 142, in encode_canonical_json
    return _unascii(s)
  File "/var/synapse/.synapse/lib/python2.7/site-packages/canonicaljson.py", line 121, in _unascii
    chunks.append(unichr(c))
exceptions.ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Since we're only using unichr here to build a string which we're about to utf-8 encode, we could instead add utf-16 surrogates, which apparently are handled correctly by .encode("utf-8").

In fact, given .encode("utf-8") handles utf-16 surrogates, why are we bothering to unpack the utf-16 surrogates at all?
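
To make the surrogate idea above concrete, here is a minimal sketch of the arithmetic involved. The helper names (to_surrogate_pair, from_surrogate_pair, IS_NARROW_BUILD) are hypothetical and are not part of canonicaljson; this only illustrates how a non-BMP code point maps to a UTF-16 surrogate pair, not how the library was actually fixed.

```python
import sys

# On a narrow build sys.maxunicode is 0xFFFF; on wide builds (and on every
# Python 3.3+) it is 0x10FFFF.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def to_surrogate_pair(code_point):
    """Hypothetical helper: split a non-BMP code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %r" % (code_point,))
    offset = code_point - 0x10000        # 20 bits remain after subtracting
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low (trail) surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Hypothetical inverse: recombine a surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 splits into the pair 0xD83D, 0xDE00.
high, low = to_surrogate_pair(0x1F600)
```

On a narrow Python 2 build, the failing unichr(c) call could presumably be replaced by appending unichr(high) + unichr(low) when c is above 0xFFFF. That is an assumption based on the traceback above, not a description of the actual patch.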

ara4n commented 6 years ago

TIL all macOS Pythons are narrow (as are win32 ones), so this becomes a royal pain for anyone trying to develop on macOS (especially if they're working on emoji autocomplete in Riot ;P)

mpwp commented 6 years ago

I installed from git and it works sweet! Thank you very much!