meshy / framewirc

An IRC toolkit built upon Python 3's asyncio module
BSD 2-Clause "Simplified" License

Some encodings cause errors to be thrown #7

Open meshy opened 8 years ago

meshy commented 8 years ago

This reverts commit b1e96882e80dfd026c4d6197c6721cfcf5110950.

This test was in master, but has been removed because it causes failures and it's not obvious how to fix them. The issue is that cChardet can report encodings that Python does not support, which can cause decoding errors in to_unicode.
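To make the failure mode concrete, here is a minimal sketch that asks Python's codec registry whether it can resolve a given name. It assumes nothing about framewirc's internals: `python_supports` and `unsupported` are hypothetical helper names, and the list of encodings is the one discussed later in this thread.

```python
import codecs

# Encoding names mentioned in this issue as possible cChardet results.
REPORTED_NAMES = [
    "EUC-TW",
    "HZ-GB-2312",
    "ISO-2022-CN",
    "TIS-620",
    "X-ISO-10646-UCS-4-2143",
    "X-ISO-10646-UCS-4-3412",
]


def python_supports(name):
    """Return True if Python's codec registry can resolve `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Raised for names that no registered codec (or alias) matches.
        return False
    return True


def unsupported(names):
    """Return the names from `names` that Python cannot look up."""
    return [name for name in names if not python_supports(name)]


if __name__ == "__main__":
    for name in REPORTED_NAMES:
        print(name, python_supports(name))
```

A check like this could run before decoding, so that an unknown name is handled deliberately instead of surfacing as a LookupError inside to_unicode.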

The solutions that occur to me are:

meshy commented 8 years ago

cChardet returns a number of encodings that Python cannot decode directly.

In some cases, cChardet returns an unsupported alias for an encoding that Python supports. In those cases, the encoding can be mapped on our side, but the best long-term solution would be to submit a PR to Python to add the alias to the list for that particular encoding.

In other cases, it returns an encoding that is so similar to a supported encoding that it could be swapped out. Again, we should map it locally, but in those cases PRs should be submitted to cChardet to change the name of the encoding returned.

Finally, there are encodings detected by cChardet that are not supported by Python at all. I originally thought that it'd be best to add them to Python, but it turns out that their addition has already been marked as wontfix. Not sure where to go with that. Perhaps the patch in the associated issue could be turned into a library? I don't think I'd want to support it, but others might still have a use for it.

| From cChardet | Alias to / Swap for | Notes |
| --- | --- | --- |
| EUC-TW | | (Our failing test.) Won't fix in Python. |
| HZ-GB-2312 | | Combination of ASCII and gb2312. EDIT: Appears to be supported! |
| ISO-2022-CN | | Won't fix in Python. |
| TIS-620 | ISO-8859-11 | Same except 0xA0 (no-break space) is unassigned in TIS-620. New experiments show this as working. |
| X-ISO-10646-UCS-4-2143 | UTF-32 | |
| X-ISO-10646-UCS-4-3412 | UTF-32 | |
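The local mapping idea could be sketched roughly as follows. This is not framewirc's to_unicode: `LOCAL_SWAPS` and `decode_with_fallback` are hypothetical names, the two UTF-32 entries are taken from the table above and are untested assumptions (those names describe unusual byte orders, so the swap may still fail), and falling back to UTF-8 with replacement characters is just one possible policy for the wontfix encodings.

```python
# Hypothetical local mapping from names cChardet may report to names that
# Python knows. Entries are taken from the table in this issue and are
# assumptions, not verified equivalences.
LOCAL_SWAPS = {
    "X-ISO-10646-UCS-4-2143": "utf-32",
    "X-ISO-10646-UCS-4-3412": "utf-32",
}


def decode_with_fallback(raw, detected, fallback="utf-8"):
    """Decode `raw` bytes using the detected encoding name where possible.

    `detected` is the name reported by the detector (it may be None).
    Unknown names and undecodable input fall back to `fallback`, with
    invalid bytes replaced by U+FFFD instead of raising.
    """
    name = detected or fallback
    name = LOCAL_SWAPS.get(name.upper(), name)
    try:
        return raw.decode(name)
    except (LookupError, UnicodeDecodeError):
        # LookupError: Python has no codec for this name (e.g. EUC-TW).
        # UnicodeDecodeError: the swap or the detection was wrong.
        return raw.decode(fallback, errors="replace")
```

The trade-off is that text in EUC-TW or ISO-2022-CN would come out garbled with this approach, but the caller would no longer see an exception for an encoding name it cannot do anything about.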

Edit: corrected misconception that HZ-GB-2312 is the same as gb2312.