Open comex opened 2 years ago
This issue also seems to be present on master. I don't think we've done any testing with UTF-16 before (I certainly don't recall this).
I'm open to trying to drop `_clean_body`, but I suspect many other things will break.
Which server are you using? I'd like to test the new implementation to confirm that it handles UTF-16 correctly.
Version: 6d0f9e94e423353f6101360c8e805258eab1c78e
Python: Python 3.10.4 on macOS
I'm using vdirsyncer with a CalDAV server which returns results in UTF-16.
Test case:
Output:
In this case, `_clean_body` is stripping the 0x00 byte representing the upper half of each ASCII code unit, as well as the 0x12 byte representing the upper half of the non-ASCII code unit 0x1234. This results in something that resembles UTF-8, but where the non-ASCII character is corrupted, and the byte-order mark at the beginning is invalid for UTF-8. The latter causes `etree.XML` to fail to parse it.

If `_parse_xml` is changed to bypass `_clean_body` and pass the UTF-16 string directly to `etree.XML`, it works fine in this case; the parser is able to guess the encoding.

I suggest changing `_parse_xml` to first try to parse XML as-is, and only call `_clean_body` if that fails.
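
A rough sketch of that suggestion, in case it helps the discussion. The names `clean_body` and `parse_xml` here are stand-ins, not the actual vdirsyncer functions: `clean_body` only models the behavior described above (deleting control bytes from the raw body), and I'm assuming `etree` is the standard library's `xml.etree.ElementTree`.

```python
import codecs
import xml.etree.ElementTree as etree

# Assumption: a stand-in for _clean_body, whose source is not quoted in this
# issue. It deletes ASCII control bytes (including 0x00 and 0x12) and keeps
# tab, LF and CR, which are legal in XML 1.0.
_CONTROL_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def clean_body(content: bytes) -> bytes:
    """Delete control bytes from a raw response body."""
    return content.translate(None, _CONTROL_BYTES)


def parse_xml(content: bytes) -> etree.Element:
    """Parse the body as-is first; strip control bytes only if that fails."""
    try:
        return etree.XML(content)
    except etree.ParseError:
        return etree.XML(clean_body(content))


# A UTF-16 (little-endian, with BOM) document containing U+1234.
utf16_body = codecs.BOM_UTF16_LE + "<a>\u1234</a>".encode("utf-16-le")

# Cleaning first removes the 0x00 and 0x12 bytes, leaving a UTF-16 BOM
# followed by what looks like single-byte text.
cleaned = clean_body(utf16_body)
try:
    etree.XML(cleaned)
    clean_first_ok = True
except etree.ParseError:
    clean_first_ok = False

# Parsing as-is lets the parser detect UTF-16 from the BOM.
root = parse_xml(utf16_body)

# A UTF-8 body with a stray control byte still goes through the fallback.
dirty_utf8 = b"<a>x\x0by</a>"
fallback_root = parse_xml(dirty_utf8)
```

The trade-off is that a malformed body gets parsed twice, but well-formed responses in any encoding the parser can detect are never modified.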