pimutils / vdirsyncer

📇 Synchronize calendars and contacts.
https://vdirsyncer.pimutils.org/
Other
1.57k stars 163 forks source link

_clean_body mangles UTF-16 #1001

Open comex opened 2 years ago

comex commented 2 years ago

Version: 6d0f9e94e423353f6101360c8e805258eab1c78e Python: Python 3.10.4 on macOS

I'm using vdirsyncer with a CalDAV server which returns results in UTF-16.

Test case:

import vdirsyncer.storage.dav

a = '<foo><bar>\u1234</bar></foo>'.encode('utf-16')
print('bytes:', a)
print('cleaned:', vdirsyncer.storage.dav._clean_body(a))
print('parsed:', vdirsyncer.storage.dav._parse_xml(a))

Output:

bytes: b'\xff\xfe<\x00f\x00o\x00o\x00>\x00<\x00b\x00a\x00r\x00>\x004\x12<\x00/\x00b\x00a\x00r\x00>\x00<\x00/\x00f\x00o\x00o\x00>\x00'
Your server incorrectly returned ASCII control characters in its XML. Vdirsyncer ignores those, but this is a bug in your server.
cleaned: b'\xff\xfe<foo><bar>4</bar></foo>'
Your server incorrectly returned ASCII control characters in its XML. Vdirsyncer ignores those, but this is a bug in your server.
Traceback (most recent call last):
  File "/Users/comex/src/vdirsyncer/vdirsyncer/storage/dav.py", line 92, in _parse_xml
    return etree.XML(_clean_body(content))
  File "/opt/homebrew/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/xml/etree/ElementTree.py", line 1349, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 3

In this case, _clean_body is stripping the 0x00 byte representing the upper half of each ASCII code unit, as well as the 0x12 byte representing the upper half of the non-ASCII code unit 0x1234. This results in something that resembles UTF-8, but where the non-ASCII character is corrupted, and the byte-order mark at the beginning is invalid for UTF-8. The latter causes etree.XML to fail to parse it.

If _parse_xml is changed to bypass _clean_body and pass the UTF-16 string directly to etree.XML, it works fine in this case; the parser is able to guess the encoding.

I suggest changing _parse_xml to first try to parse XML as-is, and only call _clean_body if that fails.

WhyNotHugo commented 2 years ago

Issue also seems present on master. I don't think we've done any testing with utf-16 before (I certainly don't recall this).

I'm open to trying to drop _clean_body, but I suspect many others things will break.

WhyNotHugo commented 1 year ago

Which server are you using? I'd like to test the new implementation to confirm that it handles UTF-16 correctly.