python / cpython

The Python programming language
https://www.python.org

xml.etree.ElementTree: file source must be binary for non-UTF-8 encodings #99064

Open coproc opened 2 years ago

coproc commented 2 years ago

Documentation

The documentation for xml.etree.ElementTree.parse describes the first argument, source, as follows: "... source is a filename or file object containing XML data. ..."

For file objects, however, this does not work in all cases one would expect. A note like the following is missing:

For file objects containing XML data in an encoding other than ASCII or UTF-8 (e.g. ISO 8859-1), the file must have been opened in binary mode.

Otherwise (if the file is opened in text mode, regardless of the specified encoding), non-ASCII characters are not read correctly. (See this question on Stack Overflow and the attached files in test_parseXml.zip for reproducing the problem.)

Here is an excerpt of the attached test code:

import xml.etree.ElementTree

# ok: binary mode, the parser decodes the bytes using the encoding in the XML declaration
with open('test_ISO-8859-1.xml', 'rb') as fileInBinary:
    root = xml.etree.ElementTree.parse(fileInBinary).getroot()
print(root.attrib['attributeWithUmlauts'])

# garbage: text mode, even though the matching encoding is passed to open()
with open('test_ISO-8859-1.xml', 'r', encoding='ISO-8859-1') as fileInText:
    root = xml.etree.ElementTree.parse(fileInText).getroot()
print(root.attrib['attributeWithUmlauts'])

giving the following output:

äöü
äöü
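
For readers without the attached zip, the working case can be reproduced with a self-contained sketch along these lines. It only covers the two paths that are expected to work: a file object opened in binary mode, and a plain filename. The file content and the temporary path are stand-ins made up for illustration, not the actual attached test file.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

UMLAUTS = "\u00e4\u00f6\u00fc"  # the three characters a-umlaut, o-umlaut, u-umlaut

# Hypothetical stand-in for the attached test file: a document that declares
# ISO-8859-1 and is stored on disk in that encoding.
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    f'<root attributeWithUmlauts="{UMLAUTS}"/>\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test_ISO-8859-1.xml"
    path.write_bytes(XML_TEXT.encode("iso-8859-1"))

    # Binary mode: the parser receives the raw bytes and applies the
    # encoding named in the XML declaration.
    with open(path, "rb") as fh:
        from_binary = ET.parse(fh).getroot().attrib["attributeWithUmlauts"]

    # Passing a filename lets ElementTree open the file in binary mode itself.
    from_filename = ET.parse(str(path)).getroot().attrib["attributeWithUmlauts"]

print(from_binary == from_filename == UMLAUTS)
```

The text-mode case is deliberately left out of the sketch, since what it produces is exactly what this issue is about.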

Linked PRs

vstinner commented 1 month ago

It's unclear to me which encodings should emit a warning or not. For example: