For file objects, xml.etree.ElementTree.parse does not work in all (expected) cases. A hint like the following is missing:
For file objects containing XML data in an encoding other than ASCII or UTF-8 (e.g. ISO 8859-1), the file must be opened in binary mode.
Otherwise (if the file is opened in text mode, regardless of the encoding passed to open()), non-ASCII characters are not read correctly (see this question on Stack Overflow and the attached files in test_parseXml.zip for reproducing the problem).
Here is an excerpt of the attached test code:
import xml.etree.ElementTree

# ok: file opened in binary mode
with open('test_ISO-8859-1.xml', 'rb') as fileInBinary:
    root = xml.etree.ElementTree.parse(fileInBinary).getroot()
    print(root.attrib['attributeWithUmlauts'])

# garbage: file opened in text mode, even though the encoding is specified
with open('test_ISO-8859-1.xml', 'r', encoding='ISO-8859-1') as fileInAscii:
    root = xml.etree.ElementTree.parse(fileInAscii).getroot()
    print(root.attrib['attributeWithUmlauts'])
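For anyone without the attached ZIP, the following is a self-contained sketch of the same comparison. The sample document, the temporary file, and the helper read_attribute are assumptions made for illustration and are not taken from the attached test files. No output is shown for the text-mode case, since it may depend on the Python version.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document; the attribute name mirrors the excerpt.
# The attribute value is the three umlauts a, o, u written as escapes.
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<root attributeWithUmlauts="\u00e4\u00f6\u00fc"/>\n'
)


def read_attribute(source):
    """Parse source (a filename or file object) and return the attribute."""
    return ET.parse(source).getroot().attrib['attributeWithUmlauts']


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_ISO-8859-1.xml')

    # Store the document in the encoding that its XML declaration announces.
    with open(path, 'wb') as f:
        f.write(XML_TEXT.encode('iso-8859-1'))

    # Binary mode: the parser receives raw bytes and decodes them itself.
    with open(path, 'rb') as file_in_binary:
        from_binary = read_attribute(file_in_binary)

    # Passing the filename also leaves the decoding to the parser.
    from_filename = read_attribute(path)

    # Text mode: the case reported as producing garbage.
    with open(path, 'r', encoding='ISO-8859-1') as file_in_text:
        try:
            from_text = read_attribute(file_in_text)
        except ET.ParseError as exc:
            # Keep the error for inspection instead of aborting the comparison.
            from_text = exc

print('binary mode:', repr(from_binary))
print('filename:   ', repr(from_filename))
print('text mode:  ', repr(from_text))
```

Comparing the three printed values shows whether the text-mode result differs from the binary-mode one on a given interpreter.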
Documentation
In the documentation for xml.etree.ElementTree.parse, it says for the first argument source: "... source is a filename or file object containing XML data. ..."