martinblech / xmltodict

Python module that makes working with XML feel like you are working with JSON
MIT License

[BUG] XML UTF-8 with BOM fails #330

Open Kochise opened 1 year ago

Kochise commented 1 year ago

You can reproduce this with any XML file that starts with a BOM:

D:\Pyenv310>xml22yaml -i "d:\Pyenv310\TEST\Alarms.xml" -o "d:\Pyenv310\TEST\Alarms.yaml"
Traceback (most recent call last):
  File "D:\Pyenv310\Python\Lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Pyenv310\Python\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Pyenv310\Python\Scripts\xml22yaml.exe\__main__.py", line 7, in <module>
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "D:\Pyenv310\Python\lib\site-packages\yaplon\__main__.py", line 701, in xml2yaml
    reader.xml(
  File "D:\Pyenv310\Python\lib\site-packages\yaplon\reader.py", line 71, in xml
    obj = oxml.parse(input.read(), process_namespaces=namespaces)
  File "D:\Pyenv310\Python\lib\site-packages\xmltodict.py", line 378, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Regards.

mpf82 commented 1 year ago

You can specify the encoding in parse(); the default is utf-8.

IANA currently lists 250+ character encodings.

Python natively supports a subset of 109 encodings (plus some Python specific encodings).

You cannot possibly expect xmltodict to know or to guess which one your input uses.
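
That said, a BOM does identify a handful of Unicode encodings without any guessing. A minimal sketch using only the standard library's `codecs` constants; `sniff_bom` is a hypothetical helper name, not part of xmltodict, and it only distinguishes the UTF-8/16/32 families:

```python
import codecs

# (BOM bytes, Python codec that consumes that BOM while decoding).
# The UTF-32 entries come first because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or `default` if there is none."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return default
```

Input without a BOM falls back to `default`, so this does nothing for the legacy 8-bit encodings; those still have to be specified by the caller.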

Kochise commented 1 year ago

set "PYTHONIOENCODING=utf8"

xmltodict shouldn't care about BOM

Alarms.xml.txt

mpf82 commented 1 year ago

Seems you're right, explicitly passing bytes with a BOM works just fine:

import xmltodict
xml = '''<?xml version="1.0"?><test>123</test>'''
xml = xml.encode("utf-8-sig")
out = xmltodict.parse(xml)
print(out) # {'test': '123'}

So maybe the error is somewhere else? Either the file has a different encoding, or the other libs you're using are modifying the string/bytes somehow.
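
One way the "different encoding" theory could play out, sketched with the standard library's expat bindings (the same parser the traceback ends in). The choice of cp1252 is an assumption, picked because it is a common Windows locale default; nothing in this thread confirms how the file was actually decoded. `is_well_formed` is a hypothetical helper for this sketch only.

```python
from xml.parsers import expat

# UTF-8 bytes with a BOM, as some Windows editors write them.
raw = b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>'


def is_well_formed(xml_input) -> bool:
    """Return True if expat accepts the input (str or bytes)."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(xml_input, True)
    except expat.ExpatError:
        return False
    return True


# Bytes passed straight through: expat recognises the BOM itself.
bytes_ok = is_well_formed(raw)

# Decoded with a legacy codec (assumed here: cp1252): the three BOM bytes
# turn into the characters "ï»¿" in front of the XML declaration.
mangled = raw.decode("cp1252")
mangled_ok = is_well_formed(mangled)

# Decoded with utf-8-sig: the BOM is dropped before parsing.
clean_ok = is_well_formed(raw.decode("utf-8-sig"))

print(bytes_ok, mangled_ok, clean_ok)
```

If the file is read in text mode with the wrong codec before it reaches parse(), the parser sees stray characters at the very start of the document, which would be consistent with an error reported on line 1.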


Edit: these also work:

from io import BytesIO, StringIO

b = BytesIO(b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>')
print(xmltodict.parse(b.read()))

b = StringIO(b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>'.decode("utf-8-sig"))
print(xmltodict.parse(b.read()))

Kochise commented 1 year ago

I'm just using yaplon (https://github.com/twardoch/yaplon):

D:\Pyenv310>xml22yaml -i "d:\Pyenv310\TEST\Alarms.xml" -o "d:\Pyenv310\TEST\Alarms.yaml"

It fails here:

https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L378

Called from here:

https://github.com/twardoch/yaplon/blob/master/yaplon/reader.py#L71

The issue is probably somewhere around here:

https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L341

https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L341