python / cpython

cElementTree has problems with StringIO object containing unicode content #64811

Closed 107431f5-b9b1-42f4-987d-debb56f50666 closed 7 years ago

107431f5-b9b1-42f4-987d-debb56f50666 commented 10 years ago
BPO 20612
Nosy @scoder, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['expert-XML', 'type-bug']
title = 'cElementTree has problems with StringIO object containing unicode content'
updated_at =
user = 'https://bugs.python.org/amcone'
```

bugs.python.org fields:
```python
activity =
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date =
closer = 'serhiy.storchaka'
components = ['XML']
creation =
creator = 'amcone'
dependencies = []
files = []
hgrepos = []
issue_num = 20612
keywords = []
message_count = 3.0
messages = ['211114', '211115', '255065']
nosy_count = 4.0
nosy_names = ['scoder', 'eli.bendersky', 'serhiy.storchaka', 'amcone']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue20612'
versions = ['Python 2.7']
```

107431f5-b9b1-42f4-987d-debb56f50666 commented 10 years ago

There seems to be a specific issue when using cElementTree.parse on a StringIO object containing unicode text - it generates a ParseError.

I've tried variations of ElementTree and cElementTree, variations of StringIO and cStringIO, and used str and unicode types. It seems there is one combination of these that generates the problem. I tried this on Python 2.7.5 - this bit of code shows the inconsistency we've got:

>>> from xml.etree import ElementTree as ET, cElementTree as CET
>>> from StringIO import StringIO as SIO
>>> from cStringIO import StringIO as CSIO
>>> xml, uxml = '<simple />', u'<simple />'
>>> 
>>> def parse(etree_impl, strio_class, text):
...     try:
...         return etree_impl.parse(strio_class(text))
...     except Exception as e:
...         return 'ERROR: ' + repr(e)
... 
>>> for etree_var in 'ET CET'.split():
...     for sio_var in 'SIO CSIO'.split():
...         for xml_var in 'xml uxml'.split():
...             print etree_var, sio_var, xml_var,
...             print parse(vars()[etree_var], vars()[sio_var], vars()[xml_var])
... 
ET SIO xml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
ET SIO uxml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
ET CSIO xml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
ET CSIO uxml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
CET SIO xml <ElementTree object at 0x7f92c795ec90>
CET SIO uxml ERROR: ParseError('no element found: line 1, column 0',)
CET CSIO xml <ElementTree object at 0x7f92c795ec90>
CET CSIO uxml <ElementTree object at 0x7f92c795ec90>

serhiy-storchaka commented 10 years ago

cStringIO.StringIO() can contain only str (unicode is automatically coerced to str), while StringIO.StringIO() can contain str or unicode.

>>> SIO(uxml).read()
u'<simple />'
>>> CSIO(uxml).read()
'<simple />'

cElementTree.parse() works only with binary streams.
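
Given that, one possible workaround is to encode the text yourself and hand the parser a byte stream. The following is only a minimal sketch: `parse_text` is a hypothetical helper name, not part of the standard library, and it assumes the XML either has no encoding declaration or declares the same encoding that is passed in.

```python
import io
from xml.etree import ElementTree as ET  # on Python 2.7, cElementTree could be swapped in here


def parse_text(text, encoding="utf-8"):
    """Parse XML held in a str/unicode object by feeding the parser bytes.

    Hypothetical helper for illustration. Assumes the document does not
    declare an encoding that conflicts with `encoding`.
    """
    if not isinstance(text, bytes):
        # unicode (Python 2) or str (Python 3): turn it into bytes first
        text = text.encode(encoding)
    return ET.parse(io.BytesIO(text))


tree = parse_text(u"<simple />")
print(tree.getroot().tag)
```

Because the stream only ever yields bytes, the parser never sees a unicode object, which is the condition that appears to trigger the ParseError in the report above.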

serhiy-storchaka commented 8 years ago

For now, the cElementTree parser just stops parsing when it reads something that is not exactly of type str. Eli, Stefan, it is not hard to make cElementTree support Unicode streams; it takes only a few lines of code. But is it worth doing at this stage? Or should we just close this issue as "won't fix"?
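
For callers who already have a text stream, a similar effect can be approximated outside the parser with a thin wrapper that encodes whatever `read()` returns. This is a user-level sketch, not the change discussed for cElementTree itself; `EncodingReader` is a hypothetical name, and UTF-8 is an assumed default.

```python
import io
from xml.etree import ElementTree as ET  # on Python 2.7, cElementTree could be swapped in here


class EncodingReader(object):
    """Wrap a file-like object so that read() always returns bytes.

    Hypothetical helper for illustration; `encoding` is an assumption
    and should match any encoding declared in the XML itself.
    """

    def __init__(self, stream, encoding="utf-8"):
        self._stream = stream
        self._encoding = encoding

    def read(self, size=-1):
        data = self._stream.read(size)
        if not isinstance(data, bytes):
            # Text chunk: encode it so the parser only ever sees bytes
            data = data.encode(self._encoding)
        return data


text_stream = io.StringIO(u"<simple>caf\u00e9</simple>")
root = ET.parse(EncodingReader(text_stream)).getroot()
print(root.tag)
```

The wrapper relies only on the parser calling `read(size)` on its source, so the underlying stream can be a StringIO.StringIO, an io.StringIO, or a file opened in text mode.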