Closed 7 years ago
There seems to be a specific issue when using cElementTree.parse on a StringIO object containing unicode text - it generates a ParseError.
I've tried both ElementTree and cElementTree, both StringIO and cStringIO, and both str and unicode input. Exactly one combination fails: cElementTree parsing a StringIO.StringIO that holds a unicode string. I tried this on Python 2.7.5; this bit of code shows the inconsistency:
>>> from xml.etree import ElementTree as ET, cElementTree as CET
>>> from StringIO import StringIO as SIO
>>> from cStringIO import StringIO as CSIO
>>> xml, uxml = '<simple />', u'<simple />'
>>>
>>> def parse(etree_impl, strio_class, text):
...     try:
...         return etree_impl.parse(strio_class(text))
...     except Exception as e:
...         return 'ERROR: ' + repr(e)
...
>>> for etree_var in 'ET CET'.split():
...     for sio_var in 'SIO CSIO'.split():
...         for xml_var in 'xml uxml'.split():
...             print etree_var, sio_var, xml_var,
...             print parse(vars()[etree_var], vars()[sio_var], vars()[xml_var])
...
ET SIO xml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
ET SIO uxml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
ET CSIO xml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
ET CSIO uxml <xml.etree.ElementTree.ElementTree object at 0x7f92c795ec90>
CET SIO xml <ElementTree object at 0x7f92c795ec90>
CET SIO uxml ERROR: ParseError('no element found: line 1, column 0',)
CET CSIO xml <ElementTree object at 0x7f92c795ec90>
CET CSIO uxml <ElementTree object at 0x7f92c795ec90>
cStringIO.StringIO() can contain only str (unicode is automatically coerced to str), while StringIO.StringIO() can contain str or unicode.
>>> SIO(uxml).read()
u'<simple />'
>>> CSIO(uxml).read()
'<simple />'
cElementTree.parse() works only with binary streams.
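A possible workaround, sketched under the assumption that the document is UTF-8 or carries no conflicting encoding declaration: encode the text to bytes and hand the parser a binary stream. `parse_text` is a hypothetical helper name, not something the library provides, and the import fallback is only there so the snippet also runs on interpreters that no longer ship cElementTree.

```python
import io

try:
    # Python 2.7 and early Python 3: the C accelerated module under its own name.
    from xml.etree import cElementTree as ET
except ImportError:
    # Newer Python 3: ElementTree picks up the C accelerator by itself.
    from xml.etree import ElementTree as ET


def parse_text(text, encoding="utf-8"):
    """Parse XML held in a text or byte string via a binary stream.

    Hypothetical helper. Assumes the XML declaration, if any, does not
    name an encoding other than `encoding`.
    """
    if not isinstance(text, bytes):
        # unicode on Python 2, str on Python 3: turn it into bytes first.
        text = text.encode(encoding)
    return ET.parse(io.BytesIO(text))


tree = parse_text(u"<simple />")
print(tree.getroot().tag)  # simple
```

If the text carries a declaration such as `encoding="latin-1"`, it would have to be encoded with that same codec instead.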
For now, the cElementTree parser simply stops parsing as soon as it reads something that is not exactly of type str. Eli, Stefan, it would not be hard to make cElementTree support Unicode streams; it takes only a few lines of code. But is it worth doing at this stage? Or should we just close this issue as "won't fix"?
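The stop-on-non-str behaviour described above can be modelled in pure Python. This is only an illustration of the described behaviour, not the actual C code: the function name and the chunk size are assumptions. It shows why the failure surfaces as "no element found": the parser is finalized without ever having been fed any data.

```python
import io
from xml.etree import ElementTree as ET


def parse_stopping_on_non_bytes(stream, chunk_size=65536):
    """Model of a read loop that gives up on non-byte-string chunks.

    Hypothetical illustration: `bytes` is `str` on Python 2, so a
    unicode chunk ends the loop exactly like end of file does.
    """
    parser = ET.XMLParser()
    while True:
        chunk = stream.read(chunk_size)
        if not isinstance(chunk, bytes) or not chunk:
            break  # treated the same as reaching the end of the stream
        parser.feed(chunk)
    # With nothing fed, finalizing the parser raises ParseError.
    return parser.close()


# A binary stream parses normally.
root = parse_stopping_on_non_bytes(io.BytesIO(b"<simple />"))
print(root.tag)

# A text stream is abandoned at the first chunk, so no element is ever seen.
try:
    parse_stopping_on_non_bytes(io.StringIO(u"<simple />"))
except ET.ParseError as exc:
    print("ERROR: " + repr(exc))
```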
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['expert-XML', 'type-bug']
title = 'cElementTree has problems with StringIO object containing unicode content'
updated_at =
user = 'https://bugs.python.org/amcone'
```
bugs.python.org fields:
```python
activity =
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date =
closer = 'serhiy.storchaka'
components = ['XML']
creation =
creator = 'amcone'
dependencies = []
files = []
hgrepos = []
issue_num = 20612
keywords = []
message_count = 3.0
messages = ['211114', '211115', '255065']
nosy_count = 4.0
nosy_names = ['scoder', 'eli.bendersky', 'serhiy.storchaka', 'amcone']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue20612'
versions = ['Python 2.7']
```