python / cpython

The Python programming language
https://www.python.org

xml.sax.xmlreader.XMLReader.getProperty (xml.sax.handler.property_xml_string) returns bytes #50935

Open 85eff9a1-9aa2-482e-ae67-5981ce57f8d9 opened 15 years ago

85eff9a1-9aa2-482e-ae67-5981ce57f8d9 commented 15 years ago
BPO 6686
Nosy @loewis, @amauryfa, @scoder, @taleinat, @tiran, @jfgossage, @ukarroum
PRs
  • python/cpython#9715
  • python/cpython#10328
  • python/cpython#30612
Files
  • expatreader.py.patch: Patch to return xml.sax.handler.property_xml_string as a string rather than bytes.
  • expatreader.py.patch2: Patch to return xml.sax.handler.property_xml_string as a string and to provide the Locator2 interface.

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['expert-XML', 'type-bug', '3.9', '3.10', '3.11']
    title = 'xml.sax.xmlreader.XMLReader.getProperty (xml.sax.handler.property_xml_string) returns bytes'
    updated_at =
    user = 'https://bugs.python.org/cms103'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'iritkatriel'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['XML']
    creation =
    creator = 'cms103'
    dependencies = []
    files = ['14701', '14702']
    hgrepos = []
    issue_num = 6686
    keywords = ['patch']
    message_count = 7.0
    messages = ['91482', '91503', '91504', '91505', '110871', '327700', '327708']
    nosy_count = 8.0
    nosy_names = ['loewis', 'amaury.forgeotdarc', 'scoder', 'taleinat', 'christian.heimes', 'cms103', 'Jonathan.Gossage', 'ukarroum']
    pr_nums = ['9715', '10328', '30612']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6686'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
    ```

    85eff9a1-9aa2-482e-ae67-5981ce57f8d9 commented 15 years ago

    The documentation for the xml.sax.handler.property_xml_string SAX property states that it should be "data type: String". However, when retrieving this value in Python 3.1, it returns a bytes object instead.

    This makes handling the returned value very difficult because there is no method for retrieving the character set encoding that the XML was originally encoded with.

    This is currently blocking the Python 3 port of SimpleTAL from achieving feature parity with Python 2.
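The behaviour described in this report can be reproduced with a short script. This is a minimal sketch and is not taken from the attached patches: the handler class `XMLStringProbe` and the helper `probe_xml_string` are hypothetical names introduced here. The expat-based reader only serves this property while a parse is in progress, so the sketch queries it from inside a `startElement` callback.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_xml_string


class XMLStringProbe(ContentHandler):
    """Hypothetical handler: record what property_xml_string returns mid-parse."""

    def __init__(self, reader):
        super().__init__()
        self.reader = reader
        self.seen = []

    def startElement(self, name, attrs):
        # The property is only available while the parser is active,
        # so it has to be queried from inside a callback.
        value = self.reader.getProperty(property_xml_string)
        self.seen.append((name, value))


def probe_xml_string(document):
    """Parse `document` (bytes); return [(element name, property value), ...]."""
    reader = xml.sax.make_parser()
    handler = XMLStringProbe(reader)
    reader.setContentHandler(handler)
    reader.parse(io.BytesIO(document))
    return handler.seen


if __name__ == "__main__":
    for name, value in probe_xml_string(b"<root><child/></root>"):
        # Print only the type; the exact buffer contents are up to expat.
        print(name, type(value).__name__)
```

On the Python 3 versions this issue targets, the script is expected to report `bytes` rather than `str` for each element.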

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 15 years ago

    Would you like to contribute a patch?

    85eff9a1-9aa2-482e-ae67-5981ce57f8d9 commented 15 years ago

    I'm not familiar with the inner workings of the expat integration with Python, so the attached patches need careful review.

    The first patch (expatreader.py.patch) is the minimum to resolve this issue. The second patch (expatreader.py.patch2) also exposes the version and encoding parameters via the Locator2 interface (http://www.saxproject.org/apidoc/org/xml/sax/ext/Locator2.html), which I'd recommend including.

    85eff9a1-9aa2-482e-ae67-5981ce57f8d9 commented 15 years ago

    Adding second patch.

    amauryfa commented 14 years ago

    A unit test (or even a sample script) showing the desired feature is needed.

    taleinat commented 5 years ago

    See additional research and discussion in the comments of PR python/cpython#9715.

    Simply changing this to return a string rather than bytes would break backwards compatibility.

    I certainly agree that this should have returned a string in the first place, especially since the Unicode decoding is otherwise completely abstracted away and the encoding used is not made available.

    Our options:

    1. Return a string starting with 3.8, document the change in What's New & fix the docs for older 3.x.
    2. Continue returning bytes, update the docs for all 3.x versions to say that this returns bytes and that there's no good way to know the proper encoding to use for decoding it.
    3. As 2 above, but also expose the encoding used.

    Since this appears to be rarely used and option 3 requires significantly more effort than the others, I am against it.

    Option 2 seems the safest, but I'd like to hear more from those more experienced with XML.
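Under option 2, the decoding burden stays with the caller. The sketch below illustrates what that looks like from the caller's side. The helper `xml_string_as_text` and the `TextCollector` class are hypothetical names, and the `encoding` argument is an assumption the caller must supply from outside knowledge, since the reader does not expose it.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_xml_string


def xml_string_as_text(reader, encoding):
    """Hypothetical helper: decode property_xml_string with a caller-chosen encoding."""
    raw = reader.getProperty(property_xml_string)
    if raw is None or isinstance(raw, str):
        # Nothing to decode (no context available, or already text).
        return raw
    # errors="replace" keeps the sketch from raising if the guess is wrong.
    return raw.decode(encoding, errors="replace")


class TextCollector(ContentHandler):
    """Hypothetical handler: collect the decoded property at each start tag."""

    def __init__(self, reader, encoding):
        super().__init__()
        self.reader = reader
        self.encoding = encoding
        self.texts = []

    def startElement(self, name, attrs):
        self.texts.append(xml_string_as_text(self.reader, self.encoding))


def collect_texts(document, encoding):
    """Parse `document` (bytes) and return the decoded property values."""
    reader = xml.sax.make_parser()
    handler = TextCollector(reader, encoding)
    reader.setContentHandler(handler)
    reader.parse(io.BytesIO(document))
    return handler.texts
```

The weak point is the `encoding` parameter: if the caller guesses wrong, the result is silently corrupted text, which is the documentation caveat that option 2 would have to spell out.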

    f76afc2a-b4c0-40d2-99d3-840c8a79604c commented 5 years ago

    The other thing to consider, which also supports option 2, is that xml.parsers.expat provides an interface to the Expat parser that is easier to use and more complete than the SAX parser implementation, and it is the implementation likely to be used by anyone needing a streaming parser.
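For comparison, the same raw input context can be read directly through `xml.parsers.expat`. This is a minimal sketch, assuming a small in-memory document. The function name `start_contexts` is hypothetical, while `ParserCreate`, `StartElementHandler`, `Parse`, and `GetInputContext` are the names documented for `xml.parsers.expat`.

```python
import xml.parsers.expat


def start_contexts(document):
    """Return [(element name, input context), ...] for each start tag."""
    parser = xml.parsers.expat.ParserCreate()
    contexts = []

    def on_start(name, attrs):
        # GetInputContext() is only meaningful while a handler is running.
        contexts.append((name, parser.GetInputContext()))

    parser.StartElementHandler = on_start
    parser.Parse(document, True)  # True marks this as the final chunk
    return contexts


if __name__ == "__main__":
    for name, context in start_contexts(b"<root><child/></root>"):
        print(name, type(context).__name__)
```

Here too the context is expected to come back as bytes in the document's own encoding, so a caller using this interface faces the same decoding question.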