python / cpython

The Python programming language
https://www.python.org

ElementTree -- provide a way to ignore namespace in tags and searches #62504

Closed 8e115224-c151-4882-8772-a3cb861f588d closed 1 year ago

8e115224-c151-4882-8772-a3cb861f588d commented 11 years ago
BPO 18304
Nosy @rhettinger, @scoder, @vadmium
Files
  • etree_strip_namespaces.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['expert-XML', '3.8', 'type-feature', 'library']
    title = 'ElementTree -- provide a way to ignore namespace in tags and searches'
    updated_at =
    user = 'https://bugs.python.org/brycenesbitt'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'lars.hammarstrand'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'XML']
    creation =
    creator = 'brycenesbitt'
    dependencies = []
    files = ['43912']
    hgrepos = []
    issue_num = 18304
    keywords = ['patch']
    message_count = 19.0
    messages = ['191894', '194896', '194899', '194901', '194906', '194919', '194920', '194923', '216768', '216774', '235724', '271448', '271466', '271472', '271475', '305405', '341000', '341001', '389391']
    nosy_count = 9.0
    nosy_names = ['rhettinger', 'scoder', 'eli.bendersky', 'martin.panter', 'brycenesbitt', 'pocek', 'jjmiller50', 'Tim Chambers', 'lars.hammarstrand']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18304'
    versions = ['Python 3.8']
    ```

    8e115224-c151-4882-8772-a3cb861f588d commented 11 years ago

    ElementTree offers a wonderful and easy API for parsing XML... but if there is a namespace involved it suddenly gets ugly. This is a proposal to fix that. First an example:

    ------------------

    #!/usr/bin/python
    # Demonstrate awkward behavior of namespaces in ElementTree
    import xml.etree.cElementTree as ET

    xml_sample_one = """\
    <?xml version="1.0"?>
    <presets>
    <thing stuff="some stuff"/>
    <thing stuff="more stuff"/>
    </presets>
    """
    root = ET.fromstring(xml_sample_one)
    for child in root.iter('thing'):
        print child.tag
    
    xml_sample_two = """\
    <?xml version="1.0"?>
    <presets xmlns="http://josm.openstreetmap.de/tagging-preset-1.0">
    <thing stuff="some stuff"/>
    <thing stuff="more stuff"/>
    </presets>
    """
    root = ET.fromstring(xml_sample_two)
    for child in root.iter('{http://josm.openstreetmap.de/tagging-preset-1.0}thing'):
        print child.tag

    Because of the namespace in the 2nd example, a {namespace} name keeps {namespace} getting {namespace} in {namespace} {namespace} the way.

    Online there are dozens of questions on how to deal with this, for example: http://stackoverflow.com/questions/11226247/python-ignore-xmlns-in-elementtree-elementtree

    With wonderfully syntactic solutions like 'item.tag.split("}")[1][0:]'

    -----

    How about if I could set any root to have an array of namespaces to suppress:

    root = ET.fromstring(xml_sample_two)
    root.xmlns_at_root.append('{namespace}')

    Or even just a boolean that says I'll take all my namespaces without qualification?

    scoder commented 11 years ago

    FWIW, lxml.etree supports wildcards like '{*}tag' in searches, and this is otherwise quite rarely a problem in practice.

    I'm -1 on the proposed feature and wouldn't mind rejecting this altogether. (At least change the title to something more appropriate.)

    786d3f11-b763-4414-a03f-abc264e0b72d commented 11 years ago

    I was planning to look more closely at the namespace support in ET at some point, but haven't found the time yet.

    [changing the title to be more helpful]

    scoder commented 11 years ago

    There's also the QName class which can be used to split qualified tag names. And it's pretty trivial to pre-process the entire tree by stripping all namespaces from it if the intention is really to do namespace agnostic processing. However, in my experience, most people who want to do that haven't actually understood namespaces (although, admittedly, sometimes it's those who designed the XML format who didn't understand namespaces ...).

    786d3f11-b763-4414-a03f-abc264e0b72d commented 11 years ago

    (although, admittedly, sometimes it's those who designed the XML format who didn't understand namespaces ...).

    I fully concur. The design of XML, in general, is not the best demonstration of aesthetics in programming. But namespaces always seem to me to be one further step in the WTF direction. This is precisely why I didn't reject this issue right away: perhaps it's not a bad idea to provide Python programmers with *some* way to ease namespace-related tasks (even if they go against the questionable design principles behind XML).

    8e115224-c151-4882-8772-a3cb861f588d commented 11 years ago

    The mere existence of popular solutions like 'item.tag.split("}")[1][0:]' argues something is wrong. What could lxml do to make this cleaner (even if the ticket proposal is junk)?

    scoder commented 11 years ago

    Please leave the title as it is now.

    scoder commented 11 years ago

    As I already suggested for lxml, you can use the QName class to process qualified names, e.g.

    QName(some_element.tag).localname

    Or even just

    QName(some_element).localname

    It appears that ElementTree doesn't support this. It lists the QName type as "opaque". However, it does provide a "text" attribute that contains the qualified tag name.

    http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.QName

    Here is the corresponding documentation from lxml:

    http://lxml.de/api/lxml.etree.QName-class.html

    QName instances in lxml provide the properties "localname", "namespace" and "text".
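
Since the stdlib QName only exposes "text", the split has to be done by hand there. A minimal sketch, assuming Clark notation ("{uri}local") as input; `split_tag` is a hypothetical helper name, not an ElementTree API:

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split a Clark-notation name into (namespace, localname).

    Hypothetical helper, not part of ElementTree. Accepts a plain string
    or an ET.QName, whose qualified name is available as `.text`.
    """
    text = tag.text if isinstance(tag, ET.QName) else tag
    if text.startswith("{"):
        # Split on the last "}": a local name cannot contain one.
        namespace, _, local = text[1:].rpartition("}")
        return namespace, local
    return None, text


root = ET.fromstring(
    '<presets xmlns="http://josm.openstreetmap.de/tagging-preset-1.0"><thing/></presets>'
)
print(split_tag(root[0].tag))
```

Returning `None` for the namespace of an unqualified name keeps "no namespace" distinguishable from an empty string; that choice is an assumption of this sketch, not something lxml or ElementTree prescribes.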

    rhettinger commented 10 years ago

    FWIW, I would like to have a way to ignore namespaces.

    For many day-to-day problems (parsing Microsoft Excel files saved in an XML format or parsing RSS feeds), this would be a nice simplification.

    I teach Python for a living and have found that it is common for namespaces to be an obstacle for people trying to get something done.

    Giving them the following answer is an unsatisfactory response to legitimate needs: """ And it's pretty trivial to pre-process the entire tree by stripping all namespaces from it if the intention is really to do namespace agnostic processing. However, in my experience, most people who want to do that haven't actually understood namespaces (although, admittedly, sometimes it's those who designed the XML format who didn't understand namespaces ...). """

    scoder commented 10 years ago

    You can already use iterparse for this.

        it = ET.iterparse('somefile.xml')
        for _, el in it:
            el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
        root = it.root

    As I said, this would be a little friendlier with support in the QName class, but it's not really complex code. Could be added to the docs as a recipe, with a visible warning that this can easily lead to incorrect data processing and therefore should not be used in systems where the input is not entirely under control.

    Note that it's unclear what the "right way to do it" is, though. Is it better to 1) alter the data by stripping all namespaces off, or 2) let the tree API itself provide a namespace agnostic mode? Depends on the use case, but the more generic way 2) should be fairly involved in terms of implementation complexity, for just a minor use case. 1) would be ok in most cases where this "feature" is useful, I guess, and can be done as shown above.

    In fact, the advantage of doing it explicitly with iterparse() is that instead of stripping all namespaces, only the expected namespaces can be discarded. And errors can be raised when finding unnamespaced elements, for example. This allows for a safety guard that prevents the code from completely misinterpreting input. There is a reason why namespaces were added to XML at some point.
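
The guarded variant described above might look roughly like this. It is a sketch only: `parse_without_namespaces` and its `expected` parameter are hypothetical names, and raising ValueError is one possible policy among several.

```python
import io
import xml.etree.ElementTree as ET


def parse_without_namespaces(source, expected):
    """Parse `source`, dropping only the namespaces listed in `expected`.

    Hypothetical helper. Raises ValueError for any element that is
    unnamespaced or lives in a namespace the caller did not list.
    """
    it = ET.iterparse(source)  # default: "end" events only
    for _, el in it:
        if not el.tag.startswith("{"):
            raise ValueError(f"unnamespaced element: {el.tag!r}")
        uri, _, local = el.tag[1:].rpartition("}")
        if uri not in expected:
            raise ValueError(f"unexpected namespace: {uri!r}")
        el.tag = local
    return it.root  # set once the source has been fully read


JOSM = "http://josm.openstreetmap.de/tagging-preset-1.0"
doc = f'<presets xmlns="{JOSM}"><thing stuff="some stuff"/></presets>'.encode()
root = parse_without_namespaces(io.BytesIO(doc), {JOSM})
print(root.tag, [child.tag for child in root])
```

Attribute names are left untouched here; only element tags are rewritten.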

    vadmium commented 9 years ago

    See bpo-8583 for a proposal that would apparently allow all namespaces to be ignored

    751ad8a0-6c7f-4fba-a700-daeab5d3f52f commented 8 years ago

    A flexible and pretty simple way of loosening up handling namespaces would be to OPTIONALLY change what is done at parse time:

    1. Don't handle xmlns declarations specially. Leave them as normal attributes, and the Element.attrib would have a normal entry for each.

    2. Leave the abbreviation colon-separated prefix in front of the element tags as they come in.

    If the using code wants, it can walk the ElementTree contents making dictionaries of the active namespace declarations, tucking a dict reference into each Element. Maybe put in an ElementTree method that does this, why not?

    I'm interested in this topic because I wish to handle xml from a variety of different tools, some of which had their XML elements defined without namespaces. They can use element names which are really common - like 'project' - and have no namespace definitions. Worse: if you put one in, the tool that originally used the element breaks.

    Doing things as suggested gives the user the opportunity to look for matches using the colonized names, to shift namespace abbrevs easily, and to write out nicely namespaced code with abbrevs on the elements easily.

    This would be OPTIONAL: the way etree does it now, full prefixing of URI, is the safe way and should be retained as the default.
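
Part of this is already reachable without changing the parser: iterparse() can report namespace declarations through its "start-ns" event, so calling code can collect the prefixes itself. A rough sketch, where `parse_with_prefixes` and the `urn:example:tool-a` URI are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET


def parse_with_prefixes(source):
    """Return (root, prefix_map) for an XML source.

    Hypothetical helper. The map is document-wide: if a prefix is
    redeclared further down the tree, the last declaration wins.
    """
    prefix_map = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload  # the default namespace has prefix ""
            prefix_map[prefix] = uri
        elif root is None:
            root = payload  # the first "start" event is the root element
    return root, prefix_map


doc = b'<p:project xmlns:p="urn:example:tool-a"><p:task/><p:task/></p:project>'
root, prefixes = parse_with_prefixes(io.BytesIO(doc))
print(prefixes)
print(len(root.findall("p:task", prefixes)))
```

The tags in the tree stay in fully qualified form; the collected map only lets searches use the document's own abbreviations through the `namespaces` argument of find() and findall().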

    scoder commented 8 years ago

    Here is a proposed patch for a new function "strip_namespaces(tree)" that discards all namespaces from tags and attributes in a (sub-)tree, so that subsequent processing does not have to deal with them.

    The "__all__" test is failing (have to figure out how to fix that), and docs are missing (it's only a proposal for now). Comments welcome.
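
The attached patch is not reproduced in this thread. As a rough guess at the behaviour described (not the patch itself), a function of that shape could be sketched as:

```python
import xml.etree.ElementTree as ET


def strip_namespaces(tree):
    """Discard namespaces from tags and attribute names, in place.

    A guess at the behaviour described above, not the attached patch.
    """
    for el in tree.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str):
            el.tag = el.tag.rpartition("}")[2]
        for name in list(el.attrib):
            local = name.rpartition("}")[2]
            if local != name:
                # Caution: silently overwrites a same-named plain attribute.
                el.attrib[local] = el.attrib.pop(name)
    return tree


root = ET.fromstring(
    '<a:root xmlns:a="urn:example:a" a:id="1"><a:child/><plain/></a:root>'
)
strip_namespaces(root)
print(root.tag, root.attrib, [child.tag for child in root])
```

The `urn:example:a` namespace is invented for the example. Name collisions after stripping (two attributes or sibling tags that differ only by namespace) are exactly the kind of data loss the earlier warnings refer to.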

    scoder commented 8 years ago

    On second thought, I think it should be supported (also?) in the parser. Otherwise, using it with an async parser would be different from (and more involved than) one-shot parsing. That seems wrong.

    vadmium commented 8 years ago

    Perhaps it would make more sense to use rpartition() or rstrip(). It seems possible to have a closing curly bracket in a namespace, but not in an element tag or attribute name.

    My guess is the __all__ failure is just a sign to add the new function to the __all__ variable at the top of the module.

    2d5775f8-5d93-408d-8cb5-28be7bd311be commented 7 years ago

    I suggest adding the option to keep the namespace prefixes in the tree when reading in (will it need to set a tree wide variable for the instance?). I haven't looked at the etree internals in detail.

    Add a function to ElementTree that returns the tag using the namespace prefix (eg. treenode.tagpre). Namespaces and prefixes are cached and used to expand the prefix only when absolutely required. Some XML/xpath search operations currently assume the full expanded namespaces not prefixes which may lead to side-effects. You can leave the default behaviour for compatibility. Using prefixes in the tree storage and searches would reduce memory and CPU time (when no expansion is required).

    scoder commented 5 years ago

    Coming back to this issue after a while, I think it's still a relevant problem in some use cases. However, it's not currently clear what an improved solution would look like. The fully qualified tag names in Clark notation are long, sure, but also extremely convenient in being explicit and fully self-contained.

    One thing I noticed is that the examples used in this and other reports usually employ the .find() methods. Maybe issues 28238 and 30485 can reduce the pain to an acceptable level?

    Regarding the specific proposals:

    root.xmlns_at_root.append('{namespace}')

    This cannot work since searches from a child element would then not know about the prefix. Elements in ElementTree do not have a global tree state.

    1. Leave the abbreviation colon-separated prefix in front of the element tags as they come in.

    Note that prefixes are only indirections. They do not have a meaning by themselves and multiple prefixes can (and often do) refer to the same namespace. Moving the prefix resolution from the parser to the users seems to make the situation worse instead of better.

    scoder commented 5 years ago

    I was referring to bpo-28238 and bpo-30485.

    6f26a927-1fc0-452a-9eda-31f513d6a297 commented 3 years ago

    Any update regarding this?

    We switched to lxml to make life easier but it would be useful if this functionality also was implemented in the standard library.

    Wishlist:

    1. Flag to ignore all namespaces during find().
    2. Ability to set the default namespace during find().
    3. Clear existing namespaces similar to lxml cleanup_namespaces.

    aigarius commented 1 year ago

    This is really messing up a lot of workflows. Example:

    import io
    import xml.etree.ElementTree as ET
    
    afile = io.BytesIO(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root xmlns="http://example.com/2008/test.nsp"><child><grand></grand></child></root>')
    et = ET.parse(afile)
    afile = io.BytesIO()
    et.write(afile)
    afile.seek(0)
    afile.read()
    
    Output: 
    b'<ns0:root xmlns:ns0="http://example.com/2008/test.nsp"><ns0:child><ns0:grand /></ns0:child></ns0:root>'
    

    Who asked Python to rewrite the name of all tags to include the id of the only namespace that is there in the document? It should automatically figure out the default namespace and consistently use it as the default namespace. Plus it is losing the XML version information tag. The expectation would be that parsing and dumping a simple XML file like that would produce byte-identical output.

    scoder commented 1 year ago

    The expectation would be that parsing and dumping a simple XML file like that would produce byte-identical output.

    This is an invalid assumption. XML parsers apply a couple of adaptations such as whitespace normalisation that prevent byte-identical reproduction. Additionally, namespace prefixes are volatile, they do not have semantic meaning but only exist for technical reasons. There is no reason (apart from human readability) why namespace prefixes need to be preserved across input-output cycles.

    If you want byte-identical reproduction, use canonical XML. That's the one and only reason why it exists.
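
Assuming "canonical XML" here means C14N, the stdlib exposes it as ET.canonicalize() since Python 3.8. A small sketch using the document from the report above:

```python
import xml.etree.ElementTree as ET

doc = (
    '<root xmlns="http://example.com/2008/test.nsp">'
    "<child><grand></grand></child>"
    "</root>"
)

# Python 3.8+: returns the canonical form as a string.
canonical = ET.canonicalize(doc)
print(canonical)

# Feeding canonical output back in should reproduce it unchanged.
print(ET.canonicalize(canonical) == canonical)
```

Note that this gives a stable serialisation for equivalent documents; it does not promise to reproduce the bytes of an arbitrary, non-canonical input file.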

    scoder commented 1 year ago

    Who asked Python to rewrite the name of all tags to include the id of the only namespace that is there in the document? It should automatically figure out the default namespace and consistently use it as the default namespace.

    I don't think everyone would be happy about the slowdown in serialisation that this would introduce. Currently, the prefix mapping can be built on the fly during serialisation. In order to optimise the namespace prefixes for the case of a single namespace, we'd have to walk the entire document up-front to collect all namespaces, then create a prefix mapping, and then walk the entire tree again to serialise it using that mapping.

    This double walk would hit all serialisations, not only those that use namespaces, or those that use multiple namespaces. The intention to produce slightly more human readable XML output in a single use case really is not worth that impact.
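
For callers who already know the namespace, ElementTree.write() accepts a `default_namespace` argument, which avoids the generated `ns0` prefix without any automatic detection. A sketch based on the example from the report:

```python
import io
import xml.etree.ElementTree as ET

NS = "http://example.com/2008/test.nsp"
source = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<root xmlns="http://example.com/2008/test.nsp">'
    b"<child><grand></grand></child></root>"
)
et = ET.parse(source)

out = io.BytesIO()
# The caller names the default namespace and asks for the declaration.
et.write(out, encoding="UTF-8", xml_declaration=True, default_namespace=NS)
print(out.getvalue().decode("UTF-8"))
```

This is still not a byte-identical round trip: the declaration is regenerated (without `standalone`), and `default_namespace` raises ValueError if the tree contains unqualified element names.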

    scoder commented 1 year ago

    After resolving https://github.com/python/cpython/issues/72425 and https://github.com/python/cpython/issues/74670 back in 2019 (Python 3.8), I actually think that we can close this issue. The main processing problems with wildcards should be resolved by these changes.
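
For reference, a short sketch of what those two changes allow in the find*() methods on Python 3.8 and later, reusing the sample document from the original report:

```python
import xml.etree.ElementTree as ET

JOSM = "http://josm.openstreetmap.de/tagging-preset-1.0"
root = ET.fromstring(
    f'<presets xmlns="{JOSM}">'
    '<thing stuff="some stuff"/><thing stuff="more stuff"/>'
    "</presets>"
)

# "{*}thing" matches <thing> in any namespace (or none).
any_ns = root.findall("{*}thing")

# "{uri}*" matches every element in one namespace.
in_josm = root.findall(f"{{{JOSM}}}*")

# Mapping the empty prefix sets a default namespace for unprefixed names.
defaulted = root.findall("thing", {"": JOSM})

print(len(any_ns), len(in_josm), len(defaulted))
```

The wildcards apply to path searches only; the tags stored on the elements remain fully qualified.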