sissaschool / elementpath

XPath 1.0/2.0/3.0/3.1 parsers and selectors for ElementTree and lxml
MIT License
67 stars 19 forks source link

Attributes initially not iterated when iterating node tree. #72

Open robbert-harms opened 6 months ago

robbert-harms commented 6 months ago

Dear authors,

I encountered some strange behaviour when using the method iter() on an XPath tree. In particular, the attributes do not seem to be iterated over, unless explicitly called upon. Perhaps some lazy evaluation excludes them? Here is a small example with with I mean:

import elementpath
import lxml.etree as etree
from elementpath import get_node_tree

xml = etree.fromstring('<root id="test"></root>')
node_tree = get_node_tree(root=xml)

print(elementpath.__version__)

print('Round 1, iterate without accessing attributes:')
for node in node_tree.iter():
    print(node)

print('\nRound 2, same as nmr 1:')
for node in node_tree.iter():
    print(node)

print('\nRound 3, now we access the attributes:')
for node in node_tree.iter():
    node.attributes
    print(node)

print('\nRound 4, iterate again without accessing attributes:')
for node in node_tree.iter():
    print(node)

On my computer, this prints:

4.3.0 Round 1, iterate without accessing attributes: ElementNode(elem=<Element root at 0x7f118cde6940>)

Round 2, same as nmr 1: ElementNode(elem=<Element root at 0x7f118cde6940>)

Round 3, now we access the attributes: ElementNode(elem=<Element root at 0x7f118cde6940>) AttributeNode(name='id', value='test')

Round 4, iterate again without accessing attributes: ElementNode(elem=<Element root at 0x7f118cde6940>) AttributeNode(name='id', value='test')

You can see that in round 3, I access the node attributes just by calling the property. Afterwards, the attributes (in this case "id" of the root node) seem to be included in the iteration.

I believe that the attributes are meant to be included in the iteration since it should iterate over the XPath nodes. Let me know if that is a false assumption.

Best,

Robbert

robbert-harms commented 6 months ago

Just checked, it is also still present in v4.4.0

brunato commented 6 months ago

I believe that the attributes are meant to be included in the iteration since it should iterate over the XPath nodes. Let me know if that is a false assumption.

Hi, it is intended (there is the comment: # Iterate the tree not including the not built lazy components.) but i have to check the usage of iter() (that is used only in XPathContext.get_root()) and other methods (iter_document and 'iter_descendant`).

Maybe iter() can be replaced by iter_document and iter_descendants can be used in XPathContext.get_root(). I can add an argument lazy=False to iter()/iter_document to be explicit and avoid confusion on the intended behavior.

brunato commented 6 months ago

I found also a typo in iter_document().

After an analysis of the code I've decided to rename iter() as iter_lazy() and iter_document() as iter() (keeping also _iterdocument as an alias of iter for backward compatibility).

A lazy iteration has better performances, so better to keep the option of a lazy iteration when you have to consider only the built nodes:

>>> from xml.etree import ElementTree as ET
>>> from timeit import timeit
>>> from elementpath import get_node_tree
>>> xt = ET.parse('collection.xml')
>>> nt = get_node_tree(xt)
>>> timeit('for e in nt.iter_lazy(): pass', setup='from __main__ import nt')
10.037205275002634
>>> timeit('for e in nt.iter_lazy(): pass', setup='from __main__ import nt')
9.940140097998665
>>> timeit('for e in nt.iter(): pass', setup='from __main__ import nt')
26.356773235995206
>>> timeit('for e in nt.iter_lazy(): pass', setup='from __main__ import nt')
13.680087075001211

Best

robbert-harms commented 6 months ago

Hi brunato,

Your proposed solution works for me! Thank you for looking into this.

Best,

Robbert

brunato commented 2 weeks ago

@robbert-harms Finally a new release is out! Please check if the new API for nodes sounds better. Thank you

robbert-harms commented 4 days ago

Hi Brunato, thanks for this, looks great. I did however run into some issues after upgrading only elementpath from 4.4.0 to 4.5.0. I see you also improved the annotations, perhaps this broke my package. Unfortunately I don't have time to investigate right now. I will look at it somewhat later this year.

By the way, I don't know if you are keeping a list of packages using your software, but if you do, maybe you can add mine to it: https://github.com/robbert-harms/pyschematron/ It is a package for Schematron evaluation in pure Python. It uses your library for the crucial parts.

Best, Robbert

robbert-harms commented 1 day ago

Hi Brunato,

It works like a charm, I had to fix a small thing to make it work with your next version.

The only interesting observation is that now with iter_lazy() the attributes are by default included in the iteration, whereas they previously were not. In any case, this update is great. Thanks.

Best,

Robbert

brunato commented 12 hours ago

Hi @robbert-harms,

I've checked pyschematron and its type annotations, trying to find something that I can fix in elementpath. I found something that led me to implement the missing XPath2Parser methods in XPath1Parser as well, but raising an unsupported feature error in these cases (e.g. if you try to define a dynamic/static schema constructor).

Version 4.5.0 is full type annotated. This has involved some changes in type aliases, that reflect a more precise definition in what one have to provide as arguments for API calls. The main change is on annotations of select and evaluate complementary evaluation methods, that are now bounded to the XPath item type (before they were based on Any).

About iter_lazy() method is intended to be a lazy version of iter(), without attributes and namespaces it would be like iter_descendants. Maybe I could merge the implementation o iter and iter_lazy in that way:

    def iter(self, lazy: bool = False) -> Iterator[XPathNode]:
        """Iterates the tree not including the not built lazy components."""
        yield self

        iterators: Deque[Any] = deque()  # slightly faster than list()
        children: Iterator[Any] = iter(self.children)

        if not lazy or self._namespace_nodes:
            yield from self._namespace_nodes
        if not lazy self._attributes:
            yield from self._attributes

        while True:
            for child in children:
                yield child

                if isinstance(child, ElementNode):
                    if not lazy or child._namespace_nodes:
                        yield from child._namespace_nodes
                    if not lazy or child._attributes:
                        yield from child._attributes

                    if child.children:
                        iterators.append(children)
                        children = iter(child.children)
                        break
            else:
                try:
                    children = iterators.pop()
                except IndexError:
                    return

    def iter_lazy(self) -> Iterator[XPathNode]:   # For backward compatibility
         return self.iter(lazy=True)

thank you

robbert-harms commented 11 hours ago

Hi Brunato,

The problem I encountered was that previously this worked:

elementpath.select(element, xml_tag)

(found here: https://github.com/robbert-harms/pyschematron/blob/ef7d87568cf8741d9911892f13057c03fd6549fc/pyschematron/direct_mode/schematron/parsers/xml/parser.py#L125 )

But the new version (4.5.0) is more strict and I had to provide:

elementpath.select(element, xml_tag, namespaces={'': 'http://purl.oclc.org/dsdl/schematron'})

Is that on purpose? In the previous version I could select elements of the empty namespace without having to define that specifically.

For the rest, the new version works now for me. About the iteration, there are various options available, in my opinion:

  1. different methods for the different operations (as you have it now)
  2. keywords to change behavior (what you now suggested)
  3. if more keywords are needed you can bundle them in a container/context object
  4. strategy design pattern: remove the iter method completely and instead offer LazyIteration, FullIteration etc. You can provide a factory method to generate Iterators (get_iter() -> Iterator) and Iterators have base method iter() which changes behavior depending on the class.
  5. or, my personal preference, strategy+factory+facade pattern: you create number 4 and add a facade. In essence, you have iter(iterator: Iterator = null) take an Iteration method to do the actual iteration. If no Iteration method is provided, a default (LazyIteration) is used. A factory can be provided to generate different iterators.

I personally like number 5 a lot. It is by far the most flexible version. Semaphores (boolean flags) which you suggested have their limits. Suppose someone wants lazy iteration, but do would like to include namespace nodes? Then you get iter(lazy_namespaces=False, lazy_attributes=True). Every time you want to specialize you will have to add flags. On the contrary with number 5, people only have to specify a MyIteration() class, or instantiate LazyIteration which can offer more specific behaviour like: LazyIteration(lazy_ns=True, lazy_attr=False, ...). Also in this way you can offer ParallelizedIteration, C++BasedIteration, etc.

In my opinion changing behavior is ideally done by subclassing.

Anyway, I digress, your software is very useful and saved me an enormous amount of time.

Best,

Robbert

brunato commented 12 minutes ago

The problem I encountered was that previously this worked:

elementpath.select(element, xml_tag)

(found here: https://github.com/robbert-harms/pyschematron/blob/ef7d87568cf8741d9911892f13057c03fd6549fc/pyschematron/direct_mode/schematron/parsers/xml/parser.py#L125 )

But the new version (4.5.0) is more strict and I had to provide:

elementpath.select(element, xml_tag, namespaces={'': 'http://purl.oclc.org/dsdl/schematron'})

Is that on purpose? In the previous version I could select elements of the empty namespace without having to define that specifically.

It was a bug in v4.4: for parsing the (x)path expression the parser has to use an explicit namespace map, not the nsmap of of the XML tree (there is the commit: https://github.com/sissaschool/elementpath/commit/c97d7d3611ff2fabe64cab0e074436d10d84f8ae).

For the rest, the new version works now for me. About the iteration, there are various options available, in my opinion:

1. different methods for the different operations (as you have it now)

2. keywords to change behavior (what you now suggested)

3. if more keywords are needed you can bundle them in a container/context object

4. strategy design pattern: remove the iter method completely and instead offer LazyIteration, FullIteration etc. You can provide a factory method to generate Iterators (get_iter() -> Iterator) and Iterators have base method iter() which changes behavior depending on the class.

5. or, my personal preference, strategy+factory+facade pattern: you create number 4 and add a facade. In essence, you have `iter(iterator: Iterator = null)` take an Iteration method to do the actual iteration. If no Iteration method is provided, a default (LazyIteration) is used. A factory can be provided to generate different iterators.

I personally like number 5 a lot. It is by far the most flexible version. Semaphores (boolean flags) which you suggested have their limits. Suppose someone wants lazy iteration, but do would like to include namespace nodes? Then you get iter(lazy_namespaces=False, lazy_attributes=True). Every time you want to specialize you will have to add flags. On the contrary with number 5, people only have to specify a MyIteration() class, or instantiate LazyIteration which can offer more specific behaviour like: LazyIteration(lazy_ns=True, lazy_attr=False, ...). Also in this way you can offer ParallelizedIteration, C++BasedIteration, etc.

In my opinion changing behavior is ideally done by subclassing.

When I built a specific hierarchy for XPath nodes my first goal was to keep them simple as possible, because they are wrapping an optimized structure as cElementTree/lxml, so iter_lazy() is essential to avoid the creation of useless nodes (especially namespace nodes). In this sense the iter() method is pretty an accessory and is not used by the XPath processor.

Optimizing node modules is still an options to improve the speed of this and the dependant packages, but now I'm thinking about an evolution for this to try the application of XPath selectors to similar context like does a nice tool for managing configurations on filesystem paths.

In this case an application of strategy pattern and the factory pattern could be an option for replacing the tree builders in the current version and to create more flexible node structures.

Anyway, I digress, your software is very useful and saved me an enormous amount of time.

Very good.