Closed GoogleCodeExporter closed 9 years ago
D'oh. My "patch" was overly naive. Some kind of stack is probably needed.
This would be trivial in a recursive implementation of depth-first descent
(relying on function stack), but I think it requires an explicit stack when the
traversal is iterative. I'm currently trying to see if I can make this work in
the current implementation without a stack, but it will be more complicated
than my initial naive approach. (Sorry, I just assumed a depth first tree
traversal would be done recursively...)
Original comment by MacRei...@gmail.com
on 8 Dec 2013 at 10:45
OK, the following appears to work, at least for my simple XML files/trees, and
does not require an extra stack (so it should not put any extra memory burden
on anyone, though it does potentially increase processing time, obviously).
Sorry if my editors have mixed spaces and tabs...
Original comment by MacRei...@gmail.com
on 8 Dec 2013 at 11:43
Attachments:
I agree that the current treatment of begin/end does not make sense; the
interface was copied verbatim from pugxml; I intended to fix it before 1.0
release but ended up leaving it as is because it was frequently used. I'm not
sure how often begin/end are being overridden but that's a pretty radical
change; I would rather introduce new members, i.e. enter()/leave() or
node_begin()/node_end() or for_each_begin()/for_each_end().
An alternative is to add a new version of the interface and mark the existing
one deprecated; I'm still figuring out if preserving binary compatibility is a
priority for next pugixml version (may be important since pugixml is being
distributed as a package for some Linux distros; adding virtual functions to an
existing interface is unlikely to preserve ABI).
Original comment by arseny.k...@gmail.com
on 20 Dec 2013 at 5:37
Thank you for considering it. As it turns out, I ended up not needing it, but
if you'd been wanting to do it for awhile anyway, maybe my patch will help some.
(As for why I don't need this change anymore: I was using the entry/exit
callbacks to simulate expat callbacks to support someone else's code. When I
dug deeper into their code, I saw that they'd gone to great lengths in their
expat callbacks to build up context so that they could work in terms of nodes.
Which meant that I was tearing nodes apart to fit their interface so that they
could build the nodes back up to do their processing... So, when I rewrote
their code to just work on the nodes during traversal, it got 20+ times faster,
and works perfectly fine with pugixml as it is. Maybe it's better to _not_
have the entry/exit code so that people (like me) are forced to consider
rewriting other code to be more appropriate. I don't know -- and I don't have
a wide enough experience base to guess.)
Original comment by MacRei...@gmail.com
on 20 Dec 2013 at 2:56
Thank you for the feedback - I did not think about this in such terms. I
assumed that enter/exit callbacks that you use are sufficiently complicated
that you don't want to change their code, but I probably should have suggested
using something other than traverse().
For pugixml 1.0 I actually considered either removing begin/end outright, or
just removing the entire interface. It feels a bit out-of-place, and in general
I find that visitor abstraction for an XML tree is not very valuable -- often
inverting the control by i.e. recursively traversing yourself makes the code
easier.
There is one thing that traverse() does that can make it useful - it performs a
stackless traversal of the subtree, which is not trivial to replicate. The only
other way of doing it without implementing it yourself is to use
xml_node::find_node, but that's a bit of a hack, i.e.:
node.find_node([](xml_node n) -> bool { printf("%s\n", n.name()); return false;
});
A few people have suggested introducing a subtree iterator that would allow you
to control the traversal, which is likely a good direction to go here.
Original comment by arseny.k...@gmail.com
on 20 Dec 2013 at 4:39
The original callbacks do appear complicated -- or at least very large -- at
first sight, hence my initial reluctance to modify them at all. And the rest
of the code has proven to be a bit fragile, adding to my concerns. So I
originally erred on the side of absolute minimal changes to existing code. But
bumping load times from ~3 seconds to up over a minute forces one to
re-evaluate :)
I am still using the traverse() method because it lets me use an only slightly
modified version of the original author's logic. But as I made those
modifications, I saw that it does have some of the "inverted logic" you
mention, where the visitor code says "if my parent is X and my grandparent is
Y, then do F with me." That's hideously inefficient at a low level because of
all the parent lookup and comparison. I realized that I could probably replace
it all with direct path lookups and a few sibling iterations, which should be
noticeably faster. It's all a matter of performance payoff versus the risk of
introducing a subtle bug, and the current approach is "fast enough" while
retaining all of the original logic. But it is rather "bleah..." to look at.
As a side note: I do like the stackless traversal for known upper limits on
memory usage when used on embedded devices, though you'll notice that I wasn't
expecting it in my original patch, which was hopelessly naive and assumed a
recursive process.
At this point, between my own limited experiences and the ones you've
described, I'm not sure it's worth putting in my original request. Your
mention of just removing begin/end sounds reasonable, as their current
implementations are of minimal use and trivial to do as simple calls before and
after traverse(). (Dependent, of course, on your previously mentioned ABI
concerns) Since that would require some updates to the documentation, it might
also be worth adding some quick discussion in the traverse() documentation of
the (lack of) applicability of the visitor patterns in real-world XML.
Thank you for talking this out with me.
Original comment by MacRei...@gmail.com
on 20 Dec 2013 at 5:11
At this point it seems like the only good solution is leaving the original
interface in-tact (compatibility issues prevent me from removing begin/end or
changing the behavior). We'll see if this resurfaces in case anybody else needs
such callbacks; closing as WontFix for now.
Original comment by arseny.k...@gmail.com
on 27 Jan 2014 at 12:52
Original issue reported on code.google.com by
MacRei...@gmail.com
on 8 Dec 2013 at 10:09