Remove "lightweight" from minidom description

scoder commented 13 years ago

BPO	11379
Nosy	@loewis, @freddrake, @rhettinger, @orsenthil, @pitrou, @scoder, @ezio-melotti, @merwok, @florentx
Files	issue_11379.1.patch minidom-desc.diff minidom-desc-2.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = 'https://github.com/ezio-melotti'
closed_at =
created_at =
labels = ['expert-XML', 'type-feature', 'docs']
title = 'Remove "lightweight" from minidom description'
updated_at =
user = 'https://github.com/scoder'
```

bugs.python.org fields:

```python
activity =
actor = 'python-dev'
assignee = 'ezio.melotti'
closed = True
closed_date =
closer = 'ezio.melotti'
components = ['Documentation', 'XML']
creation =
creator = 'scoder'
dependencies = []
files = ['24686', '24707', '24732']
hgrepos = []
issue_num = 11379
keywords = ['patch']
message_count = 62.0
messages = ['129914', '129918', '129934', '129936', '129937', '129939', '129944', '129951', '148512', '148558', '148562', '148565', '148566', '148570', '148572', '148578', '148579', '148584', '148585', '148594', '148598', '149604', '149611', '149634', '152836', '152862', '152866', '152924', '154646', '154660', '154673', '154676', '154677', '154735', '154736', '154737', '154756', '154759', '154760', '154766', '154767', '154772', '154773', '154774', '154924', '154952', '154954', '155074', '155077', '155078', '155081', '156021', '156027', '159562', '159569', '177972', '180368', '180374', '180409', '180434', '180435', '206801']
nosy_count = 12.0
nosy_names = ['loewis', 'fdrake', 'rhettinger', 'orsenthil', 'pitrou', 'scoder', 'ezio.melotti', 'eric.araujo', 'flox', 'docs@python', 'tshepang', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue11379'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']
```

scoder commented 13 years ago

http://docs.python.org/library/xml.dom.minidom.html

presents MiniDOM as a "Lightweight DOM implementation". The word "lightweight" is easily misunderstood as meaning "efficient" or "memory friendly". MiniDOM is well known to be neither of the two.

The first paragraph then continues:

""" xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller. """

Again, "smaller" can be misread as "low memory footprint", whereas it is actually supposed to refer to an incomplete DOM API implementation. And "simpler" is also clearly exaggerated when compared to the alternative ElementTree package.

I would like to see this changed and combined with a clear and visible comment that MiniDOM has a very high resource profile, e.g.

""" 19.7. xml.dom.minidom — Pure Python DOM implementation

xml.dom.minidom is a pure Python implementation of the Document Object Model interface, as known from other programming languages. It is intended to provide a smaller API than the full DOM.

Note, however, that MiniDOM has a very large memory footprint compared to other Python XML libraries. If you need a fast and memory friendly XML tree implementation with a vastly simpler API, use the xml.etree package instead. """

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

-1. The description is factually correct - minidom *does* have a lower footprint than other Python DOM implementations (such as 4DOM).

scoder commented 13 years ago

Well, I'm not aware of many people who use 4DOM these days, and if that's what it's meant to refer to, maybe that should be made more obvious, because it currently is not at all. Even cDomlette uses only half of the memory according to

http://effbot.org/zone/celementtree.htm

When you say that the description is "factually correct", that does by no means imply that the average reader will understand how it's meant. My point is that almost everyone who reads this will draw the wrong conclusions.

Also, when you say "lower footprint", that does not yet make it "light weight" in any way. It still uses something like ten times as much memory as cElementTree or lxml in Python 2 (and likely much more than even that in Python 3), and still something like 4-5 times as much as plain Python ElementTree. That's a huge difference.

What about this phrasing then:

""" MiniDOM has a smaller memory footprint than some of the other DOM compliant implementations for Python (such as 4DOM), but uses about 10x more memory than the faster and simpler xml.etree.cElementTree module. """

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

> What about this phrasing then:
>
> """ MiniDOM has a smaller memory footprint than some of the other DOM compliant implementations for Python (such as 4DOM), but uses about 10x more memory than the faster and simpler xml.etree.cElementTree module. """

But that's not a DOM implementation - so it would be comparing apples and oranges.

scoder commented 13 years ago

It's the tree based API most python users are parsing XML with, though. So I do not agree that it's comparing apples and oranges, not at all. It's comparing tree based XML libraries, only one of which is worth being called "light weight", and that's not the one that is currently carrying that name.

I think it's worth telling new users what they are committing to when they write code that uses MiniDOM. The documentation should allow them to understand that.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

> It's the tree based API most python users are parsing XML with, though. So I do not agree that it's comparing apples and oranges, not at all. It's comparing tree based XML libraries, only one of which is worth being called "light weight", and that's not the one that is currently carrying that name.

If that is a real concern, I'd rather reduce the memory footprint of minidom than put actual performance figures into the documentation that will likely outdate over time.

Notice that the documentation doesn't claim that it is a lightweight XML library, only that it's a lightweight DOM implementation. SAX is, of course, even lighter-weight.

scoder commented 13 years ago

> If that is a real concern, I'd rather reduce the memory footprint of minidom than put actual performance figures into the documentation that will likely outdate over time.

Personally, I do not think it's worth putting much work into MiniDOM. I'd rather deprecate it to prevent new code from being written for it, but that's just my personal opinion, and this is the wrong place to discuss that. Given the current performance characteristics, I wouldn't be surprised if there was quite some room for improvements left in the xml.dom package.

If you dislike the "10x", feel free to use "several times". I doubt that MiniDOM will ever get so much closer to cET and lxml to prove that phrasing wrong.

> Notice that the documentation doesn't claim that it is a lightweight XML library, only that it's a lightweight DOM implementation.

I imagine that you are as aware as I am that this nuance is easy to miss, especially for a new user. From my experience, it is very common for users, especially those with a Java-ish background, to confuse the terms "DOM" and "XML tree API/library". Hence my push to change the documentation.

> SAX is, of course, even lighter-weight.

Not so much more light weight than cET's iterparse(), but that's getting OT here.

Stefan

pitrou commented 13 years ago

Agreed with Stefan's concern.

scoder commented 12 years ago

Ok, so, what do we make of this? I proposed improvements to the wording in the documentation, which make it much clearer for users what they are buying into when they start using minidom. I still think that "factually correct" but clearly misleading documentation is not helpful and that it needs fixing. Here is an updated phrasing that I hope we can settle on:

""" :mod:`xml.dom.minidom` --- Pure Python DOM implementation

[...]

:mod:`xml.dom.minidom` is a pure Python implementation of the Document Object Model interface, as known from other programming languages. It is intended to provide a smaller and simpler API than the full W3C DOM.

Note that MiniDOM has a several times larger memory footprint than :mod:`xml.etree.ElementTree`, the light-weight Python XML library in the standard library. If you do not need a (mostly) compliant W3C DOM implementation, but a fast and memory friendly XML tree implementation with an easy to learn API, use that instead. """

merwok commented 12 years ago

Is memory footprint something important enough to put in the doc? Ease of use is IMO more important, but then it becomes subjective..

scoder commented 12 years ago

I find a factor of an order of magnitude worth mentioning, because it prevents certain kinds of usages.

ezio-melotti commented 12 years ago

Usually we don't talk about performance in the doc, and in my personal experience I didn't notice any major difference between the different implementations (but then again I haven't used them much). Talking about the other implementations and their advantages/disadvantages is fine, but things like "MiniDOM has a several times larger memory footprint" seem like FUD to me (see also http://docs.python.org/dev/documenting/style.html#affirmative-tone).

freddrake commented 12 years ago

Removing "Lightweight" and changing the first paragraph to (something like)

:mod:`xml.dom.minidom` is an implementation of the Document Object Model interface. The API is slightly simpler than the full W3C DOM, but the implementation has a significantly higher memory footprint than :mod:`xml.etree.ElementTree`.

would be entirely reasonable.

(I don't think it's wrong to discuss relative memory footprints in comparison to other modules in the standard library.)

scoder commented 12 years ago

I don't think "FUD" is a suitable term for the rather minidom-friendly wording in my last proposal. Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject. If 4DOM, which Martin cites, is really worse in terms of performance (I never used it), it must truly be the only existing species of that kind.

Still, here's a cleaned up version of Fred's proposal that I could live with:

""" :mod:`xml.dom.minidom` --- Pure Python DOM implementation

:mod:`xml.dom.minidom` is an implementation of the Document Object Model interface. The API is (intentionally) slightly simpler than the full W3C DOM, but the implementation has a significantly higher memory footprint than the XML tree library in :mod:`xml.etree.ElementTree`. """

pitrou commented 12 years ago

> I don't think "FUD" is a suitable term for the rather minidom-friendly wording in my last proposal. Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject.

If it's both slow and memory-hungry, perhaps use the more generic "performance" instead of "memory footprint"?

ezio-melotti commented 12 years ago

> Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject.

Do you have any link? My point is that if you say things like "significantly/several times higher memory footprint than X" you are basically scaring the users away from the module. If for an average document it takes, say, 30-50MB of memory, it seems perfectly reasonable to me, even if ElementTree takes 3-5MB. I would actually consider 100-200MB still ok too, unless I have to parse a lot of documents or I'm running low on memory for other reasons.

pitrou commented 12 years ago

> My point is that if you say things like "significantly/several times higher memory footprint than X" you are basically scaring the users away from the module.

Only those users who know they'll be processing significantly large documents. I don't think "scaring away people" is a good enough reason *not* to document performance characteristics. For example, we already mention that string joining is faster than repeated concatenation; I haven't heard anyone complain that it scared people away from string concatenation. And while it's true that we shouldn't try to document performance characteristics *too* precisely, it is still a good thing to document the most outstanding facts (for example, C accelerator modules are clearly superior in performance to pure Python modules; should we shy away from documenting that, and instead present it as some kind of neutral choice?).

And, of course, if minidom gets some serious performance attention, the claims will have to be revisited. But given the amount of attention minidom gets at all, it sounds rather implausible.

> If for an average document it takes, say, 30-50MB of memory, it seems perfectly reasonable to me, even if ElementTree takes 3-5MB. I would actually consider 100-200MB still ok too

Some use cases would not really like a 100-200MB memory consumption, or even 50MB. Think a long-running daemon, for instance.

scoder commented 12 years ago

Ezio Melotti, 29.11.2011 16:26:

> > Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject.
>
> Do you have any link?

I just did a quick Google search for "python minidom benchmark" and found these:

http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Python/xml-matters/page2.html

http://effbot.org/zone/celementtree.htm#benchmarks

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Note that all three authors risk being biased, but given how similar the results are, I tend to believe them.

Stefan

pitrou commented 12 years ago

> I just did a quick Google search for "python minidom benchmark" and found these:
>
> http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Python/xml-matters/page2.html
>
> http://effbot.org/zone/celementtree.htm#benchmarks
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Note that all three authors risk being biased, but given how similar the results are, I tend to believe them.

Thanks for the links. The performance gap looks significant enough to be mentioned, at least generically.

scoder commented 12 years ago

Given that the links were generally somewhat dated and used Py2.x instead of the post-PEP393 Py3.3, here is another little benchmark, comparing the parser performance of minidom to lxml.etree (latest), ElementTree and cElementTree (stdlib) in a recent Py3.3 build (e66b7c62eec0), everything properly optimised for my platform (Linux 64bit). I used os.fork() to start a new process after importing everything and reading the file a couple of times, and before parsing. The memory usage is measured inside of the forked child using the resource module's ru_maxrss value, so it correlates with the growth of CPython's memory heap after parsing, thus giving an estimate of the maximum amount of memory used during parsing and tree building.
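
A minimal sketch of the measurement approach described above, for readers who want to try it themselves. It is not the original benchmark script: the helper names (`make_sample_xml`, `measure_in_child`) and the synthetic input document are assumptions made for illustration, and only the two standard library tree builders are compared (lxml is a third-party package, and `xml.etree.cElementTree` was removed in Python 3.9). It assumes a Unix platform where `os.fork()` is available.

```python
import io
import os
import resource
import time
import xml.etree.ElementTree as ET
from xml.dom import minidom


def make_sample_xml(n_items=2000):
    """Build a small synthetic document (a stand-in, not hamlet.xml)."""
    items = "".join(
        "<item id='%d'><name>item %d</name></item>" % (i, i)
        for i in range(n_items)
    )
    return ("<root>%s</root>" % items).encode("utf-8")


def measure_in_child(parse, data):
    """Run parse(data) in a forked child; return (seconds, ru_maxrss growth).

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        status = 1
        try:
            os.close(read_end)
            before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            start = time.perf_counter()
            tree = parse(data)  # keep a reference so the tree stays alive
            elapsed = time.perf_counter() - start
            after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            report = "%f %d" % (elapsed, after - before)
            os.write(write_end, report.encode())
            status = 0
        finally:
            # Always leave the child here, without running parent cleanup.
            os._exit(status)
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        elapsed, growth = pipe.read().split()
    os.waitpid(pid, 0)
    return float(elapsed), int(growth)


PARSERS = {
    "xml.etree.ElementTree": lambda data: ET.parse(io.BytesIO(data)),
    "xml.dom.minidom": lambda data: minidom.parseString(data),
}

sample = make_sample_xml()
results = {name: measure_in_child(parse, sample) for name, parse in PARSERS.items()}
for name, (elapsed, growth) in results.items():
    print("%-24s %.3f s, maxrss +%d" % (name, elapsed, growth))
```

Forking after the input has been loaded means each parser starts from the same process state, so the growth in `ru_maxrss` mostly reflects parsing and tree building. The absolute numbers depend on the platform, the Python version, and the input, so they should not be expected to match the figures quoted in this thread.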

Parsing hamlet.xml in English, 274KB:

    Memory usage: 7284
    xml.etree.ElementTree.parse done in 0.104 seconds
    Memory usage: 14240 (+6956)
    xml.etree.cElementTree.parse done in 0.022 seconds
    Memory usage: 9736 (+2452)
    lxml.etree.parse done in 0.014 seconds
    Memory usage: 11028 (+3744)
    minidom tree read in 0.152 seconds
    Memory usage: 30360 (+23076)

Parsing the old testament in English (ot.xml, 3.4MB) into memory:

    Memory usage: 20444
    xml.etree.ElementTree.parse done in 0.385 seconds
    Memory usage: 46088 (+25644)
    xml.etree.cElementTree.parse done in 0.056 seconds
    Memory usage: 32628 (+12184)
    lxml.etree.parse done in 0.041 seconds
    Memory usage: 37500 (+17056)
    minidom tree read in 0.672 seconds
    Memory usage: 110428 (+89984)

A 25MB XML file with Slavic Unicode text content:

    Memory usage: 57368
    xml.etree.ElementTree.parse done in 3.274 seconds
    Memory usage: 223720 (+166352)
    xml.etree.cElementTree.parse done in 0.459 seconds
    Memory usage: 154012 (+96644)
    lxml.etree.parse done in 0.454 seconds
    Memory usage: 135720 (+78352)
    minidom tree read in 6.193 seconds
    Memory usage: 604860 (+547492)

And a contrived 4.5MB XML file with a lot more structure than data:

    Memory usage: 13308
    xml.etree.ElementTree.parse done in 4.178 seconds
    Memory usage: 222088 (+208780)
    xml.etree.cElementTree.parse done in 0.478 seconds
    Memory usage: 103056 (+89748)
    lxml.etree.parse done in 0.199 seconds
    Memory usage: 101860 (+88552)
    minidom tree read in 8.705 seconds
    Memory usage: 810964 (+797656)

Things to note: The factor of 5-10 for the memory overhead compared to cET depends heavily on the data. Also, minidom is consistently slower by more than a factor of 10 compared to the fastest parser (apparently the one in libxml2/lxml.etree, both of which surely can't be said to provide fewer features than the DOM that minidom implements).

scoder commented 12 years ago

Hmm, looks like I messed up the last example. I accidentally left in the formatting whitespace, thus growing the file to 6.2 MB. Removing that, I get this for the (now really) 4.5 MB XML file with lots of structure and very little data:

    Memory usage: 11600
    xml.etree.ElementTree.parse done in 3.374 seconds
    Memory usage: 203420 (+191820)
    xml.etree.cElementTree.parse done in 0.192 seconds
    Memory usage: 36444 (+24844)
    lxml.etree.parse done in 0.131 seconds
    Memory usage: 62648 (+51048)
    minidom tree read in 5.935 seconds
    Memory usage: 527684 (+516084)

It's actually surprising how much of a difference trailing whitespace content makes in minidom (from 2MB on disk to 300MB in memory???), most likely due to the usage of dedicated DOM text nodes in the tree.
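
The text node explanation can be illustrated with a small sketch. It only counts objects and does not measure memory, and the two sample documents and the helper names are made up for illustration. In minidom, every run of formatting whitespace between elements becomes its own `Text` node in the tree, whereas ElementTree keeps that whitespace as plain strings in the `text` and `tail` attributes of existing elements.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# The same data, once with formatting whitespace and once without.
PRETTY = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>"
COMPACT = "<root><a>1</a><b>2</b></root>"


def count_dom_nodes(node):
    """Count a minidom node plus all of its descendants."""
    return 1 + sum(count_dom_nodes(child) for child in node.childNodes)


def count_et_elements(xml_text):
    """Count the Element objects ElementTree creates for a document."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())


pretty_dom = count_dom_nodes(minidom.parseString(PRETTY).documentElement)
compact_dom = count_dom_nodes(minidom.parseString(COMPACT).documentElement)
pretty_et = count_et_elements(PRETTY)
compact_et = count_et_elements(COMPACT)

print("minidom nodes:        pretty=%d compact=%d" % (pretty_dom, compact_dom))
print("ElementTree elements: pretty=%d compact=%d" % (pretty_et, compact_et))
```

In this tiny example the formatted document adds three whitespace-only text nodes under minidom, while the ElementTree element count stays the same for both inputs.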

PS: I think the "XML/performance" tags on this bug would hint at a separate ticket. This is really meant as a documentation bug.

scoder commented 12 years ago

I started a mailing list thread on the same topic:

http://thread.gmane.org/gmane.comp.python.devel/127963

Especially see

http://thread.gmane.org/gmane.comp.python.devel/127963/focus=128162

where I extract a proposal from the discussion. Basically, there should be a note at the top of the xml.dom documentation as follows:

""" [[Note: The xml.dom.minidom module provides an implementation of the W3C-DOM whose API is similar to that in other programming languages. Users who are unfamiliar with the W3C-DOM interface or who would like to write less code for processing XML files should consider using the xml.etree.ElementTree module instead.]] """

I think this should go on the xml.dom.minidom page as well as the xml.dom package page. Hand-wavingly, users who are new to the DOM are more likely to hit the package page first, whereas those who know it already will likely find the MiniDOM page directly.

Note that I'd still encourage the removal of the misleading word "lightweight" until it makes sense to put it back in a meaningful way. I therefore propose the following minimalistic changes to the first paragraph on the minidom page:

""" xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also [+XXX: provide a] significantly smaller [+XXX: API]. """

Additionally, the documentation on the xml.sax page would benefit from the following paragraph:

""" [[Note: The xml.sax package provides an implementation of the SAX interface whose API is similar to that in other programming languages. Users who are unfamiliar with the SAX interface or who would like to write less code for efficient stream processing of XML files should consider using the iterparse() function in the xml.etree.ElementTree module instead.]] """

ezio-melotti commented 12 years ago

> xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface.

This is ok.

> It is intended to be simpler than the full DOM and also [+XXX: provide a] significantly smaller [+XXX: API].

Doesn't "simpler" here refer to the API already?

Another option is to add somewhere a section like: "If you have to work with XML, ElementTree is usually the best choice, because it has a simple API and it's efficient [or whatever]. xml.dom.minidom provides a subset of the W3C-DOM API, and xml.sax a SAX interface.", possibly expanding a bit on the differences and showing a minimal example with the 3 different implementations, and then link to it from the other modules' pages.
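
A minimal example of the kind suggested here might look like the following sketch, which extracts the same data with all three standard library interfaces. The sample document and the function and class names are made up for illustration; this is not text from the Python documentation.

```python
import xml.etree.ElementTree as ET
import xml.sax
from xml.dom import minidom

DOC = (
    "<library>"
    "<book><title>Hamlet</title></book>"
    "<book><title>Macbeth</title></book>"
    "</library>"
)


def titles_etree(text):
    """ElementTree: iterate over matching elements and read .text."""
    return [elem.text for elem in ET.fromstring(text).iter("title")]


def titles_minidom(text):
    """minidom: text content lives in separate child Text nodes."""
    dom = minidom.parseString(text)
    return [
        "".join(
            child.data
            for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        for node in dom.getElementsByTagName("title")
    ]


class TitleHandler(xml.sax.ContentHandler):
    """SAX: collect character data between <title> start and end events."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # list of text chunks while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


def titles_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles


for extract in (titles_etree, titles_minidom, titles_sax):
    print(extract.__name__, extract(DOC))
```

All three functions return the same list of titles. The difference is in how much code and bookkeeping each interface needs for the task.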

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

"If you have to work with XML, ElementTree is usually the best choice, because it has a simple API and it's efficient [or whatever].

I still object to such wording, for many reasons.

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

IMHO this wording proposed by Stefan:

""" [[Note: The xml.dom.minidom module provides an implementation of the W3C-DOM whose API is similar to that in other programming languages. Users who are unfamiliar with the W3C-DOM interface or who would like to write less code for processing XML files should consider using the xml.etree.ElementTree module instead.]] """

Sounds very reasonable. Perhaps something about a more Pythonic API can also be added there, in addition to "to write less code".

Any objections?

orsenthil commented 12 years ago

On Wed, Feb 08, 2012 at 03:42:16AM +0000, Eli Bendersky wrote:

> Any objections?

None. The explanation sounds reasonable.

merwok commented 12 years ago

+1 to the suggested wording.

-1 to talking about a more pythonic API.

(Want a nit? s/W3C-DOM/W3C DOM/)

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

Martin, do you find the wording I quoted (without the reference to a more Pythonic API) acceptable?

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

The wording in msg152836 is fine with me, in particular as it doesn't make any performance claims.

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

I'm attaching a patch for Doc/library/xml.dom.minidom.rst

It adds the note as phrased by Stefan, with a tiny wording change to make the first sentence less ambiguous.

merwok commented 12 years ago

I’m not sure I would use note markup, though (cf. Raymond’s aversion to littering the doc with note and warning boxes).

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

> I’m not sure I would use note markup, though (cf. Raymond’s aversion to littering the doc with note and warning boxes).

I also dislike box littering, but this one seems like a really good fit for a note, since it's completely outside the flow of that documentation page.

rhettinger commented 12 years ago

This is a reasonable case for a note.

1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 12 years ago

New changeset 81e606862a89 by Eli Bendersky in branch '3.2': Issue bpo-11379: add a note in xml.dom.minidom suggesting to use etree in some cases http://hg.python.org/cpython/rev/81e606862a89

1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 12 years ago

New changeset ccd16ad37544 by Eli Bendersky in branch '2.7': Issue bpo-11379: add a note in xml.dom.minidom suggesting to use etree in some cases http://hg.python.org/cpython/rev/ccd16ad37544

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

Committed to 2.7, 3.2 and 3.3

I suppose this issue can be closed now?

scoder commented 12 years ago

Thanks Eli.

What about the "Lightweight DOM implementation", though? Following Martin's comment that performance characteristics (like "fast", "memory friendly" or "lightweight") should normally not be documented, I'm still suggesting to replace it with a less easily misinterpreted phrase like "W3C DOM implementation".

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

Stefan, frankly I'm not familiar enough with either xml.dom or xml.dom.minidom to have a solid opinion at this point.

merwok commented 12 years ago

I think I’ve always understood “lightweight” to mean “minimal”. xml.dom provides minidom, a basic implementation, pulldom, a different implementation, and other libraries such as 4Dom are full-fledged implementations. So “lightweight” is not a problem to me (but I acknowledge that it might be misleading for other people), especially given that I think that DOM itself is not elegant or lightweight (as in “conceptually small”).

pitrou commented 12 years ago

> I think I’ve always understood “lightweight” to mean “minimal”.

Then how about saying "minimal" instead of "lightweight"? (also, it seems it really means "incomplete" or "partial", which are of course less positive sounding)

ezio-melotti commented 12 years ago

"Minimal" sounds good to me, it also matches the name of the module.

merwok commented 12 years ago

Right, patch for 3.2. Also edited the module docstring (info taken from the docstring of xml.dom). BTW I really think we could have avoided some verbosity by adding the recommendation to use xml.etree in the first paragraph of Doc/library/xml.dom.minidom.rst.

merwok commented 12 years ago

s/Mininal/Minimal/ in the synopsis

scoder commented 12 years ago

Yes, I think that's better.

merwok commented 12 years ago

This alternate version of my patch (a) merges the first two paragraphs to make the intro less redundant and heavy, and (b) slightly reorganizes the list of modules in Doc/library/markup.rst to have xml.etree first and pyexpat (less interesting for most people) at the end. Tell me if you prefer this version, or if I should commit the first one (possibly with the (b) change).

1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 12 years ago

New changeset d99c0a4b66f3 by Éric Araujo in branch '3.2': Move xml.etree higher and xml.parsers.expat lower in the markup ToC. http://hg.python.org/cpython/rev/d99c0a4b66f3

1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 12 years ago

New changeset fc32753feb0a by Éric Araujo in branch '2.7': Move xml.etree higher and xml.parsers.expat lower in the markup ToC. http://hg.python.org/cpython/rev/fc32753feb0a

merwok commented 12 years ago

FYI, note that http://wiki.python.org/moin/MiniDom says this about minidom: “slow and very memory hungry DOM implementation”.

As you have seen, I have applied my ToC order change. Now, in order to commit my s/lightweight/minimal/ change and close this report, Eli, can you say whether minidom-desc-2 is okay (I’m asking you because this patch touches text you just added, contrary to minidom-desc)?

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

> FYI, note that http://wiki.python.org/moin/MiniDom says this about minidom: “slow and very memory hungry DOM implementation”.

Thanks for the notice; I have now fixed that wording.

786d3f11-b763-4414-a03f-abc264e0b72d commented 12 years ago

Éric,

I'm ok with replacing "lightweight" by "minimal", unless others have objections. Regarding the specifics of the minidom-desc-2.diff patch:

"proficient with the DOM"

I'm not sure "the DOM" is semantically correct. "the W3C-DOM interface" is more precise.

Also, I still think that a note would be more appropriate, but I don't care enough to argue about it :)

python / cpython

Remove "lightweight" from minidom description #55588