python / cpython

The Python programming language
https://www.python.org

pulldom cannot handle xml file with large external entity properly #47067

Open dac1376e-4324-4988-9e8c-35a6b1a43701 opened 16 years ago

dac1376e-4324-4988-9e8c-35a6b1a43701 commented 16 years ago
BPO 2818
Nosy @scoder, @tiran, @websurfer5

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['expert-XML', '3.8', 'performance']
title = 'pulldom cannot handle xml file with large external entity properly'
updated_at =
user = 'https://bugs.python.org/hanselda'
```

bugs.python.org fields:

```python
activity =
actor = 'Jeffrey.Kintscher'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['XML']
creation =
creator = 'hanselda'
dependencies = []
files = []
hgrepos = []
issue_num = 2818
keywords = []
message_count = 1.0
messages = ['66628']
nosy_count = 5.0
nosy_names = ['scoder', 'christian.heimes', 'hanselda', 'mvolz', 'Jeffrey.Kintscher']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue2818'
versions = ['Python 2.7', 'Python 3.8']
```

dac1376e-4324-4988-9e8c-35a6b1a43701 commented 16 years ago

When the xml.dom.pulldom module is used to parse a large XML file and all the information is stored in that one file, the module can process it in the following way without constructing the whole DOM:

import xml.dom.pulldom

# process() stands for the caller's own per-event handler
events = xml.dom.pulldom.parse('file.xml')
for (event, node) in events:
    process(event, node)
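For reference, a minimal self-contained sketch of the same event loop, using an in-memory document instead of 'file.xml'. The sample document and the `process` handler below are hypothetical and only illustrate the pattern:

```python
import xml.dom.pulldom
from xml.dom.pulldom import START_ELEMENT

# Hypothetical sample document, standing in for 'file.xml'.
SAMPLE = "<root><item id='1'>a</item><item id='2'>b</item></root>"


def process(event, node, seen):
    """Hypothetical handler: record the tag name of each element as it starts."""
    if event == START_ELEMENT:
        seen.append(node.tagName)


seen = []
events = xml.dom.pulldom.parseString(SAMPLE)
for event, node in events:
    # Nodes are delivered one event at a time; no subtree is built
    # unless events.expandNode(node) is called explicitly.
    process(event, node, seen)

print(seen)
```
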

But if 'file.xml' contains some large external entities, for example:

<!ENTITY file_external SYSTEM "others.xml">
<body>&file_external;</body>

Then the same loop as above leads to enormous memory usage. I did not run a rigorous benchmark, but in one case a 3 MB external XML file consumed about 1 GB of memory. I suspect that in this case the whole DOM structure is constructed.
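A possible way to reproduce and measure this is sketched below. It is not part of the original report. The helper name `count_items_with_external_entity`, the generated `<item>` content, and the use of `tracemalloc` are assumptions made for illustration. Since Python 3.7.1 the SAX parser no longer processes external general entities by default, so the sketch enables `feature_external_ges` explicitly; that should only be done for trusted input.

```python
import os
import tempfile
import tracemalloc
import xml.sax
import xml.sax.handler
from xml.dom import pulldom


def count_items_with_external_entity(n_items):
    """Parse a document whose body is one external entity.

    Returns (number of <item> elements seen, peak traced memory in bytes).
    Hypothetical helper, for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        external = os.path.join(tmp, "others.xml")
        main = os.path.join(tmp, "file.xml")

        # External entity: a flat run of <item> elements (made-up content).
        with open(external, "w", encoding="utf-8") as f:
            f.write("".join(f"<item>{i}</item>" for i in range(n_items)))

        # Main document: references the external file by absolute path.
        with open(main, "w", encoding="utf-8") as f:
            f.write(
                "<!DOCTYPE body [\n"
                f'<!ENTITY file_external SYSTEM "{external}">\n'
                "]>\n"
                "<body>&file_external;</body>\n"
            )

        # External general entities are off by default since Python 3.7.1.
        parser = xml.sax.make_parser()
        parser.setFeature(xml.sax.handler.feature_external_ges, True)

        tracemalloc.start()
        count = 0
        with open(main, "rb") as stream:
            for event, node in pulldom.parse(stream, parser=parser):
                if event == pulldom.START_ELEMENT and node.tagName == "item":
                    count += 1
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return count, peak


count, peak = count_items_with_external_entity(1000)
print(f"items: {count}, peak traced memory: {peak} bytes")
```

Running it with increasing values of `n_items` and comparing the reported peak against a single-file document of the same size would show whether memory grows with the size of the external entity. `tracemalloc` only reports allocations it can trace, so the figures are a rough indication rather than a full benchmark.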