Open dac1376e-4324-4988-9e8c-35a6b1a43701 opened 16 years ago
When using the xml.dom.pulldom module to parse a large XML file whose content is entirely in one file, the module can process it incrementally, without constructing the whole DOM:
```python
import xml.dom.pulldom

events = xml.dom.pulldom.parse('file.xml')
for event, node in events:
    process(event, node)
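A runnable version of this pattern, sketched with `parseString` and a small inline document (the element names here are illustrative, not from the report):

```python
from xml.dom.pulldom import parseString, START_ELEMENT

# Stream events from a small in-memory document; only the current
# node is materialized at each step, not the whole tree.
doc = "<root><item>a</item><item>b</item></root>"
names = []
for event, node in parseString(doc):
    if event == START_ELEMENT:
        names.append(node.tagName)
print(names)  # ['root', 'item', 'item']
```

The same loop works with `xml.dom.pulldom.parse('file.xml')` for on-disk files.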
But if 'file.xml' references a large external entity, for example:

```xml
<!ENTITY file_external SYSTEM "others.xml">
<body>&file_external;</body>
```
then the same Python snippet above leads to enormous memory usage. I did not run a precise benchmark, but in one case a 3 MB external XML file consumed about 1 GB of memory. I suspect that in this case the whole DOM structure is being constructed.
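A minimal sketch of the reported setup. The file names mirror the report, while the 1000-item payload is a small stand-in for the large external file; note that since Python 3.7.1 the SAX layer skips external general entities by default, so the sketch opts back in via `feature_external_ges` to make the entity content stream through the event loop at all. Scaling the payload up is how one would try to reproduce the reported memory growth.

```python
import os
import tempfile
import xml.dom.pulldom
from xml.sax import make_parser, handler

# Hypothetical reproduction: write a "large" external entity file and a
# main document that pulls it in via an external general entity.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "others.xml"), "w") as f:
    f.write("<item>payload</item>" * 1000)  # stand-in for the large file
with open(os.path.join(tmp, "file.xml"), "w") as f:
    f.write('<!DOCTYPE body [<!ENTITY file_external SYSTEM "others.xml">]>'
            '<body>&file_external;</body>')

# External general entities are ignored by default on modern Pythons;
# enable them explicitly to exercise the behaviour the report describes.
parser = make_parser()
parser.setFeature(handler.feature_external_ges, True)

os.chdir(tmp)  # the relative SYSTEM id resolves against the working directory
starts = 0
for event, node in xml.dom.pulldom.parse("file.xml", parser=parser):
    if event == xml.dom.pulldom.START_ELEMENT:
        starts += 1
print(starts)  # 1 <body> plus 1000 <item> elements -> 1001
```

The entity content arrives through the same event stream as inline content, so any extra memory beyond the expanded size would point at tree construction inside the parser rather than at the event loop itself.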
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['expert-XML', '3.8', 'performance']
title = 'pulldom cannot handle xml file with large external entity properly'
updated_at =
user = 'https://bugs.python.org/hanselda'
```
bugs.python.org fields:
```python
activity =
actor = 'Jeffrey.Kintscher'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['XML']
creation =
creator = 'hanselda'
dependencies = []
files = []
hgrepos = []
issue_num = 2818
keywords = []
message_count = 1.0
messages = ['66628']
nosy_count = 5.0
nosy_names = ['scoder', 'christian.heimes', 'hanselda', 'mvolz', 'Jeffrey.Kintscher']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue2818'
versions = ['Python 2.7', 'Python 3.8']
```