When `validate.py` has to process huge XML files (for instance, 6 to 7 GB), a not-so-evident memory leak related to `iterparse` becomes apparent. The leak occurs with both the `lxml.etree` and `xml.etree.ElementTree` implementations. With such an input file and the `mapping.json.gz` from QfO2020, the process ends up using around 4.4 to 5.0 GB of memory, depending on the parser implementation being used.
This pull request consists of two commits. The first one contains a variation of the technique described at https://web.archive.org/web/20210309115224/http://www.ibm.com/developerworks/xml/library/x-hiperfparse/#listing4 to avoid the XML parsing memory leak. With it, memory usage is stable at 1 GB in the very same scenario described at the beginning.
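
For readers unfamiliar with the technique, here is a minimal sketch of the general idea behind the first commit: release each element as soon as it has been processed, so the partially built tree does not keep growing. It is not the code from the commit. The helper name `iter_elements` and the sample document are hypothetical, and the sketch assumes the elements of interest are not nested inside each other. It uses only the standard library `xml.etree.ElementTree`.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield every completed <tag> element, then drop it from the tree.

    Hypothetical helper for illustration only. Assumes <tag> elements
    are not nested inside other <tag> elements.
    """
    open_elements = []  # stack of elements whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue
        # "end" event: elem is complete and its parent is now on top of the stack
        open_elements.pop()
        if elem.tag == tag:
            yield elem
            # The caller is done with elem once the generator resumes:
            # free its contents and detach it from its parent.
            elem.clear()
            if open_elements:
                open_elements[-1].remove(elem)


# Hypothetical sample input, standing in for a multi-gigabyte file.
sample = io.BytesIO(
    b"<root><group><rec id='1'>a</rec><rec id='2'>b</rec></group>"
    b"<rec id='3'>c</rec></root>"
)
records = [(elem.get("id"), elem.text) for elem in iter_elements(sample, "rec")]
```

Each yielded element must be consumed before asking for the next one, because it is emptied as soon as the generator resumes. Detaching the element from its parent, and not only calling `clear()`, is what keeps the parent from accumulating empty children over a long parse.
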
The second commit contains several optimizations: it avoids several concatenations and reorders the conditions so that the most common cases are checked first.