qfo / benchmark-webservice

This repository contains the codebase that runs the webservice to benchmark orthology predictions on a common reference proteome dataset.
http://orthology.benchmark-service.org

Fixes and improvements on validation #3

Closed jmfernandez closed 3 years ago

jmfernandez commented 3 years ago

When validate.py has to deal with huge XML files (for instance, 6 to 7 GB), a not-so-evident memory leak related to iterparse shows up. The leak happens with both the lxml.etree and xml.etree.ElementTree implementations. With that input file and the mapping.json.gz from QfO2020, the process ends up using around 4.4 to 5.0 GB of memory, depending on the parser implementation being used.

This pull request consists of two commits. The first one contains a variation of what is described at https://web.archive.org/web/20210309115224/http://www.ibm.com/developerworks/xml/library/x-hiperfparse/#listing4 to avoid the XML parsing memory leak. With it, memory usage is stable at 1 GB in the very same scenario described at the beginning.
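
For readers unfamiliar with the technique, here is a minimal sketch of the general idea, not the code in this pull request. With iterparse, calling `clear()` on a processed element is not enough, because the emptied element stays attached to its parent and the tree keeps growing. The sketch below uses only the standard library `xml.etree.ElementTree` and detaches each processed element from its parent. The function name `iter_and_free`, the `handler` callback and the `item` tag are hypothetical names chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_and_free(source, tag, handler):
    """Stream `source`, call handler(elem) for every closed `tag` element,
    then detach that element from the tree so it can be garbage collected.

    Illustrative sketch only; names are hypothetical.
    """
    ancestors = []  # stack of currently open elements
    count = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            ancestors.append(elem)
            continue
        # "end" event: elem is the top of the stack, pop it
        ancestors.pop()
        if elem.tag == tag:
            handler(elem)
            count += 1
            # Free the element's own content...
            elem.clear()
            # ...and, crucially, remove the empty shell from its parent,
            # otherwise the parent's child list keeps growing.
            if ancestors:
                ancestors[-1].remove(elem)
    return count


# Tiny in-memory document standing in for a multi-gigabyte file
data = b"<root><group><item>a</item><item>b</item><item>c</item></group></root>"
seen = []
processed = iter_and_free(io.BytesIO(data), "item", lambda e: seen.append(e.text))
print(processed, seen)
```

One caveat of this sketch: if elements with the target tag can be nested inside each other, the inner ones are already detached by the time the outer one reaches the handler, so the handler has to extract everything it needs from each element when it is called.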

The second commit contains several optimizations: it avoids several concatenations and tweaks the conditions so that the most common cases are checked first.