levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

OSError: [Errno 12] Cannot allocate memory #39

Closed. hugokitano closed this issue 3 years ago.

hugokitano commented 3 years ago

Hello there,

We are using the FeatureXML reader from pyteomics.openms.featurexml and running into a memory error. The FeatureXML file is 1.6 GB, while the Docker container we are using has 32 GB of memory, so parsing should fit comfortably.

from pyteomics.openms.featurexml import FeatureXML, read

featurexml_iterator = read(featurexml_filename)

features = []

for i, feature in enumerate(featurexml_iterator):
    if int(feature['charge']) > 1:
        features.append(feature)

The code fails due to some sort of memory error.

Traceback (most recent call last):
  File "/app/app.py", line 120, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/app/app.py", line 112, in main
    trace_detection_runner(parquet_filename, featurexml_filename, buckets_file, trace_filename, s3_trace_filepath, msrun, msrun_id)
  File "/usr/local/lib/python3.7/dist-packages/src/trace_detection.py", line 80, in trace_detection_runner
    get_features_only(featurexml_filename, mongo_client, msrun, msrun_id)
  File "/usr/local/lib/python3.7/dist-packages/src/trace_detection.py", line 265, in get_features_only
    for i, feature in enumerate(featurexml_iterator):
  File "/usr/local/lib/python3.7/dist-packages/pyteomics/auxiliary/file_helpers.py", line 176, in __next__
    return next(self._reader)
  File "/usr/local/lib/python3.7/dist-packages/pyteomics/xml.py", line 1261, in __next__
    return next(self._iterator)
  File "/usr/local/lib/python3.7/dist-packages/pyteomics/xml.py", line 572, in _iterfind_impl
    for ev, elem in etree.iterparse(self, events=('start', 'end'), remove_comments=True, huge_tree=self._huge_tree):
  File "src/lxml/iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 194, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 219, in lxml.etree.iterparse._read_more_events
OSError: [Errno 12] Cannot allocate memory

To me, it seems like this should really work. This is the first step of our program so no other memory has been allocated yet. Any suggestions? Thanks!

Hugo

levitsky commented 3 years ago

Hi!

I would start by checking how much memory the features list is using. Something like this should help. It's expected that the Python objects occupy much more memory than the XML describing them. For instance, the list of 2 features from an 8 KB test file included with the test suite occupies 25 KB in memory when parsed. The reader itself takes 18 KB, and that number doesn't change while iterating over the features.
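One possible way to measure this is a minimal sketch that walks nested containers and sums `sys.getsizeof` over everything reachable. It is not necessarily the approach originally referred to above. The helper name `deep_getsizeof` and the sample record are hypothetical, and the record is only loosely shaped like a parsed feature, not real pyteomics output.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Approximate the total size in bytes of obj and everything it contains."""
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0  # already counted (shared or self-referencing object)
    seen.add(obj_id)

    # sys.getsizeof reports only the object itself, not what it refers to.
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_getsizeof(key, seen) + deep_getsizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_getsizeof(item, seen)
    return size


# Hypothetical record for illustration only (assumed shape, not real parser output).
feature = {"id": "f_1", "charge": 2, "position": [{"dim": "0", "position": 1234.5}]}
features = [feature, dict(feature)]

shallow = sys.getsizeof(features)  # size of the list object alone
deep = deep_getsizeof(features)    # list plus all nested dicts, lists and scalars
print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

Calling the same helper on the real `features` list after parsing a small file gives a rough bytes-per-feature figure that can be scaled up to the 1.6 GB input. The result is an estimate: only dicts, lists, tuples and sets are traversed, so other object types are counted at their shallow size.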

hugokitano commented 3 years ago

Thanks, found a way to refactor some code.