levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

[proforma] Modifications cannot be pickled after being resolved #143

Closed RalfG closed 3 months ago

RalfG commented 3 months ago

Hi @mobiusklein,

We have been using the ProForma module heavily in most of our tools now, so first: A big thanks for all your work and continuous support on it!

Since Pyteomics 4.7, our tools that use multiprocessing on ProForma peptides have been throwing pickling errors (e.g., compomics/ms2rescore#128). I traced the issue to a recent addition of the Resolver to the modification itself in the _definition["source"] attribute. As it contains an sqlalchemy session, it cannot be pickled, and modifications cannot be passed on through multiprocessed functions.

Pickling of an unresolved modification works:

>>> import pickle
>>> from pyteomics.proforma import process_tag_tokens
>>>
>>> mod = process_tag_tokens("Oxidation")
>>>
>>> with open("test.pickle", "wb") as f:
>>>     pickle.dump(mod, f)

Once forcing a resolve by, for instance, getting the mass, it raises a PicklingError:

>>> mod.mass
15.994915
>>> 
>>> with open("test.pickle", "wb") as f:
>>>     pickle.dump(mod, f)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
Cell In[7], [line 2]
      [1] with open("test.pickle", "wb") as f:
----> [2]     pickle.dump(mod, f)

PicklingError: Can't pickle <class 'sqlalchemy.orm.session.Session'>: it's not the same object as sqlalchemy.orm.session.Session

The source entry in the _definition attribute was added in https://github.com/levitsky/pyteomics/commit/564d79f72cbe816963baec2b987e45d974d473cf. Is it required anywhere? If not, perhaps simply removing the field would fix the issue.

Let me know what you think.

Best, Ralf

mobiusklein commented 3 months ago

Yes, removing the source key from the pickled state is a reasonable choice here. It's not currently being used so far as I can tell, so it shouldn't break anything provided by pyteomics.

The other option is to just omit _definition entirely from the pickled state, but that has tradeoffs, the pros being its smaller, not tied to anything that isn't textual, but still enough to resolve from on load, but the con is that if you pickle something, and then some time later the remote source changes, what you'd get when you unpickled may no longer match what you saved.

RalfG commented 3 months ago

Thanks for looking into this!

the con is that if you pickle something, and then some time later the remote source changes, what you'd get when you unpickled may no longer match what you saved.

Pickling should be mostly regarded as a suboptimal saving solution anyway. In our case, we only need it for multiprocessing purposes, so this should be fine.