levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

Parse Annotated MGFs #12

Closed jmueller95 closed 3 years ago

jmueller95 commented 3 years ago

This PR extends the MGF module so that MGFs with annotated mass spectra can be parsed (for an example, see test_annotated.mgf ). These files are e.g. output by the MS/MS prediction tool pDeep (https://github.com/pFindStudio/pDeep). The new feature can be switched on by setting read_ions=True in the mgf.read() routine.

levitsky commented 3 years ago

Thank you for this very nice contribution! I'll be happy to merge it after some minor wrinkles are ironed out, which I am happy to assist with. I have pushed a commit where I addressed most of the things I noticed, the main one being that mgf.write wasn't quite working with ions. I also added a test for that, and changed a print() call to warn() in _parse_ion. Please take a look and let me know if you're OK with these changes.

Apart from that, I have just a couple of questions left:

  1. As I'm not familiar with pDeep, are those files guaranteed to have all of the fragments annotated like that? (Are these predicted spectra or experimental spectra with annotations?). If some annotations can be missing, we could use masked arrays the same ways as with charges. (Nevermind, I see that pDeep is a prediction tool, so the answer is yes and masked arrays are not needed. Unless you know of any other uses for this format where the annotations may be sparse?)
  2. I'm curious about the change you made in pyteomics/auxiliary/file_helpers.py (explicitly converting the element ID to str). Why was it necessary? It's kind of at the heart of all the indexed parser functionality so I'm extra cautious about changing anything there.
mobiusklein commented 3 years ago

This is probably minor, but the changes do break previously pickled objects since it changes the signature of __init__. To preserve this, just move the read_ions parameter to the end of the parameter list of __init__ and get/set it in the __getstate__/__setstate__ methods instead of in __reduce_ex__. These objects probably shouldn't be serialized for long-term storage though.

levitsky commented 3 years ago

Thank you for the comment @mobiusklein. I'm not sure I like the idea of permanently changing the code for a one-time reason, so it got me thinking if I could write a one-off script to convert the pickled data based on the latest version of the code: basically copy the old definitions into the script, patch them onto the actual module, read in the data, then dump it to a new file. I figured I could set self._read_ions = False in the patched class's __init__ and then change self.__class__ to mgf.IndexedMGF (i.e. new class) in __reduce__ex__. Do you think it's a viable approach?

I'm not sure what to do about plain MGF though. Upon a closer look it doesn't appear to be picklable in any practical sense anyway.

Here's a version that only deals with IndexedMGF in pickled data: https://pastebin.com/s0D40Na3

jmueller95 commented 3 years ago

Thank you very much for your comments! I implemented the changes you suggested, including those by @mobiusklein (you can decide if you want to keep them). As you already figured @levitsky , pDeep has annotations for all its peaks, so masked arrays aren't necessary - I haven't seen any files where the annotations were only present partially. Ah, and the change in pyteomics/auxiliary/file_helpers.py turns out not to be necessary anymore, so I reverted it to the original version. Thanks for pointing that out!

mobiusklein commented 3 years ago

Thank you, that fixes my concern.

levitsky commented 3 years ago

OK, thank you all! I've reverted another addition of str() conversion in get_by_id, this one inside mgf.py. Hopefully it's not going to be a problem; at least in the tests, it is not needed.

jmueller95 commented 3 years ago

Thanks, no this is not a problem for my use case either.