levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Draft ProForma implementation #37

mobiusklein closed 3 years ago

mobiusklein commented 3 years ago

This is a draft implementation for reading and writing the ProForma notation for modified amino acid sequences. I tried to avoid adding dependencies here: additional controlled vocabularies are optional unless you parse a string that uses them, in which case they are loaded lazily from psims.

It still needs more documentation (especially about how to interact with some of its implementation details and which feature annexes it supports) and tests. I can likely inherit several of those from https://github.com/topdownproteomics/sdk/blob/master/tests/TopDownProteomics.Tests/ProForma/ProFormaParserTests.cs.

The ProForma specification is going through review now, but there's already discussion of an update to allow multiple modifications at a single position.

mobiusklein commented 3 years ago

@levitsky This should finally be ready for review, with the fundamental functionality all in place.

This adds the parse_proforma function, which parses a string in ProForma 2.0 format into a list of peptide position tokens and a dictionary of additional modification information (unlocalized, ambiguous, or labile modifications, global modification rules, and so on), and a to_proforma function, which takes that information and turns it back into a ProForma 2.0 string. It also includes a ProForma class, which layers on a little more behavior such as mass calculation, slicing, and searching for tags by ID.
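
To make the "list of peptide position tokens" idea concrete, here is a deliberately tiny sketch of the parse and format round trip. It is not the code in this PR: toy_parse and toy_format are hypothetical names, and the sketch only handles single-letter residues followed by zero or more bracketed tags, with none of the terminal, labile, global, or ambiguity syntax that ProForma 2.0 defines. Tags are kept in a list per position, on the assumption that more than one tag may follow a residue.

```python
import re

# Hypothetical helpers for illustration only; NOT the functions added by this PR.
# One residue letter followed by zero or more [tag] groups.
_TOKEN = re.compile(r"([A-Z])((?:\[[^\[\]]*\])*)")
_TAG = re.compile(r"\[([^\[\]]*)\]")


def toy_parse(sequence):
    """Split a simple ProForma-like string into (residue, tags) pairs.

    tags is a list of tag strings, or None when the residue is unmodified.
    """
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"unexpected character at position {pos}: {sequence[pos]!r}"
            )
        residue, raw_tags = match.groups()
        tags = _TAG.findall(raw_tags)
        tokens.append((residue, tags or None))
        pos = match.end()
    return tokens


def toy_format(tokens):
    """Turn the (residue, tags) pairs back into a string."""
    parts = []
    for residue, tags in tokens:
        parts.append(residue)
        for tag in tags or ():
            parts.append(f"[{tag}]")
    return "".join(parts)


tokens = toy_parse("EM[Oxidation]EVT[Phospho]SESPEK")
print(tokens[1])           # ('M', ['Oxidation'])
print(toy_format(tokens))  # EM[Oxidation]EVT[Phospho]SESPEK
```

Keeping one token per residue position is what makes slicing and position-based lookups straightforward in a wrapper class; the real module additionally has to resolve each tag against a vocabulary, which this sketch skips entirely.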

The non-user-facing bits include all the baroque machinery for dealing with six different modification vocabularies, a more forgiving tokenizer, and a slightly borrowed test suite.

There is still some documentation to iron out, especially which "implementation level" this counts as, since it implements everything except inter-peptide cross-linking support. Users also need to be told how to control the way additional controlled vocabularies are loaded. Right now it uses Unimod directly from pyteomics.mass.Unimod, but tries to import psims to load the rest, emitting an error message if it needs one of those databases and psims isn't installed.
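
The optional-dependency behavior described above follows a common lazy-import pattern, sketched below under stated assumptions. LazyVocabulary, its arguments, and the module name missing_cv_backend are all hypothetical and are not taken from this PR or from psims; the standard-library string module stands in for a real vocabulary backend so the example runs anywhere.

```python
import importlib


class LazyVocabulary:
    """Hypothetical wrapper: defer importing an optional backend until first use."""

    def __init__(self, name, module_name, loader):
        self.name = name                # human-readable vocabulary name
        self.module_name = module_name  # optional package providing the data
        self.loader = loader            # callable: imported module -> data
        self._data = None               # cached result of the first load

    def load(self):
        # Import only when a term from this vocabulary is actually needed.
        if self._data is None:
            try:
                module = importlib.import_module(self.module_name)
            except ImportError as err:
                raise ImportError(
                    f"Resolving {self.name} terms requires the optional "
                    f"package {self.module_name!r}, which is not installed"
                ) from err
            self._data = self.loader(module)
        return self._data


# Stand-in backend: the stdlib 'string' module plays the role of a CV source.
demo = LazyVocabulary("demo", "string", lambda m: set(m.ascii_uppercase))
print("A" in demo.load())           # True
print(demo.load() is demo.load())   # True, the loaded data is cached

# Constructing the wrapper for a missing backend is harmless...
missing = LazyVocabulary("PSI-MOD", "missing_cv_backend", lambda m: m)
try:
    missing.load()                  # ...the error only appears on first use.
except ImportError as err:
    print(err)
```

The point of the design is that strings using only the built-in vocabulary never trigger the import, so the extra package stays optional until a string actually references one of the other databases.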

mobiusklein commented 3 years ago

Thank you for catching those leftover items.

Compliance levels:

1) Base Level Support
Represents the lowest level of compliance. This level involves providing support for:

2) Additional Separate Support
These features are independent from each other:

3) Top Down Extensions

4) Cross-Linking Extensions

5) Glycan Extensions

6) Spectral Support

levitsky commented 3 years ago

At this point I'm more than happy with the state of this PR. Please let me know if/when you think it's ready to merge.

mobiusklein commented 3 years ago

Thank you.

There's another ProForma meeting tomorrow which may or may not introduce more changes. The ambiguous sequence region feature was a late addition. We'll see if more work is needed or if there are any comments from the group.

mobiusklein commented 3 years ago

I've updated the documentation on psims to discuss the caching mechanism in a bit more detail. No new features have been added to ProForma since the last meeting, and likely the best way to get more feedback at this point is for people to try to use it. If you're satisfied with the level of documentation within the module itself, we can merge it.