Closed mobiusklein closed 3 years ago
@levitsky This should finally be ready for review, with the fundamental functionality all in place.
This adds the parse_proforma function, which parses a string in ProForma 2.0 format into a list of peptide position tokens and a dictionary of additional modification information (unlocalized, ambiguous or labile modifications, global modification rules, and so on), and a to_proforma function, which takes that information and turns it back into a ProForma 2.0 string. It also includes a ProForma class, which layers on a little more behavior like mass calculation, slicing, and searching for tags by ID.
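To make the parse/serialize round trip concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: toy_parse and toy_format are hypothetical names, and the toy grammar only handles one-letter residues followed by bracketed tags, ignoring terminal, labile, unlocalized, ambiguous, and global modifications entirely.

```python
import re

# Toy grammar (an assumption for illustration, far smaller than ProForma 2.0):
# one uppercase residue letter followed by zero or more [tag] groups.
_TOKEN = re.compile(r"([A-Z])((?:\[[^\[\]]*\])*)")
_TAG = re.compile(r"\[([^\[\]]*)\]")


def toy_parse(sequence):
    """Split a simple ProForma-like string into (residue, [tags]) pairs."""
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {sequence[pos]!r} at index {pos}"
            )
        residue, raw_tags = match.groups()
        # findall returns [] when the residue carries no tags.
        tokens.append((residue, _TAG.findall(raw_tags)))
        pos = match.end()
    return tokens


def toy_format(tokens):
    """Serialize (residue, [tags]) pairs back into the same notation."""
    return "".join(
        residue + "".join(f"[{tag}]" for tag in tags)
        for residue, tags in tokens
    )


tokens = toy_parse("EM[Oxidation]EVEES[Phospho]PEK")
assert tokens[1] == ("M", ["Oxidation"])
assert toy_format(tokens) == "EM[Oxidation]EVEES[Phospho]PEK"
```

Keeping the tags as a list per position means the same structure could also hold more than one tag on a residue without changing the token shape.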
The non-user-facing bits include all the baroque machinery for dealing with six different modification vocabularies, a more forgiving tokenizer, and a slightly borrowed test suite.
There is still some documentation to iron out, especially which "implementation level" this counts as, since it implements everything except inter-peptide cross-linking support. Users also need to be told how to control the loading of additional controlled vocabularies. Right now it uses Unimod directly from pyteomics.mass.Unimod, but tries to import psims to load the rest, emitting an error message if it needs one of those databases and psims isn't installed.
Thank you for catching those leftover items.
Compliance levels:
1) Base Level Support
Represents the lowest level of compliance; this level involves providing support for:
2) Additional Separate Support
These features are independent from each other:
3) Top Down Extensions
4) Cross-Linking Extensions
5) Glycan Extensions
6) Spectral Support
At this point I'm more than happy with the state of this PR. Please let me know if/when you think it's ready to merge.
Thank you.
There's another ProForma meeting tomorrow which may or may not introduce more changes. The ambiguous sequence region feature was a late addition. We'll see if more work is needed or if there are any comments from the group.
I've updated the documentation on psims to discuss the caching mechanism in a bit more detail. No new features have been added to ProForma since the last meeting, and likely the best way to get more feedback at this point is for people to try to use it. If you're satisfied with the level of documentation within the module itself, we can merge it.
This is a draft implementation for reading and writing the ProForma notation for modified amino acid sequences. I worked to avoid adding dependencies here by making additional controlled vocabularies optional unless you try to parse a string that uses them, in which case they are loaded lazily from psims.
It still needs more documentation (especially about how to interact with some of its implementation details and which feature annexes it supports) and tests. I can likely inherit several of those from https://github.com/topdownproteomics/sdk/blob/master/tests/TopDownProteomics.Tests/ProForma/ProFormaParserTests.cs.
The ProForma specification is going through review now, but there's already discussion of an update to allow multiple modifications at a single position.