Draft ProForma implementation

mobiusklein commented 3 years ago

This is a draft implementation for reading and writing the ProForma notation for modified amino acid sequences. To avoid adding dependencies, I made the additional controlled vocabularies optional: they are only needed if you try to parse a string that uses them, and they are then loaded lazily from psims.

It still needs more documentation (especially about how to interact with some of its implementation details and which feature annexes it supports) and tests. I can likely inherit several of those from https://github.com/topdownproteomics/sdk/blob/master/tests/TopDownProteomics.Tests/ProForma/ProFormaParserTests.cs.

The ProForma specification is going through review now, but there's already discussion of an update to allow multiple modifications at a single position.

mobiusklein commented 3 years ago

@levitsky This should finally be ready for review, with the fundamental functionality all in place.

This adds the parse_proforma function, which parses a string in ProForma 2.0 format into a list of peptide position tokens and a dictionary of additional modification information (unlocalized, ambiguous or labile modifications, global modification rules, and so on), and a to_proforma function, which takes that information and turns it back into a ProForma 2.0 string. It also includes a ProForma class, which layers on a little more behavior such as mass calculation, slicing, and searching for tags by ID.
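To illustrate the "token list plus properties dictionary" shape described above, here is a minimal, self-contained sketch. It is not the code in this PR: `parse_simple` and `to_simple` are hypothetical names, the property keys are assumptions chosen for the example, and the grammar is deliberately tiny (labile `{...}` prefixes, N- and C-terminal tags, and `[...]` tags on residues, with no nested brackets, ambiguity groups, or global modifications).

```python
import re

# One alternative per group, so m.lastindex tells us which kind matched:
# 1 = labile {tag}, 2 = [tag], 3 = residue letter, 4 = terminal dash.
_TOKEN = re.compile(r"\{([^}]*)\}|\[([^\]]*)\]|([A-Z])|(-)")


def parse_simple(sequence):
    """Split a simplified ProForma-like string into (tokens, properties)."""
    tokens = []  # list of (amino_acid, [tags]) pairs, one per residue
    props = {"n_term": [], "c_term": [], "labile_modifications": []}
    pending = []        # tags seen before the first residue
    after_dash = False  # True once the C-terminal dash has been seen
    pos = 0
    for m in _TOKEN.finditer(sequence):
        if m.start() != pos:  # finditer skipped a character it could not match
            raise ValueError("unexpected character at index %d" % pos)
        pos = m.end()
        kind, text = m.lastindex, m.group(m.lastindex)
        if kind == 1:
            if tokens or pending:
                raise ValueError("labile tags must come first")
            props["labile_modifications"].append(text)
        elif kind == 2:
            if after_dash:
                props["c_term"].append(text)
            elif tokens:
                tokens[-1][1].append(text)  # attach to the preceding residue
            else:
                pending.append(text)
        elif kind == 3:
            if after_dash or pending:
                raise ValueError("residue in unexpected position at %d" % m.start())
            tokens.append((text, []))
        else:
            if not tokens:  # dash before any residue closes the N-terminal tags
                if not pending:
                    raise ValueError("dash without N-terminal tag")
                props["n_term"], pending = pending, []
            elif after_dash:
                raise ValueError("more than one C-terminal dash")
            else:
                after_dash = True
    if pos != len(sequence) or pending:
        raise ValueError("could not parse the whole string")
    return tokens, props


def to_simple(tokens, n_term=(), c_term=(), labile_modifications=()):
    """Inverse of parse_simple for the same simplified grammar."""
    parts = ["{%s}" % t for t in labile_modifications]
    if n_term:
        parts.append("".join("[%s]" % t for t in n_term) + "-")
    for aa, tags in tokens:
        parts.append(aa + "".join("[%s]" % t for t in tags))
    if c_term:
        parts.append("-" + "".join("[%s]" % t for t in c_term))
    return "".join(parts)


if __name__ == "__main__":
    text = "[iTRAQ4plex]-EM[Oxidation]EVEES[UNIMOD:21]PEK"
    tokens, props = parse_simple(text)
    print(tokens[1], props["n_term"])
    print(to_simple(tokens, **props) == text)
```

Keeping the residue tokens separate from the sequence-level properties is what makes the round trip straightforward: the writer only needs the token list and the same dictionary passed back as keyword arguments.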

The non-user-facing bits include all the baroque machinery for dealing with six different modification vocabularies, a more forgiving tokenizer, and a slightly borrowed test suite.

There is still some documentation to iron out, especially which "implementation level" this counts as, since it implements everything but inter-peptide cross-linking support. Users also need to be told how to control the way additional controlled vocabularies are loaded. Right now it uses Unimod directly from pyteomics.mass.Unimod, but tries to import psims to load the rest, emitting an error message if it needs one of those databases and psims isn't installed.
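The lazy-loading behavior described above follows a common optional-dependency pattern, sketched below. This is a generic illustration rather than the PR's implementation: `LazyVocabulary` and its arguments are hypothetical, and the demo uses the standard-library `json` module with toy data as a stand-in for psims so that it runs without any extra packages.

```python
import importlib


class LazyVocabulary:
    """Defer importing a backing module until a term is first looked up."""

    def __init__(self, name, module_name, loader):
        self.name = name                # label used in error messages
        self.module_name = module_name  # optional dependency to import on demand
        self.loader = loader            # callable(module) -> mapping of terms
        self._terms = None

    @property
    def loaded(self):
        return self._terms is not None

    def _load(self):
        try:
            module = importlib.import_module(self.module_name)
        except ImportError as err:
            # Fail only when the vocabulary is actually needed, with a hint.
            raise ImportError(
                "Resolving %s terms requires the optional package %r"
                % (self.name, self.module_name)
            ) from err
        return self.loader(module)

    def __getitem__(self, key):
        if self._terms is None:  # first access triggers the import and load
            self._terms = self._load()
        return self._terms[key]


if __name__ == "__main__":
    # Toy data and a stdlib module stand in for a real vocabulary source.
    toy = LazyVocabulary(
        "toy", "json", lambda mod: mod.loads('{"TOY:1": "example term"}')
    )
    print(toy.loaded)    # nothing imported or loaded yet
    print(toy["TOY:1"])  # first lookup performs the load
    print(toy.loaded)
```

With this shape, importing the parser never fails just because an optional package is absent; the error surfaces only when a string references a vocabulary that cannot be loaded.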

mobiusklein commented 3 years ago

Thank you for catching those leftover items.

Compliance levels:

1) Base Level Support
Represents the lowest level of compliance. This level involves providing support for:

[x] Amino acid sequences
[x] Protein modifications using two of the supported CVs/ontologies: Unimod and PSI-MOD.
[x] Protein modifications using delta masses (without prefixes)
[x] N-terminal, C-terminal and labile modifications.
[x] Ambiguity in the modification position, including support for localisation scores.
[x] INFO tag.

2) Additional Separate Support
These features are independent from each other:

[x] Unusual amino acids (O and U).
[x] Ambiguous amino acids (e.g. X, B, Z). This would include support for sequence tags of known mass (using the character X).
[x] Protein modifications using delta masses (using prefixes for the different CVs/ontologies).
[x] Use of prefixes for Unimod (U:) and PSI-MOD (M:) names.
[x] Support for the joint representation of experimental data and its interpretation.

3) Top Down Extensions

[ ] Additional CV/ontologies for protein modifications: RESID (the prefix R MUST be used for RESID CV/ontology term names)
[x] Chemical formulas (this feature occurs in two places in this list).

4) Cross-Linking Extensions

[ ] Cross-linked peptides (using the XL-MOD CV/ontology, the prefix X MUST be used for XL-MOD CV/ontology term names).

5) Glycan Extensions

[x] Additional CV/ontologies for protein modifications: GNO (the prefix G MUST be used for GNO CV/ontology term names)
[x] Glycan composition.
[x] Chemical formulas (this feature occurs in two places in this list).

6) Spectral Support

[ ] Charge and chimeric spectra are special cases (see Appendix II).
[x] Global modifications (e.g., every C is C13).

levitsky commented 3 years ago

At this point I'm more than happy with the state of this PR. Please let me know if/when you think it's ready to merge.

mobiusklein commented 3 years ago

Thank you.

There's another ProForma meeting tomorrow which may or may not introduce more changes. The ambiguous sequence region feature was a late addition. We'll see if more work is needed or if there are any comments from the group.

mobiusklein commented 3 years ago

I've updated the documentation on psims to discuss the caching mechanism in a bit more detail. No new features have been added to ProForma since the last meeting, and likely the best way to get more feedback at this point is for people to try to use it. If you're satisfied with the level of documentation within the module itself, we can merge it.

levitsky / pyteomics

Draft ProForma implementation #37