Add Sequence Ambiguity and Proteoform Classification

zrolfs commented 3 years ago

I wrote code for classifying proteoform identifications based on the following publication: Smith, L.M., Thomas, P.M., Shortreed, M.R. et al. A five-level classification system for proteoform identifications. Nat Methods 16, 939–940 (2019). https://doi.org/10.1038/s41592-019-0573-x

As part of this, I implemented notation for sequence ambiguity using "(?". Example: PROTEI(?N) or (?PR)OTEIN.
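
To make the notation concrete, here is a minimal Python sketch (not the SDK's C# code) of how a "(?...)" region could be pulled out of a sequence string. The helper name `split_ambiguity` and the regular expression are illustrative assumptions; the sketch handles a single region and no modification tags.

```python
import re

# Hypothetical helper, not part of the SDK or mzLib.
# Matches one "(?...)" region made only of uppercase residue letters.
AMBIGUITY = re.compile(r"\(\?([A-Z]+)\)")

def split_ambiguity(sequence):
    """Return (plain_sequence, start, end) for the first "(?...)" region.

    start and end are 0-based, end-exclusive indices into plain_sequence.
    Returns (sequence, None, None) when no region is present.
    """
    match = AMBIGUITY.search(sequence)
    if match is None:
        return sequence, None, None
    residues = match.group(1)
    # Drop the "(?" and ")" markers but keep the residues in place.
    plain = sequence[:match.start()] + residues + sequence[match.end():]
    start = match.start()
    return plain, start, start + len(residues)

print(split_ambiguity("PROTEI(?N)"))   # ('PROTEIN', 6, 7)
print(split_ambiguity("(?PR)OTEIN"))   # ('PROTEIN', 0, 2)
```

Returning indices into the plain sequence keeps the ambiguity information separate from the residues, so downstream code can still compute masses on an ordinary string.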

zrolfs commented 3 years ago

This code is also present in mzLib.

rfellers commented 3 years ago

Hi @zrolfs, sorry for the delay and thanks for the PR! I looked through the code and will definitely accept it as a starting point for this classification logic. Since you wrote this in 2019, we have been working on ProForma 2.0 (which this SDK now implements). In this version, ambiguity is a first-class citizen, so your pipe-separated notation is (likely) no longer required. At some point, I'd like to change the public API on your class, use the ProFormaParser directly, and update the unit test to use ProForma 2.0. Are you interested in collaborating on this work? No pressure: I can assign it internally or we can exchange some PRs if you're in.

zrolfs commented 3 years ago

Hi @rfellers! I'm super interested in collaborating. I was planning to make the classifier method parse ProForma 2.0, but hadn't gotten around to it. I didn't realize it was already implemented in this SDK. I'll take a shot at updating the classifier with the ProFormaParser.

zrolfs commented 3 years ago

I might be wrong, but it looks like ProForma doesn't currently include information about ambiguous amino acid sequences (except for B/J/X/Z) or genes.

For example, if you had an identification that could be either PROTEOFORM or RPOTEOFORM, there's not a clear way to present that information. I saw that chimeric spectra are annotated with "+" (e.g. EMEVEESPEK/2+ELVISLIVER/3), but I don't think that's quite the same because the chimeric annotation implies that both sequences were confidently identified rather than ambiguity in a single sequence. Has the ProForma team thought about accounting for these differences, or is the current solution to simply provide multiple ProForma strings with a delimiter? The gene(s) can just be provided as a separate list.

rfellers commented 3 years ago

You are correct, ProForma v2 doesn't currently provide a syntax to express sequence ambiguity. But, I think ProForma should have something like this! In the past, we'd fought against including SNP notations to change the sequence (because it is inherently biological and changes the mass of the proteoform), but your use case is observation based and keeps the mass equal. Also, I really think logic like yours should take a single ProForma + proposed gene assignment(s) and produce a proteoform level. Adding support for level 2C would get us there.

We are still finalizing the format and it might be open to slight revisions. Do you have any notations to propose? Off the top of my head, something like (PR)?OTEOFORM might suffice, but I'd need to review all the existing ambiguity notations to see if that would be confusing.
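
The idea of taking a single ProForma string plus proposed gene assignment(s) and producing a level could be sketched roughly as follows. This is a Python illustration, not the SDK's API: the function `classify` and its four boolean inputs are assumptions, and the labels follow one reading of Smith et al. (2019), in which level 1 has no ambiguity, levels 2A to 2D each have exactly one source (PTM localization, PTM identification, sequence, gene), and levels 3 to 5 have two, three, and four sources.

```python
def classify(ptm_localization_ambiguous, ptm_identity_ambiguous,
             sequence_ambiguous, gene_ambiguous):
    """Map four ambiguity flags to a level label (hypothetical helper)."""
    flags = [bool(ptm_localization_ambiguous), bool(ptm_identity_ambiguous),
             bool(sequence_ambiguous), bool(gene_ambiguous)]
    count = sum(flags)
    if count == 0:
        return "1"
    if count == 1:
        # Exactly one source of ambiguity: the letter names which one.
        return "2" + "ABCD"[flags.index(True)]
    # Two, three, or four sources map to levels 3, 4, and 5.
    return str(count + 1)

# Assumed inputs: a sequence using the "(?" notation and a gene list.
proforma = "(?PR)OTEOFORM"
genes = ["GENE1"]
level = classify(
    ptm_localization_ambiguous=False,
    ptm_identity_ambiguous=False,
    sequence_ambiguous="(?" in proforma,
    gene_ambiguous=len(genes) != 1,
)
print(level)  # 2C
```

Detecting the two PTM-related flags from a real ProForma string is deliberately left out here, since that depends on the parser's object model.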

zrolfs commented 3 years ago

The SNP context helps. I didn't quite understand the logic because many PTMs are biologically relevant and change the mass of the proteoform. However, I can see an argument to exclude SNPs because the SNP information is already presented as a different amino acid in the ProForma sequence.

My use case is certainly observation based. ProForma v1 was specifically designed for fully characterized proteoforms and works well for curating proteoform databases/atlases. The current ProForma v2 is moving out of that fully characterized requirement and trying to be more observation based (as evidenced by the ability to annotate unlocalized PTMs). I think this is great because an observation-based approach will expand the use of ProForma to proteoform/peptidoform identification results and allow for a unified proteoform notation across proteoform identification programs. Notation to account for sequence ambiguity is thus important not only for proteoform classification, but also because it will allow ProForma to be used in observational settings like TDPortal, MetaMorpheus, TopPIC, etc.

(PR)?OTEOFORM would work for permutations, but there are lots of isobaric amino acid differences that would be more challenging with that notation. Some examples below:

My initial thoughts are that a delimiter would be the easiest way to notate sequence ambiguity, although it's not very elegant and can somewhat reduce human readability. Hopefully there's not too much pushback, since delimiting was deemed acceptable for chimeric spectra. Actually, the chimeric spectra functionality is kind of odd to me. In MetaMorpheus, we handle chimeras by displaying multiple different rows for each unique species. Each row has the same scan number but different masses/charges/sequences/genes/scores/q-values/etc. I'm struggling to think of a use case where the current ProForma chimeric notation would be beneficial.

If everyone is agreeable to delimiting sequence ambiguity, we could use ';', or the '+' that's currently used for chimeric spectra.

Notation examples from the above image:
ATGSPNTLFQR+TAGSPNTLFQR
TAGSPNTLM[Ox]KR+TAGSPNTLFQR
TQSPNTLFQR+TAGSPNTLFQR
TAGSPGGTLFQR+TAGSPNTLFQR
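
To see what the delimited form encodes, a short Python sketch can split the candidates on "+" and isolate the stretch where they differ. The helper `differing_region` is hypothetical and works character by character, so it assumes the shared prefix and suffix never end inside a bracketed tag.

```python
def differing_region(a, b):
    """Return (prefix, middle_a, middle_b, suffix) for two candidate strings."""
    limit = min(len(a), len(b))
    # Longest common prefix.
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1
    # Longest common suffix that does not overlap the prefix.
    s = 0
    while s < limit - p and a[-1 - s] == b[-1 - s]:
        s += 1
    return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]

examples = [
    "ATGSPNTLFQR+TAGSPNTLFQR",
    "TAGSPNTLM[Ox]KR+TAGSPNTLFQR",
    "TQSPNTLFQR+TAGSPNTLFQR",
    "TAGSPGGTLFQR+TAGSPNTLFQR",
]
for entry in examples:
    first, second = entry.split("+")
    print(differing_region(first, second))
```

For the second example this yields the prefix TAGSPNTL, the alternatives M[Ox]K and FQ, and the suffix R, which is the information a more compact notation would need to carry.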

rfellers commented 3 years ago

I just had a call with the larger ProForma working group and I brought up this use case for discussion. Given that we are late in this process and given the potential complexity, we won't be able to fully support all the examples you laid out. But there is interest from @edeutsch and others in handling some of the most common cases to improve the look of the annotation. For example, he already has a simple notation that he uses to handle AA permutation.

Generically, it is always possible to use the X amino acid and represent these ambiguities with mass shifts, e.g. X[+253.15]OTEIN. To add some context, it is also possible to use an info tag like this: X[+253.15|INFO:(PR) or (RP)]OTEIN. However, I will admit that this isn't the prettiest (or tersest) way to do this.
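
For reference, the +253.15 shift is the summed monoisotopic residue mass of P and R. A small Python check, with the mass table limited to the two residues needed and the helper name `region_mass` chosen only for illustration:

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {"P": 97.05276, "R": 156.10111}

def region_mass(residues):
    """Sum the residue masses of an ambiguous stretch (hypothetical helper)."""
    return sum(RESIDUE_MASS[r] for r in residues)

shift = region_mass("PR")            # same value for "RP"
print(f"X[+{shift:.2f}]OTEIN")       # X[+253.15]OTEIN
```

Because the sum does not depend on residue order, the same mass shift covers both (PR) and (RP), which is why the X notation can stand in for a permutation.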

I'd suggest that we come up with a notation to handle permutation and maybe N|GG because it is so common. But, as I personally don't have a vested interest in this particular use case, I'll leave it up to you and Eric (and maybe others) to work through.

zrolfs commented 3 years ago

Alrighty! I've added sequence ambiguity to ProForma and implemented proteoform classification for ProForma.

Sequence ambiguity notation examples for the above examples:
(?TA)GSPNTLFQR
TAGSPNTL(?FQ)R
T(?Q)SPNTLFQR
TAGSP(?N)TLFQR

Also acceptable notations:
(?AT)GSPNTLFQR
TAGSPNTL(?MK)[Ox]R
T(?AG)SPNTLFQR
TAGSP(?GG)TLFQR
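
One way to sanity-check that two alternatives for a region are interchangeable is to compare their elemental compositions, since identical compositions imply identical masses. The Python sketch below is an illustration under that assumption; the table covers only the unmodified residues used in these examples, and the FQ / M[Ox]K pair is left out because it needs modification handling.

```python
from collections import Counter

# Elemental composition of each amino acid residue (water already removed).
COMPOSITION = {
    "G": Counter(C=2, H=3, N=1, O=1),
    "A": Counter(C=3, H=5, N=1, O=1),
    "T": Counter(C=4, H=7, N=1, O=2),
    "N": Counter(C=4, H=6, N=2, O=2),
    "Q": Counter(C=5, H=8, N=2, O=2),
}

def composition(residues):
    """Total elemental composition of a stretch of residues."""
    total = Counter()
    for residue in residues:
        total += COMPOSITION[residue]
    return total

def same_composition(a, b):
    """True when two residue stretches have identical elemental formulas."""
    return composition(a) == composition(b)

for first, second in [("TA", "AT"), ("Q", "AG"), ("N", "GG")]:
    print(first, second, same_composition(first, second))
```

Comparing formulas rather than floating-point masses avoids having to pick a mass tolerance for these exact cases.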

zrolfs commented 3 years ago

I also fixed the group tag notation so that the descriptor value is on the preferred localization. It was previously on the first possible localization of the group, rather than the preferred. Sorry there's a lot in this PR! I can break it up into more manageable chunks if you'd like.

rfellers commented 3 years ago

Had a call yesterday with the larger group and everyone was impressed that we had a PR for ambiguous amino acids before the format was settled. So thanks for that! Good news is that the format has not changed, so we should be good.

We can do this in one big PR, it will just take a bit longer. There are a couple points I think we should discuss and I'd like to do it with some unit tests. Would you mind if I add some unit tests on your branch? I think I might have rights given it's part of a PR ...

The main thing is around modifications: I would prefer TAGSPNTL(?M[Ox]K)R over TAGSPNTL(?MK)[Ox]R. I read everything inside the parentheses as part of the ambiguity, and things on the outside as definitely present. So the first version means M[Ox]K can be switched with FQ, and the second version means that MK is oxidized, but the order of amino acids is unknown. Actually, this second version is flimsy and I can't think of a good example of when you'd use it. Maybe a case like TAGSPNTL(?MA)[Ox]R where it is equivalent to TAGSPNTL(?CV)[Ox]R and the oxidation is known to always exist. Thoughts?
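
The two readings can be told apart mechanically. The following Python sketch is not the SDK parser; the regular expression and the helper `describe_region` are assumptions that only separate tags written inside the parentheses from tags attached directly after the closing parenthesis.

```python
import re

# "(?" + residues, each optionally followed by one [tag], + ")" + trailing [tags].
REGION = re.compile(r"\(\?((?:[A-Z](?:\[[^\]]*\])?)+)\)((?:\[[^\]]*\])*)")
TAG = re.compile(r"\[([^\]]*)\]")

def describe_region(sequence):
    """Return (residues, inner_tags, outer_tags) for the first region, or None."""
    match = REGION.search(sequence)
    if match is None:
        return None
    inner, trailing = match.group(1), match.group(2)
    residues = TAG.sub("", inner)         # residues with tags stripped
    inner_tags = TAG.findall(inner)       # tags that are part of the ambiguity
    outer_tags = TAG.findall(trailing)    # tags applied to the region as a whole
    return residues, inner_tags, outer_tags

print(describe_region("TAGSPNTL(?M[Ox]K)R"))  # ('MK', ['Ox'], [])
print(describe_region("TAGSPNTL(?MK)[Ox]R"))  # ('MK', [], ['Ox'])
```

Under the interpretation described above, a tag in `inner_tags` belongs to the ambiguous alternative, while a tag in `outer_tags` is asserted to be present somewhere in the region regardless of residue order.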

Thanks again for all your work on this!

zrolfs commented 3 years ago

Yay! Feel free to add additional unit tests to this PR. I agree with TAGSPNTL(?M[Ox]K)R. That makes sense to me. I was thinking of a situation with two modifiable residues, like TAGSPNTL(?M[Ox]MK), but even then it still makes sense to have the PTM localized within the sequence ambiguity. TAGSPNTL(?M[Ox]K)R can be read in this PR, but it's not written correctly, so I'll need to make a commit to fix that. Next week is looking busy, but I'll probably have time the week after. You're also welcome to patch it yourself if there's a time crunch.

edeutsch commented 3 years ago

> Maybe a case like TAGSPNTL(?MA)[Ox]R where it is equivalent to TAGSPNTL(?CV)[Ox]R and the oxidation is known to always exist. Thoughts?

Although this is a special case (perhaps the exception that proves the rule), it turns out that M[Ox] (and only Ox of M) has a highly specific and prominent neutral loss of 64 Da, and so it would be possible to know that there is an M[Ox] somewhere in an ambiguous region due to this indicative neutral loss. But I think M[Ox] is nearly the only case where this occurs. I don't think we need to cater to this, but it is there. I would suggest NOT using M[Ox] in examples of this because it brings to mind a weird exception.

acesnik commented 1 year ago

@rfellers, close in favor of https://github.com/topdownproteomics/sdk/pull/105?

rfellers commented 1 year ago

closing in favor of #105

topdownproteomics / sdk

Add Sequence Ambiguity and Proteoform Classification #84