topdownproteomics / sdk

Software solution for common top-down proteomics tasks
http://www.topdownproteomics.org/
MIT License

Ambiguity of PTM localization #17

Open acesnik opened 6 years ago

acesnik commented 6 years ago

This issue was left unanswered in the standard. Let's come up with a proposal, implement it, and see if the committee for the standard has any issues with it.

acesnik commented 6 years ago

Overview

We should build a way to specify both kinds of ambiguity: 1) groups of amino acids, like all T, S, and Y in a sequence, and 2) regions of ambiguity, like an unidentified mass along a fragment or along a whole sequence.

Specifying indices of other sites is helpful for computers, but it's terrible for human readability. I don't want to have to count 200 amino acids to the right of a site to find the next possible site and so on.

Proposal

We have the info key as a catchall for random other things we want to build in. I suggest we use that to make these groupings.

1) PROT[Phospho|info:group:1]EOFORMS[info:group:1]

In this example, we have a phospho that might be at one of two locations. Within the value of the info tags, we use the subkey group to denote the ambiguity group, followed by a unique (sub)value for the group. Each subsequent location has only the unique group information within the info tag.

Note that Phospho is only specified at the first mention of this group. It would get very confusing if it were specified at every one, since it would look like there was a phospho identified at each of those sites.
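
To make the grouping idea concrete, here is a minimal Python sketch of how a reader might collect the candidate sites for each group. It is only an illustration of the `info:group:N` proposal above, not SDK code: the `TOKEN` pattern and the `group_sites` function are hypothetical names, and the sketch assumes single-letter residues, one tag per residue, and that each tag belongs to the amino acid on its left.

```python
import re
from collections import defaultdict

# Assumed token shape: one residue letter, optionally followed by a [tag].
TOKEN = re.compile(r"([A-Z])(?:\[([^\]]*)\])?")

def group_sites(proforma):
    """Map each ambiguity group to its modification and candidate sites."""
    groups = defaultdict(lambda: {"mod": None, "sites": []})
    for position, match in enumerate(TOKEN.finditer(proforma), start=1):
        residue, tag = match.groups()
        if not tag:
            continue
        descriptors = tag.split("|")
        group_ids = [d.split(":")[2] for d in descriptors
                     if d.startswith("info:group:")]
        # Anything that is not an info descriptor is treated as the modification.
        mods = [d for d in descriptors if not d.startswith("info:")]
        for group_id in group_ids:
            groups[group_id]["sites"].append((position, residue))
            # The modification is named only at the first mention of the group.
            if mods and groups[group_id]["mod"] is None:
                groups[group_id]["mod"] = mods[0]
    return dict(groups)

print(group_sites("PROT[Phospho|info:group:1]EOFORMS[info:group:1]"))
```

For the example string, group `1` ends up with the modification `Phospho` and two candidate sites, T at position 4 and S at position 11.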

2) PROT[mass:19|info:start:1]EOFORMS[info:end:1]

In this example, we have an unidentified mass that was found along a fragment of the sequence. The unidentified mass (+19 Da in this case) is specified at the start of this region. The start of the region is specified within an info descriptor, with a start subkey and a unique (sub)value for the region. The end of the region is specified within an info descriptor, too, with an end subkey and the same unique (sub)value for the region.

rfellers commented 6 years ago

I like this in general and I think it is very clean, but I had imagined going further with ambiguity in ProForma v2. What you proposed is still valid in the current standard. I was thinking of it as the addition of specific keys or special characters. For example, how about something like this (using your examples above):

  1. PROT[Phospho|#test123]EOFORMS[#test123]
  2. PROT[mass:19|A->]EOFORMS[<-A]

You get the idea, hashtags for grouping and arrows for ranges. I'm also toying around with the idea of using a ? character to specifically denote ambiguity. My thought is that this could help downstream consumers to more easily determine what is fully characterized and what isn't.

trishorts commented 6 years ago

This seems really good for "nesting" unknown mass shifts. For example:

PRO[mass:19|A->]TEO[mass:99|B->]FO[<-A]RMS[<-B]

read as mass 19 in sequence region TEOFO and mass 99 in sequence FORMS
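
A rough Python sketch of that reading, in case it helps the discussion. It assumes the arrow syntax exactly as written in this comment (`ID->` opens a range, `<-ID` closes it) and takes the range to be the residues strictly after the opening tag up to and including the residue carrying the closing tag. `TOKEN` and `arrow_ranges` are made-up names for illustration only.

```python
import re

# Assumed token shape: one residue letter, optionally followed by a [tag].
TOKEN = re.compile(r"([A-Z])(?:\[([^\]]*)\])?")

def arrow_ranges(proforma):
    """Return {range_id: (mass, residues between the opening and closing tag)}."""
    residues, open_ranges, result = [], {}, {}
    for match in TOKEN.finditer(proforma):
        residue, tag = match.groups()
        residues.append(residue)
        if not tag:
            continue
        descriptors = tag.split("|")
        for d in descriptors:
            if d.endswith("->"):
                # The mass shift is given in the same tag that opens the range.
                mass = next((float(x[len("mass:"):]) for x in descriptors
                             if x.startswith("mass:")), None)
                open_ranges[d[:-2]] = (len(residues), mass)
            elif d.startswith("<-"):
                # Raises KeyError if a range is closed without being opened.
                start, mass = open_ranges.pop(d[2:])
                result[d[2:]] = (mass, "".join(residues[start:]))
    return result

print(arrow_ranges("PRO[mass:19|A->]TEO[mass:99|B->]FO[<-A]RMS[<-B]"))
```

Because each range is tracked by its own identifier, ranges A and B can overlap without confusing the parser.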

acesnik commented 6 years ago

I like the idea of special characters to denote ambiguity and the #, ->, and <- for doing so. We would need to examine the collision with how we're specifying modifications without keys (for human readability; Unimod Interim name by default).

Would you be opposed to putting these codes into an info descriptor to not break ProForma v1?

  1. PROT[Phospho|info:#test123]EOFORMS[info:#test123]
  2. PROT[mass:19|info:A->]EOFORMS[info:<-A]

We could also propose making new keys:

  1. PROT[Phospho|#:test123]EOFORMS[#:test123]
  2. PROT[mass:19|->:A]EOFORMS[<-:A]

acesnik commented 6 years ago

We could also use the vi symbols for beginning and end, i.e. ^ and $:

  1. PROT[Phospho|info:#test123]EOFORMS[info:#test123]
  2. PROT[mass:19|info:^A]EOFORMS[info:$A]

I also want to note how using just '>' and '<' instead of arrows might look:

  1. PROT[Phospho|info:#test123]EOFORMS[info:#test123]
  2. PROT[mass:19|info:A>]EOFORMS[info:<A]

rfellers commented 6 years ago

I like this discussion, should be great for the next meeting.

For the groups, I feel like the hash symbol needs to be directly next to the group name to make it look like a hash tag that people expect. So I like [info:#test123] over the new key approach.

Re: the ranges, I like almost everything proposed except the vi stuff. Wouldn't help readability for most people IMO (but hey, I'm an emacs guy). I find the arrow a bit more readable than just the less than/greater than symbols, but I don't feel that strongly.

trishorts commented 6 years ago

I like the arrow. It's better than a bare ">" because that is too close to a FASTA header for my liking (e.g. ">sp|P17677|NEU...").

veitveit commented 6 years ago

Yep, very nice discussion!

I have a few comments:

a) It might be good to avoid using too many special characters, not only because they are already reserved for special purposes (e.g. # for comments), but also because special characters can be problematic when the proteoform notation is used in a document (paper, report, ...). For instance, think about submitting an abstract to a conference. These webpages can be quite demanding.

b) Groups: I suggest we reserve a specific key for them, like "group:", instead of adding them to the info tag. Something like [group:A].

c) Groups: Ambiguous sites often get assigned a probability or score. What about something like [group:A:95]?

d) Ranges: I am on Ryan's side (emacs) but would avoid "<" and ">". What about PROT[mass:19|range:A-]EOFORMS[range:-A]?


acesnik commented 6 years ago

I feel like statistical measures (probability score, localization scores, FDRs) are often particular to certain software. I suggest we place those in info tags.

acesnik commented 6 years ago

If we are willing to make new keys, we could use group:, start: and end:, too. That would eliminate special characters if that's helpful.

Here's an example, though I see Shortreed's point that arrows make this easier to scan quickly:

PRO[mass:19|start:A]TEO[mass:99|start:B]FO[end:A]RMS[end:B]

acesnik commented 6 years ago

Regarding ">" and "<", I don't think we need to be too concerned with them in descriptors. There are already plenty of Unimod entries with these characters.

veitveit commented 6 years ago

I also thought about the statistical measures such as the localization score. If we use hashes, then the hash plus the site name (i.e. #mod) could become the key and the score the value. This is a bit crazy but why not?

PROT[Phospho|#mod:20]EOFORMS[Phospho|#mod:80]

acesnik commented 6 years ago

I like that idea, @veitveit. The default with no score (no value) could be assumed to mean equal scores.
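
Here is a small Python sketch of how that default might behave, purely as an illustration. It assumes the `#name:score` form from the comment above, treats a bare `#name` as "no score given", and splits the weight equally only when no site in the group carries a score. `TOKEN` and `site_scores` are hypothetical names, not part of the SDK.

```python
import re

# Assumed token shape: one residue letter, optionally followed by a [tag].
TOKEN = re.compile(r"([A-Z])(?:\[([^\]]*)\])?")

def site_scores(proforma):
    """Return {group: {position: score}}; unscored groups share weight equally."""
    groups = {}
    for position, match in enumerate(TOKEN.finditer(proforma), start=1):
        tag = match.group(2)
        if not tag:
            continue
        for descriptor in tag.split("|"):
            if descriptor.startswith("#"):
                name, _, value = descriptor[1:].partition(":")
                score = float(value) if value else None
                groups.setdefault(name, {})[position] = score
    for sites in groups.values():
        # Apply the proposed default only when no site in the group has a score.
        if all(score is None for score in sites.values()):
            equal = 1.0 / len(sites)
            for position in sites:
                sites[position] = equal
    return groups

print(site_scores("PROT[Phospho|#mod:20]EOFORMS[Phospho|#mod:80]"))
print(site_scores("PROT[Phospho|#mod]EOFORMS[Phospho|#mod]"))
```

The scores are kept as plain numbers; what they mean (probability, percentage, count) is deliberately left out of this sketch.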

How would we specify what kind of score they are? Info tags? I imagine we could have probabilities be the default, but there are many scores out there that could be applied, like counts of observations.

veitveit commented 6 years ago

If we assume that there is only one type of score, then it would not matter in the end whether the score gives probabilities, expectation values, or other measures. I would just say that it should be a number.

acesnik commented 6 years ago

That seems reasonable. I suppose we could require a header in the file format with a note on which score was used.

## fileformat=ProForma2.0
## scoretype=percentage
> header
PROT[Phospho|#mod:20]EOFORMS[Phospho|#mod:80]
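
A reader for a file like this could be sketched in Python as follows. The layout is only the one suggested in the example above (`## key=value` metadata lines, a `>` name line, then the sequence), and `read_proforma_file` is a hypothetical name; none of this is an agreed file format.

```python
def read_proforma_file(text):
    """Split '## key=value' metadata lines from the named entries that follow."""
    metadata, entries, name = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):
            key, _, value = line[2:].strip().partition("=")
            metadata[key] = value
        elif line.startswith(">"):
            # FASTA-style name line for the next sequence.
            name = line[1:].strip()
        else:
            entries.append((name, line))
    return metadata, entries

example = """## fileformat=ProForma2.0
## scoretype=percentage
> header
PROT[Phospho|#mod:20]EOFORMS[Phospho|#mod:80]
"""
metadata, entries = read_proforma_file(example)
print(metadata["scoretype"], entries)
```

With the score type in the metadata, the numbers in the tags can stay plain and a consumer can look up how to interpret them.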

acesnik commented 6 years ago

New issues to consider from this pull request https://github.com/topdownproteomics/sdk/pull/57

On range notation:

Is the amino acid T included in the range "A" within PROT[mass:19|A->]EOSFORMS[<-A]? If so, we might want to rethink this syntax as the arrows seem to indicate otherwise. Thoughts?

I see your point; it does point only to the internal sequence.

But I think this comes back to the left-right issue. Each tag pertains to the amino acid to the left in ProForma v1. I suppose we could go with one of the earlier suggestions PROT[mass:19|start:group1]EOSFORMS[end:group1] but that's a bit less readable.

Or we could consider something with the exclusive/inclusive mathematical range notation (0,5] etc. Perhaps PROT[mass:19|group1(T->]EOSFORMS[<-S)group1].
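
To see what is at stake, here is a short Python sketch that reads the same string both ways. It assumes the arrow syntax from the question above and that each tag sits on the amino acid to its left; the `include_tagged_start` flag, `TOKEN`, and `range_span` are invented for this illustration and are not a proposal for the notation itself.

```python
import re

# Assumed token shape: one residue letter, optionally followed by a [tag].
TOKEN = re.compile(r"([A-Z])(?:\[([^\]]*)\])?")

def range_span(proforma, range_id, include_tagged_start):
    """Return the residues covered by a range under one of the two readings."""
    residues, start, end = [], None, None
    for match in TOKEN.finditer(proforma):
        residue, tag = match.groups()
        residues.append(residue)
        for descriptor in (tag or "").split("|"):
            if descriptor == f"{range_id}->":
                start = len(residues)  # 1-based position of the tagged residue
            elif descriptor == f"<-{range_id}":
                end = len(residues)
    if start is None or end is None:
        raise ValueError(f"range {range_id!r} is not both opened and closed")
    first = start if include_tagged_start else start + 1
    return "".join(residues[first - 1:end])

sequence = "PROT[mass:19|A->]EOSFORMS[<-A]"
print(range_span(sequence, "A", include_tagged_start=True))
print(range_span(sequence, "A", include_tagged_start=False))
```

The inclusive reading gives TEOSFORMS and the exclusive one gives EOSFORMS, so whichever notation we pick has to make that one-residue difference unambiguous.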

On unlocalized tags

The unlocalized tag on [mass:80|Phospho|<->]PROTEOSFORMS seems to be "homeless" and could be mistaken for an N-terminal modification. Maybe we should consider something that looks more like the prefix or terminal tags, e.g. [mass:80|Phospho]?PROTEOSFORMS. Thoughts?

The confusion with N-terminal mods could be a problem, so we should consider that. That said, I think seeming homeless is the intent. We don't know what amino acid to assign it to. The question-mark notation does do a nice job of distinguishing these from N-terminal mods.

acesnik commented 6 years ago

https://github.com/topdownproteomics/sdk/pull/57#discussion_r216699023

acesnik commented 6 years ago

What about PRO[Phospho|#group]->TEOFORMS<-[#group] for ranges?

acesnik commented 6 years ago

It might be a bit weird to have the tag to the left of amino acids at the start of the range, but we are doing that with ? for unlocalized mods, now.