Open acesnik opened 6 years ago
We should build a way to specify two kinds of ambiguity: 1) groups of amino acids, like all T, S, and Y in a sequence; 2) regions of ambiguity, like an unidentified mass along a fragment or along a whole sequence.
Specifying indices of other sites is helpful for computers, but it's terrible for human readability. I don't want to have to count 200 amino acids to the right of a site to find the next possible site and so on.
We have the info key as a catchall for random other things we want to build in. I suggest we use that to make these groupings.
1) PROT[Phospho|info:group:1]EOFORMS[info:group:1]
In this example, we have a phospho that might be at one of two locations. Within the value of the info tags, we use the subkey group to denote the ambiguity group, followed by a unique (sub)value for the group. Each subsequent location has only the unique group information within the info tag.
Note that Phospho is only specified at the first mention of this group. It would get very confusing if it were specified at every one, since it would look like there was a phospho identified at each of those sites.
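To make the grouping rule concrete, here is a minimal Python sketch of how a reader might collect the candidate sites of each group. The function names are hypothetical (they are not part of ProForma or any SDK), and the sketch assumes single-letter uppercase residues, tags attached to the residue on their left, and descriptors separated by `|`.

```python
import re
from collections import defaultdict

# One token is either a residue letter or a whole bracketed tag.
TOKEN_RE = re.compile(r"([A-Z])|\[([^\]]*)\]")

def parse_tagged_sequence(text):
    """Return a list of (residue, descriptors) pairs.

    Assumption: each tag belongs to the residue on its left.
    A tag that appears before any residue is ignored in this sketch.
    """
    residues = []
    for match in TOKEN_RE.finditer(text):
        letter, tag = match.groups()
        if letter is not None:
            residues.append((letter, []))
        elif residues:
            residues[-1][1].extend(tag.split("|"))
    return residues

def ambiguity_groups(text):
    """Map each info:group id to its modification(s) and candidate sites."""
    groups = defaultdict(lambda: {"mods": [], "sites": []})
    parsed = parse_tagged_sequence(text)
    for position, (letter, descriptors) in enumerate(parsed, start=1):
        ids = [d.split(":", 2)[2] for d in descriptors
               if d.startswith("info:group:")]
        mods = [d for d in descriptors if not d.startswith("info:")]
        for group_id in ids:
            groups[group_id]["sites"].append((position, letter))
            # The modification is named only at the first mention of the group.
            groups[group_id]["mods"].extend(mods)
    return dict(groups)
```

For the example above, `ambiguity_groups("PROT[Phospho|info:group:1]EOFORMS[info:group:1]")` yields one group, `"1"`, with the modification `Phospho` and the candidate sites T at position 4 and S at position 11.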
2) PROT[mass:19|info:start:1]EOFORMS[info:end:1]
In this example, we have an unidentified mass that was found along a fragment of the sequence. The unidentified mass (+19 Da in this case) is specified at the start of this region. The start of the region is specified within an info descriptor, with a start subkey and a unique (sub)value for the region. The end of the region is specified within an info descriptor, too, with an end subkey and the same unique (sub)value for the region.
I like this in general and I think it is very clean, but I had imagined going further with ambiguity in ProForma v2. What you proposed is still valid in the current standard. I was thinking of it as the addition of specific keys or special characters. For example, how about something like this (using your examples above):
PROT[Phospho|#test123]EOFORMS[#test123]
PROT[mass:19|A->]EOFORMS[<-A]
You get the idea, hashtags for grouping and arrows for ranges. I'm also toying around with the idea of using a ? character to specifically denote ambiguity. My thought is that this could help downstream consumers to more easily determine what is fully characterized and what isn't.
This seems really good for "nesting" unknown mass shifts. For example:
PRO[mass:19|A->]TEO[mass:99|B->]FO[<-A]RMS[<-B]
This reads as mass 19 in sequence region TEOFO and mass 99 in sequence region FORMS.
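The reading above can be checked with a small Python sketch. It assumes that each tag belongs to the residue on its left, and that a region begins with the residue after the start tag and ends with the residue carrying the end tag. The function name `arrow_regions` is hypothetical and only meant for illustration.

```python
import re

# One token is either a residue letter or a whole bracketed tag.
TOKEN_RE = re.compile(r"([A-Z])|\[([^\]]*)\]")

def arrow_regions(text):
    """Return {label: subsequence} for ranges marked 'X->' ... '<-X'.

    Assumption: the region starts AFTER the residue that carries the
    start tag and ends WITH the residue that carries the end tag.
    """
    sequence = []   # residues seen so far
    starts = {}     # label -> number of residues seen at the start tag
    regions = {}
    for match in TOKEN_RE.finditer(text):
        letter, tag = match.groups()
        if letter is not None:
            sequence.append(letter)
            continue
        for descriptor in tag.split("|"):
            if descriptor.endswith("->"):
                starts[descriptor[:-2]] = len(sequence)
            elif descriptor.startswith("<-"):
                label = descriptor[2:]
                if label in starts:  # ignore an end tag with no start tag
                    regions[label] = "".join(sequence[starts[label]:])
    return regions
```

Because the open labels are kept in a dictionary rather than a stack, the regions may overlap freely, as A and B do in the example: the sketch returns `TEOFO` for A and `FORMS` for B.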
I like the idea of special characters to denote ambiguity and the #, ->, and <- for doing so. We would need to examine the collision with how we're specifying modifications without keys (for human readability; Unimod Interim name by default).
Would you be opposed to putting these codes into an info descriptor to not break ProForma v1?
PROT[Phospho|info:#test123]EOFORMS[info:#test123]
PROT[mass:19|info:A->]EOFORMS[info:<-A]
We could also propose making new keys:
PROT[Phospho|#:test123]EOFORMS[#:test123]
PROT[mass:19|->:A]EOFORMS[<-:A]
We could also use the vi symbols for beginning and end, i.e. ^ and $:
PROT[Phospho|info:#test123]EOFORMS[info:#test123]
PROT[mass:19|info:^A]EOFORMS[info:$A]
I also want to note how using just '>' and '<' instead of arrows might look:
PROT[Phospho|info:#test123]EOFORMS[info:#test123]
PROT[mass:19|info:A>]EOFORMS[info:<A]
I like this discussion, should be great for the next meeting.
For the groups, I feel like the hash symbol needs to be directly next to the group name to make it look like a hash tag that people expect. So I like [info:#test123]
over the new key approach.
Re: the ranges, I like almost everything proposed except the vi stuff. Wouldn't help readability for most people IMO (but hey, I'm an emacs guy). I find the arrow a bit more readable than just the less than/greater than symbols, but I don't feel that strongly.
I like the arrow. It's better than > because that is too close to a FASTA header for my liking (e.g. ">sp|P17677|NEU...").
Yep, very nice discussion!
I have a few comments:
a) It might be good to avoid using too many special characters, not only because they are already reserved for special purposes (e.g. # for comments) but also because special characters can be problematic when the proteoform notation is used in a document (paper, report, ...). For instance, think about submitting an abstract to a conference. These webpages can be quite demanding.
b) Groups: I suggest we reserve a specific key for them like "group:" instead of adding them to the info tag. Something like [group:A].
c) Groups: Many times, ambiguous sites get assigned a probability or score. What about something like [group:A:95]?
d) Ranges: I am on Ryan's side (emacs) but would avoid the "<,>". What about PROT[mass:19|range:A-]EOFORMS[range:-A]?
I feel like statistical measures (probability score, localization scores, FDRs) are often particular to certain software. I suggest we place those in info tags.
If we are willing to make new keys, we could use group:, start:, and end:, too. That would eliminate special characters if that's helpful.
Here's an example. I see Shortreed's point that arrows make this easier to scan quickly.
PRO[mass:19|start:A]TEO[mass:99|start:B]FO[end:A]RMS[end:B]
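Since the arrow form and the keyed form carry the same information, one can be rewritten into the other mechanically. The following Python sketch shows one way to do that; `arrows_to_keys` is a hypothetical helper, and it assumes descriptors are separated by `|` inside square brackets.

```python
import re

def arrows_to_keys(text):
    """Rewrite 'A->' as 'start:A' and '<-A' as 'end:A' inside every tag."""
    def convert_tag(match):
        converted = []
        for descriptor in match.group(1).split("|"):
            if descriptor.endswith("->"):
                converted.append("start:" + descriptor[:-2])
            elif descriptor.startswith("<-"):
                converted.append("end:" + descriptor[2:])
            else:
                converted.append(descriptor)  # leave everything else as-is
        return "[" + "|".join(converted) + "]"

    # Only the text inside square brackets is touched.
    return re.sub(r"\[([^\]]*)\]", convert_tag, text)
```

Applied to the arrow version of this example, `PRO[mass:19|A->]TEO[mass:99|B->]FO[<-A]RMS[<-B]`, the sketch produces the keyed string shown above. A descriptor counts as a range marker only when the arrow sits at its very end or very start, so a name with an arrow in the middle is left alone.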
Regarding ">" and "<", I don't think we need to be too concerned with them in descriptors. There are already plenty of Unimod entries with these characters.
I also thought about the statistical measures such as a score for the localization. If we use hashes, then the hash plus the site name (i.e. #mod) could become the key and the score the value. This is a bit crazy but why not?
PROT[Phospho|#mod:20]EOFORMS[Phospho|#mod:80]
I like that idea, @veitveit. The default with no score (no value) could be assumed to mean equal scores.
How would we specify what kind of score they are? Info tags? I imagine we could have probabilities be the default, but there are many scores out there that could be applied, like counts of observations.
If we assume that there is only one type of score, then it would not matter in the end whether the score gives probabilities, expectation values, or other measures. I would just say that it should be a number.
That seems reasonable. I suppose we could require a header in the file format with a note on which score was used.
## fileformat=ProForma2.0
## scoretype=percentage
> header
PROT[Phospho|#mod:20]EOFORMS[Phospho|#mod:80]
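As a rough illustration of how such a file could be consumed, the Python sketch below reads the `##` header lines and the per-site scores. Everything here is an assumption made for the example: the function names are hypothetical, the header layout is only the one proposed in this thread, and the equal-share default uses a total of 100 because the example declares percentages.

```python
import re

# One token is either a residue letter or a whole bracketed tag.
TOKEN_RE = re.compile(r"([A-Z])|\[([^\]]*)\]")

def read_header(lines):
    """Collect '## key=value' lines into a dict; other lines are skipped."""
    header = {}
    for line in lines:
        if line.startswith("##") and "=" in line:
            key, value = line[2:].strip().split("=", 1)
            header[key.strip()] = value.strip()
    return header

def site_scores(text, total=100.0):
    """Map each '#name' group to a list of (position, residue, score).

    If no site in a group carries a value, the sites share `total`
    equally. Otherwise a site without a value keeps the score None.
    """
    groups = {}
    position, residue = 0, None
    for match in TOKEN_RE.finditer(text):
        letter, tag = match.groups()
        if letter is not None:
            position, residue = position + 1, letter
            continue
        for descriptor in tag.split("|"):
            if descriptor.startswith("#"):
                name, _, value = descriptor[1:].partition(":")
                score = float(value) if value else None
                groups.setdefault(name, []).append((position, residue, score))
    for name, sites in list(groups.items()):
        if all(score is None for _, _, score in sites):
            share = total / len(sites)
            groups[name] = [(p, r, share) for p, r, _ in sites]
    return groups
```

With the example file, `read_header` returns the `fileformat` and `scoretype` entries, and `site_scores` reports T at position 4 with a score of 20 and S at position 11 with a score of 80. The scores stay plain numbers; interpreting them is left to whatever `scoretype` declares.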
New issues to consider from this pull request https://github.com/topdownproteomics/sdk/pull/57
"Is the amino acid T included in the range "A" within
PROT[mass:19|A->]EOSFORMS[<-A]
? If so, we might want to rethink this syntax as the arrows seem to indicate otherwise. Thoughts?"
I see your point; it does point only to the internal sequence.
But I think this comes back to the left-right issue. Each tag pertains to the amino acid to the left in ProForma v1. I suppose we could go with one of the earlier suggestions
PROT[mass:19|start:group1]EOSFORMS[end:group1]
but that's a bit less readable. Or we could consider something with the exclusive/inclusive mathematical range notation, (0,5] etc. Perhaps:
PROT[mass:19|group1(T->]EOSFORMS[<-S)group1]
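The two possible readings of the start tag differ by exactly one residue, which a short Python sketch can make explicit. It uses the start:/end: form from above; `keyed_region` and its `include_start` flag are hypothetical names introduced only for this illustration.

```python
import re

# One token is either a residue letter or a whole bracketed tag.
TOKEN_RE = re.compile(r"([A-Z])|\[([^\]]*)\]")

def keyed_region(text, label, include_start):
    """Return the residues covered by start:<label> ... end:<label>.

    include_start=True  -> the residue carrying the start tag is included.
    include_start=False -> the region begins with the next residue.
    The residue carrying the end tag is included in both readings.
    """
    sequence = []
    start = end = None
    for match in TOKEN_RE.finditer(text):
        letter, tag = match.groups()
        if letter is not None:
            sequence.append(letter)
            continue
        descriptors = tag.split("|")
        if "start:" + label in descriptors:
            start = len(sequence)  # 1-based position of the tagged residue
        if "end:" + label in descriptors:
            end = len(sequence)
    if start is None or end is None:
        return None
    first = max(start - 1, 0) if include_start else start
    return "".join(sequence[first:end])
```

For `PROT[mass:19|start:group1]EOSFORMS[end:group1]`, the inclusive reading gives `TEOSFORMS` and the exclusive reading gives `EOSFORMS`. Whichever reading is chosen, it needs to be stated explicitly, since neither the arrows nor the keys settle it on their own.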
The unlocalized tag on
[mass:80|Phospho|<->]PROTEOSFORMS
seems to be "homeless" and could be mistaken for an N-terminal modification. Maybe we should consider something that looks more like the prefix or terminal tags, e.g. [mass:80|Phospho]?PROTEOSFORMS. Thoughts?
The confusion with N-terminal mods could be a problem, so we should consider that. That said, I think seeming homeless is the intent. We don't know what amino acid to assign it to. The question-mark notation does do a nice job of distinguishing these from N-terminal mods.
What about PRO[Phospho|#group]->TEOFORMS<-[#group] for ranges?
It might be a bit weird to have the tag to the left of amino acids at the start of the range, but we are doing that with ? for unlocalized mods, now.
This issue was left unanswered in the standard. Let's come up with a proposal, implement it, and see if the committee for the standard has any issues with it.