zrolfs closed this issue 7 years ago
Zach, please coordinate this with Rob, since this feature would be closely intertwined with his parsimony code. It sounds useful, but I need more detail to understand the visualization part better!
It could also be useful to note where coverage wasn't observed, and thus where PTM sites could not be observed -- clearly annotating missing information.
How about something like this?
mgkgtpsfgkrhnkshtlcnrcgrrSFHVQKKTCSSCGYPAAKtrsynwgakakrrHTTGTGRmrylkhvsrrFKN[Deamidation of N]GFQTGSASKasa
lowercase = residue not observed
uppercase = residue observed
brackets = mod at that location, same notation as PSM output
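For illustration, here is a minimal Python sketch of how that string could be built. The function name, the 1-based inclusive ranges, and the `mods` dictionary are all assumptions made for the example; none of this is taken from the MetaMorpheus or parsimony code.

```python
def annotate_coverage(sequence, observed_ranges, mods):
    """Render a protein sequence in the proposed notation.

    sequence:        full protein sequence (any case)
    observed_ranges: hypothetical list of (start, end) residue ranges,
                     1-based and inclusive, covered by identified peptides
    mods:            hypothetical dict mapping 1-based position -> mod name
    """
    observed = [False] * len(sequence)
    for start, end in observed_ranges:
        for i in range(start - 1, end):
            observed[i] = True

    parts = []
    for i, residue in enumerate(sequence):
        # Uppercase for observed residues, lowercase for unobserved ones.
        parts.append(residue.upper() if observed[i] else residue.lower())
        mod = mods.get(i + 1)
        if mod is not None:
            # The mod follows the residue it sits on, as in the PSM output.
            parts.append(f"[{mod}]")
    return "".join(parts)


print(annotate_coverage("MGKSFHNG", [(4, 8)], {7: "Deamidation of N"}))
```

With the toy input above, residues 1-3 come out lowercase and the deamidation is attached to the N at position 7.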
I like Rob's suggestion, but that can still be cumbersome for a user to wade through to find differences in observed PTMs between different samples/conditions. I think we would still benefit by implementing the PTMs with the observed residues to provide an overview of what PTMs were found, but perhaps also have an additional column with an output like:
aa76v:[Deamidation of N] | aa101:[Acetylation] | etc.
Where:
"aa#" is the position of the PTM
"v:" signifies that the given residue was observed both with and without that PTM
"[*]" is the mod at that location, same notation as PSM output
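A rough sketch of how that extra column could be generated, assuming the PTM sites for a protein are already available as (position, mod name, also seen unmodified) tuples. That tuple layout and the function name are hypothetical, chosen only for this example.

```python
def format_site_list(sites):
    """Build the proposed summary column for one protein.

    sites: hypothetical list of (position, mod_name, also_seen_unmodified)
           tuples, with 1-based residue positions.
    """
    entries = []
    for position, mod, also_seen_unmodified in sorted(sites):
        # "v" marks a residue seen both with and without the mod.
        flag = "v" if also_seen_unmodified else ""
        entries.append(f"aa{position}{flag}:[{mod}]")
    return " | ".join(entries)


print(format_site_list([
    (101, "Acetylation", False),
    (76, "Deamidation of N", True),
]))
```

Sorting by position keeps the column easy to compare across samples or conditions.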
You could put an estimated occupancy ratio (e.g. by PSM count) inside the brackets:
xxxxxxxXXXX[mod1|info:occupancy=1]XXXX[mod2|info:occupancy=0.25][mod3|info:occupancy=0.5]xxXXXXXXXXX
In this example:
mod1 is observed in all PSMs
mod2 and mod3 are observed at the same site, which is unmodified 25% (1-0.25-0.5) of the time
mod2 is observed 25% of the time
mod3 is observed 50% of the time
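To make the arithmetic concrete, here is a small Python sketch that reads a string in this notation and works out the unmodified fraction at each modified site. It assumes every bracket has exactly the `[name|info:occupancy=value]` form shown above; the function name and the regular expression are illustrative only.

```python
import re
from collections import defaultdict

# Either one bracketed mod with an occupancy value, or a single residue letter.
TOKEN = re.compile(r"\[([^\]|]+)\|info:occupancy=([0-9.]+)\]|([A-Za-z])")


def unmodified_fraction(annotated):
    """Map each modified site (1-based) to 1 minus the sum of its occupancies."""
    position = 0
    site_occupancy = defaultdict(dict)
    for match in TOKEN.finditer(annotated):
        if match.group(3):
            position += 1  # a residue letter advances the position
        else:
            # A bracket belongs to the residue immediately before it.
            site_occupancy[position][match.group(1)] = float(match.group(2))
    return {pos: 1.0 - sum(mods.values()) for pos, mods in site_occupancy.items()}


example = ("xxxxxxxXXXX[mod1|info:occupancy=1]XXXX"
           "[mod2|info:occupancy=0.25][mod3|info:occupancy=0.5]xxXXXXXXXXX")
print(unmodified_fraction(example))
```

For the example string, the site carrying mod1 comes out fully modified, and the site shared by mod2 and mod3 comes out 25% unmodified.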
I like the occupancy idea. It would still be nice to have a list of modifications with their indexes, though, so that you can easily compare what PTMs were observed where without having to scan through each protein.
I was thinking about occupancy last night, and if we're estimating occupancy by PSMs, then might it be useful to include the number of modified PSMs observed over the total? Rather than put "occupancy=0.5", have something like "occupancy=0.5 (2/4)", where two modified and two unmodified PSMs were detected. I'm worried that it might start to look cluttered, but I think that's valuable information for determining the accuracy of that ratio. For example, I would have greater confidence that all of the proteins possessed a given modification if it read "occupancy=1 (6/6)" rather than "occupancy=1 (1/1)". Likewise, the uncertainty in the occupancy for "occupancy=0.5 (1/2)" is much greater than for "occupancy=0.5 (5/10)". This would also provide a rough quantification when users are scanning the output.
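One way to see why the counts matter is to put a rough interval around the ratio. The sketch below uses a Wilson score interval, which is my own choice for illustration and not something proposed in this thread or implemented in MetaMorpheus; the function name and the 95% default are assumptions.

```python
import math


def wilson_interval(modified, total, z=1.96):
    """Approximate confidence interval for an occupancy estimated from PSM counts.

    modified: number of PSMs carrying the mod
    total:    number of PSMs covering the site
    z:        normal quantile (1.96 is roughly a 95% interval)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = modified / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


for modified, total in [(1, 1), (6, 6), (1, 2), (5, 10)]:
    low, high = wilson_interval(modified, total)
    print(f"occupancy={modified / total:g} ({modified}/{total}): "
          f"{low:.2f} to {high:.2f}")
```

The interval for 1/2 is noticeably wider than for 5/10, and the lower bound for 6/6 sits well above the one for 1/1, which matches the intuition above.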
Here's the CTDP document on proteoform nomenclature. You could use this format for annotating the PTM information. https://docs.google.com/document/d/1SpAQR8aPc2cCXXSjUobg_VXC85br1WukBKNZh5TPaQ8/edit#
What does "occupancy=1 (6/6)" mean? If there is an example that shows how to calculate it, that would be great. Right now I can print something like this:... #aa324[Calcium on D|info:occupancy=]#aa...
It would mean that out of 6 PSMs detected for the peptide base sequence, all 6 have that modification. So, for example, say you observe these PSMs:
PEPTIDE[mod1]
PEPTIDE[mod1]
PEP[mod2]TIDE[mod1]
PEPTIDE[mod1]
You could say: [mod1:occupancy=1 (4/4)] [mod2:occupancy=0.25 (1/4)]
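Here is a minimal Python sketch of that calculation. It assumes PSMs arrive as strings in the bracket notation used above and that occupancy is counted only over PSMs sharing the same base sequence (overlapping peptides are ignored); the function names are made up for the example.

```python
import re
from collections import Counter


def parse_psm(psm):
    """Split a PSM string into its base sequence and a set of (position, mod)."""
    base = []
    mods = set()
    for match in re.finditer(r"\[([^\]]+)\]|([A-Z])", psm):
        if match.group(2):
            base.append(match.group(2))
        else:
            # The mod sits on the residue just before the bracket (1-based).
            mods.add((len(base), match.group(1)))
    return "".join(base), mods


def occupancy_by_psm_count(psms):
    """Map (base sequence, position, mod) to (modified PSMs, total PSMs)."""
    totals = Counter()
    modified = Counter()
    for psm in psms:
        base, mods = parse_psm(psm)
        totals[base] += 1
        for position, mod in mods:
            modified[(base, position, mod)] += 1
    return {key: (count, totals[key[0]]) for key, count in modified.items()}


def format_occupancy(mod, count, total):
    return f"[{mod}:occupancy={count / total:g} ({count}/{total})]"


psms = ["PEPTIDE[mod1]", "PEPTIDE[mod1]", "PEP[mod2]TIDE[mod1]", "PEPTIDE[mod1]"]
for (base, position, mod), (count, total) in sorted(occupancy_by_psm_count(psms).items()):
    print(position, format_occupancy(mod, count, total))
```

For the four PSMs above this gives 1/4 for mod2 at position 3 and 4/4 for mod1 at position 7.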
I wrote some code yesterday to output the number of sites on each protein where a given PTM occurs. I think adding that output to MetaMorpheus would be a fast, easy way for users to visualize the PTM sites observed for each protein. I'll clean up the code and do a pull request sometime, if you think it would be worthwhile?
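For anyone following along, a per-protein site count of this kind might look roughly like the Python sketch below. This is not the code mentioned above, only a guess at the general shape; the input format of (protein, position, mod) tuples and the function name are assumptions.

```python
from collections import defaultdict


def count_ptm_sites(observations):
    """Count distinct modified sites per (protein, mod).

    observations: hypothetical iterable of (protein, position, mod_name)
                  tuples, one per modified PSM; repeated sites count once.
    """
    sites = defaultdict(set)
    for protein, position, mod in observations:
        sites[(protein, mod)].add(position)
    return {key: len(positions) for key, positions in sites.items()}


observations = [
    ("ProteinA", 76, "Deamidation of N"),
    ("ProteinA", 76, "Deamidation of N"),  # same site seen in a second PSM
    ("ProteinA", 101, "Acetylation"),
    ("ProteinB", 12, "Acetylation"),
]
for (protein, mod), n_sites in sorted(count_ptm_sites(observations).items()):
    print(protein, mod, n_sites)
```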