Breaking connections during experimental proteoform identification

acesnik commented 7 years ago

Some ambiguous families seem to have connections that are likely spurious mass differences that happened to fall within peaks. The problem of selecting edges to remove from families is different from, and somewhat harder than, assigning identities to edges. In deciding whether to break an edge, one must first consider whether any of the multiple assignments for the connected experimentals is correct, and then decide whether the family is better explained without that edge.

Example 1: A [-42 Da] edge connects two proteoforms. In tracing the graph, this edge is interpreted as the loss of acetylation. However, the starting node doesn't have an acetylation to remove. Should this edge be broken? One must also consider the possibility that acetylation could be added to the target node, and whether the assignment to the starting node was correct in the first place.
Example 2: Below is a picture of a family that has two connections that should be broken, and it illustrates the difficulty of breaking them. First, the 0.02 Da mass difference from a modified theoretical node is used to assign an experimental node. However, this mass difference is a weaker explanation than the 16.00 Da oxidation from an EE edge. Therefore, the ET edge should be broken. Second, there is an EE edge with a 27.01 Da mass difference that cannot be assigned, and it joins two halves of the ambiguous family. I believe that one should be broken, too.
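
To make the second case in Example 2 concrete, here is a minimal sketch of one way to flag an unassigned EE edge that "joins two halves" of a family: remove each unassigned edge in turn and check whether the family falls apart. This is only an illustration, not ProteoformSuite code; the function names, the tuple layout, and the node labels are all hypothetical.

```python
from collections import defaultdict, deque


def connected(nodes, pairs):
    """Return True if every node can be reached from the first node."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    nodes = list(nodes)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def unassigned_bridges(nodes, edges):
    """List unassigned edges whose removal splits the family.

    edges: list of (node_a, node_b, assignment), where assignment is
    None when the mass difference could not be assigned (hypothetical layout).
    """
    pairs = [(a, b) for a, b, _ in edges]
    candidates = []
    for i, (a, b, assignment) in enumerate(edges):
        if assignment is not None:
            continue  # only unassigned edges are candidates for breaking
        remaining = pairs[:i] + pairs[i + 1:]
        if not connected(nodes, remaining):
            candidates.append((a, b))
    return candidates


# Hypothetical family: two halves joined only by an unassigned 27.01 Da edge.
family_nodes = ["T1", "E1", "E2", "E3", "T2"]
family_edges = [
    ("T1", "E1", "unmodified"),
    ("E1", "E2", "oxidation"),
    ("E2", "E3", None),  # the 27.01 Da mass difference
    ("E3", "T2", "unmodified"),
]
bridges = unassigned_bridges(family_nodes, family_edges)
```

In this toy family only the `("E2", "E3")` edge is flagged. An unassigned edge inside a cycle would not be flagged, since removing it leaves the family connected.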

acesnik commented 7 years ago

I'm not sure how to generalize the principles for breaking edges, i.e. how to search the graph for which edge to break, but we know we can limit the problem to ambiguous families and maybe also hairnet families.

leahvschaffer commented 7 years ago

I think with Example 1, it's too ambiguous to break connections. Since it's an ambiguous family, I think it makes sense to leave that node as is. In unambiguous families, however, we could consider breaking edges that don't make sense, like -42 Da from a node without any PTMs, for example. Do you think automatic peak picking will help with Example 2? Then the 27.01 would never be picked in the first place. I think for the peak picking I was having it look plus or minus half of the peak width around the PTMset mass, so that 0.02 would possibly not have been picked in the first place either.
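
A rough sketch of the window check described above: a mass difference is picked only if it lies within half of the peak width of some PTM set mass. The function name, the list of PTM set masses, and the 0.03 Da peak width are assumptions chosen for illustration, not values taken from ProteoformSuite.

```python
def picked_ptm_mass(delta_mass, ptm_set_masses, peak_width):
    """Return the closest PTM set mass within half the peak width, else None."""
    half_width = peak_width / 2
    matches = [m for m in ptm_set_masses if abs(delta_mass - m) <= half_width]
    if not matches:
        return None
    return min(matches, key=lambda m: abs(delta_mass - m))


# Monoisotopic masses: no change, oxidation, acetylation.
ptm_set_masses = [0.0, 15.9949, 42.0106]
peak_width = 0.03  # assumed value, in Da

oxidation_hit = picked_ptm_mass(16.00, ptm_set_masses, peak_width)
small_shift = picked_ptm_mass(0.02, ptm_set_masses, peak_width)
unassigned = picked_ptm_mass(27.01, ptm_set_masses, peak_width)
```

With this assumed width, 16.00 Da is picked as oxidation, while 0.02 Da and 27.01 Da match nothing. A wider peak width would let the 0.02 Da difference through, so the outcome depends on that setting.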

acesnik commented 7 years ago

I agree it would be a good start to remove edges that correspond to PTM loss from unmodified proteoforms.
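
As a starting point, that rule could look something like the sketch below: an edge interpreted as the loss of a PTM is flagged when its starting node carries no PTMs at all. The data layout and names are hypothetical, and the sketch assumes the direction of each edge and the assignment of the starting node are already trusted, which Example 1 shows is not always safe.

```python
def edges_to_break(edges, ptms_by_node):
    """Flag PTM-loss edges that start at an unmodified proteoform.

    edges: list of (start, target, lost_ptm), where lost_ptm is the PTM
    the edge is interpreted as removing, or None for other edges.
    ptms_by_node: mapping from node name to the PTMs assigned to it.
    """
    flagged = []
    for start, target, lost_ptm in edges:
        if lost_ptm is None:
            continue  # not a PTM-loss edge
        if not ptms_by_node.get(start):
            flagged.append((start, target))  # nothing to lose on this node
    return flagged


# Hypothetical edges and assignments.
example_edges = [
    ("E1", "E2", "acetylation"),  # loss from an unmodified node
    ("E3", "E4", "acetylation"),  # loss from an acetylated node
    ("E1", "E3", None),
]
example_ptms = {"E1": [], "E3": ["acetylation"]}
flagged_edges = edges_to_break(example_edges, example_ptms)
```

Only the `("E1", "E2")` edge is flagged here, because `E1` has no modification to remove.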

For Example 2, I think automatic peak picking would eliminate the problem of the 27.01 Da mass difference, but I think it's risky to always do that, since we would miss unexpected modifications.

leahvschaffer commented 7 years ago

I don't think it's an either/or between peak picking and setting a hard cutoff based on min peak count. For any dataset, I think it's good to at least glance at EE and make sure there are no new unaccounted-for peaks. But I also think the other automated way of accepting peaks (just setting min peak count and moving on), while it might not miss an unexpected mod, lets in lots of bad noise. Maybe that's more of a problem with unlabeled data, though.

leahvschaffer commented 7 years ago

Looking at this ambiguous family, if we don't choose the 27.01 peak, that takes care of that edge. I don't know if this would be too difficult to implement, but the oxidation explanation is much simpler than the other 0.02 identification, so somehow this could be used to separate out ambiguous families.

What does the ET histogram look like for this data? Is there a big peak at 0.02 or is it a small satellite peak of a large peak closer to 0? In my calibrated data, I'll see the large peak at 0, then smaller peaks hardly above noise around plus or minus .02. I don't accept these.
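
One way to express that satellite-peak check, purely as a sketch: reject a peak when a much larger peak sits within a small mass window of it. The function name, the 0.03 Da window, the 25% count fraction, and the example counts are all made up for illustration.

```python
def accepted_peaks(peaks, window=0.03, min_fraction=0.25):
    """Drop peaks that look like small satellites of a larger nearby peak.

    peaks: list of (mass_difference, count).
    A peak is a satellite if another peak within `window` Da has a count
    large enough that this peak falls below `min_fraction` of it.
    """
    accepted = []
    for mass, count in peaks:
        is_satellite = any(
            abs(other_mass - mass) <= window and other_count * min_fraction > count
            for other_mass, other_count in peaks
        )
        if not is_satellite:
            accepted.append((mass, count))
    return accepted


# Hypothetical ET histogram peaks: a large peak at 0 with small neighbors.
et_peaks = [(-0.02, 12), (0.0, 200), (0.02, 15), (15.995, 80)]
kept = accepted_peaks(et_peaks)
```

With these made-up counts, the peaks at plus or minus 0.02 are dropped and the peaks at 0 and 15.995 are kept. A large standalone peak at 0.02 would be kept, since it would not be small relative to its neighbor.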

leahvschaffer commented 4 years ago

I added an option to break bad EE connections.

smith-chem-wisc / ProteoformSuite

Breaking connections during experimental proteoform identification #350