smith-chem-wisc / ProteoformSuite

Construction, quantification, and visualization of proteoform families
https://smith-chem-wisc.github.io/ProteoformSuite/
GNU General Public License v3.0

Breaking connections during experimental proteoform identification #350

Closed acesnik closed 4 years ago

acesnik commented 7 years ago

Some ambiguous families seem to have connections that are likely spurious mass differences that happened to fall within peaks. The problem of selecting edges to remove from families is different from, and somewhat harder than, assigning identities to edges. In deciding whether to break an edge, one must first consider whether any of the multiple assignments for the connected experimentals is correct, and then decide whether the family is better explained without that edge.

acesnik commented 7 years ago

I'm not sure how to generalize the principles for breaking edges, i.e. how to search the graph for which edge to break, but we know we can limit the problem to ambiguous families and maybe also hairnet families.

leahvschaffer commented 7 years ago

I think with Example 1, it's too ambiguous to break connections. Since it's an ambiguous family, I think it makes sense to leave that node as is. In unambiguous families, however, we could consider breaking edges that don't make sense, for example -42 from a node without any PTMs. Do you think automatic peak picking will help with Example 2? Then the 27.01 would never be picked in the first place. For the peak picking, I think I was having it look ± half of the peak width around the PTMset mass, so that 0.02 would possibly not have been picked in the first place either.
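A rough sketch of the window test described above (accept a mass difference only if it falls within ± half of the peak width around a PTM set mass). The function names, the 0.03 Da peak width, and the list of PTM set masses are assumptions for illustration, not ProteoformSuite's actual code or defaults.

```python
def within_peak_window(delta_mass, ptm_set_mass, peak_width):
    """Return True if delta_mass lies within +/- half the peak width of ptm_set_mass."""
    return abs(delta_mass - ptm_set_mass) <= peak_width / 2.0


def pick_peaks(delta_masses, ptm_set_masses, peak_width):
    """Keep only the mass differences that sit close to some known PTM set mass."""
    return [
        dm
        for dm in delta_masses
        if any(within_peak_window(dm, m, peak_width) for m in ptm_set_masses)
    ]


# Hypothetical inputs: monoisotopic shifts for no change, oxidation, acetylation.
ptm_set_masses = [0.0, 15.9949, 42.0106]
observed = [0.02, 15.99, 27.01, 42.01]

# With an assumed peak width of 0.03 Da, the 0.02 and 27.01 differences are not picked.
picked = pick_peaks(observed, ptm_set_masses, peak_width=0.03)
print(picked)
```

The trade-off is the one raised in this thread: anything that is not near a listed PTM set mass is dropped, including real but unexpected modifications.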

acesnik commented 7 years ago

I agree it would be a good start to remove edges that correspond to PTM loss from unmodified proteoforms.
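As a minimal sketch of that first rule: break any edge whose mass difference has been assigned as a PTM loss when the source proteoform is known to carry no PTMs. The `Edge` class, the `ptm_counts` mapping, and the tolerance are hypothetical names and values for illustration; they are not the data structures used in ProteoformSuite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str        # proteoform the modification would be lost from
    target: str        # proteoform after the loss or gain
    delta_mass: float  # target mass minus source mass, in Da


def break_impossible_losses(edges, ptm_counts, tolerance=0.05):
    """Split edges into (kept, broken).

    An edge is broken when it is a mass loss and the source proteoform is
    known to have zero PTMs. Assumes every loss edge passed in has already
    been assigned to a PTM. Proteoforms missing from ptm_counts (no
    identification yet) are left alone.
    """
    kept, broken = [], []
    for edge in edges:
        is_loss = edge.delta_mass < -tolerance
        if is_loss and ptm_counts.get(edge.source) == 0:
            broken.append(edge)
        else:
            kept.append(edge)
    return kept, broken


# Hypothetical family: E1 is unmodified, E4 carries one PTM.
edges = [
    Edge("E1", "E2", -42.01),  # acetyl loss from an unmodified form: break
    Edge("E1", "E3", 15.99),   # oxidation gain: keep
    Edge("E4", "E5", -42.01),  # acetyl loss from a modified form: keep
]
kept, broken = break_impossible_losses(edges, {"E1": 0, "E4": 1})
print([e.target for e in broken])
```

This only covers the unambiguous case. It does not attempt the harder question of which edge to break in an ambiguous family.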

For Example 2, I think automatic peak picking would eliminate the problem of the 27.01 Da mass difference, but I think it's risky to always do that, since we would miss unexpected modifications.

leahvschaffer commented 7 years ago

I don't think it's an either/or between peak picking and setting the hard cutoff based on min peak count. I think for any dataset it's good to at least glance at EE and make sure there are no new unaccounted-for peaks, but I also think the other automated way of accepting peaks (just setting min peak count and moving on), while it might not miss an unexpected mod, lets in lots of bad noise. Maybe more of a problem with unlabeled, though.

leahvschaffer commented 7 years ago

Looking at this ambiguous family: if we don't choose the 27.01 peak, that takes care of that. I don't know if this would be too difficult to implement, but the oxidation explanation is much simpler than the other 0.02 identification, so somehow this could be used to separate out ambiguous families.

What does the ET histogram look like for this data? Is there a big peak at 0.02, or is it a small satellite peak of a large peak closer to 0? In my calibrated data, I'll see the large peak at 0, then smaller peaks hardly above noise around ±0.02. I don't accept these.
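One way the satellite-peak check could be automated, as a sketch only: reject a peak if a much larger peak sits within a small mass window of it. The `(mass, count)` representation, the 0.03 Da window, and the 20% count ratio are assumptions for illustration, not values taken from ProteoformSuite.

```python
def drop_satellite_peaks(peaks, satellite_window=0.03, min_ratio=0.2):
    """Remove small peaks that sit next to a much larger peak.

    peaks: list of (delta_mass, count) tuples from a mass-difference histogram.
    A peak is treated as a satellite if another peak lies within
    satellite_window Da of it and this peak's count is below min_ratio
    times that neighbor's count.
    """
    accepted = []
    for mass, count in peaks:
        is_satellite = any(
            other_mass != mass
            and abs(other_mass - mass) <= satellite_window
            and count < min_ratio * other_count
            for other_mass, other_count in peaks
        )
        if not is_satellite:
            accepted.append((mass, count))
    return accepted


# Hypothetical histogram: a large peak at 0 with small shoulders at +/- 0.02.
peaks = [(0.0, 500), (0.02, 30), (-0.02, 25), (15.995, 200)]
print(drop_satellite_peaks(peaks))
```

A standalone peak at 0.02 with no large neighbor would pass this filter, which matches the distinction asked about above between a big peak at 0.02 and a small satellite of the peak near 0.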

leahvschaffer commented 4 years ago

I added an option to break bad EE connections.