Closed jspaezp closed 5 months ago
The prediction model is strikingly good - even for decoys (this is just a small test ddaPASEF file) ... is this expected behavior?
Side note: the features you've engineered seem to do a better job at RT prediction on a few small datasets. It seems worth testing on a larger set and seeing if we can unify the RT/IM prediction modules.
Seems like m/z is driving most of it - should we be predicting CCS instead? For either IM or CCS prediction, I think what we need to predict is the residual vs the m/z trend
`ook0 ~ mz + 1/charge`; I would assume that the target-decoy pairs would have very close 1/k0 values, unless the decoys are outrageously different or are assigned to a different charge (which happens very rarely, since the windows allowed for fragmentation are already constrained, and most experiments will have constraints on charge state for the selection for fragmentation). If the performance gain is not enough to be worth adding to the project, I think it is fine to remove it for the sake of simplicity in maintenance (but I would really like to keep the reporting of the mobility values, which I think will be required for any good LFQ implementation, which, spoiler alert ... I am also working on).
This is a plot from the Tenzer lab showing the precursor intensity (top cloud is z=1, second cloud is mostly z=2, and that tail to the right is mostly z=3+). Fig. 3, https://doi.org/10.1021/acs.jproteome.0c00962
`AttributeModel`??) that implements `embed`, `_embed`, and `predict`, with the default implementation of `_embed` being what it is right now for the ims model, and the rt model just popping the charge features out of the vec (although it should not really matter; the model should just set that coefficient to ~0).
LMK what you think!
(PS: thanks for the project! This has been amazing for my Rust learning `(Journey)|(Struggle)`)
In that case, let's continue reporting IM then - I think we can (and should) try improving the model to see if it works better. Once we've tried that, we can attempt to unify the models (or at the very least, feature generation).
I did some more reading on CCS prediction (I know very little about IM/CCS ☹️) and think it might be worth trying a two pass approach.
Pass one: we fit a linear regression to IM ~ m/z (or CCS ~ m/z).
Pass two: we fit the full-featured model to the residual (the vertical difference between the observed IM and the IM predicted from the ion's m/z). This models the peptide sequence's contribution to IM independent of its m/z, which is what should improve discriminative power.
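To make the two-pass idea concrete, here is a minimal sketch in plain Python. It is not the sage implementation: the data are made up, `seq_feature` is a hypothetical stand-in for whatever sequence-derived features the full model would use, and pass two is reduced to a one-feature linear fit purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data (made up for illustration, not real measurements).
mz = [400.0, 500.0, 600.0, 700.0, 800.0]
seq_feature = [0.0, 1.0, 0.0, 1.0, 0.0]  # hypothetical sequence-derived feature
observed_im = [0.90, 1.02, 1.10, 1.22, 1.30]  # observed 1/K0

# Pass one: global trend IM ~ m/z.
trend_slope, trend_intercept = fit_line(mz, observed_im)
trend = [trend_slope * x + trend_intercept for x in mz]

# Pass two: model only what the m/z trend leaves unexplained.
residual = [obs - t for obs, t in zip(observed_im, trend)]
resid_slope, resid_intercept = fit_line(seq_feature, residual)

# Final prediction = m/z trend + predicted residual.
predicted_im = [
    t + resid_slope * f + resid_intercept for t, f in zip(trend, seq_feature)
]
```

The point of the split is that the second model never sees the m/z trend, so any score derived from its error reflects the sequence-specific part of the mobility rather than the mass.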
Inspired by: https://github.com/theGreatHerrLebert/ionmob#getting-insight-into-driving-factors-of-ccs
Just as a reference, it seems like the upper theoretical bound of prediction accuracy would be r2=0.998[ccs] (measurement error), per Fig. 3 in: https://doi.org/10.1038/s41467-021-21352-8
Complex models (BiLSTM in that paper) achieve r2=0.992[ccs], and LightGBM with VERY SIMPLE features (https://github.com/TalusBio/flimsay) achieves r2=0.9908[ccs] (0.9845[1/k0]; 0.987948[1/k0] with all the features I could think of :P).
I tried the "boosted" model and it does not seem to dramatically improve anything on my data: https://github.com/jspaezp/sage/commit/193bbbdf1fcdc7db230d5e3313041c8310cbfe11
According to doi: 10.1074/mcp.TIR118.000900
Which in my attempt to do some math ...
I have yet to figure out the dimensional analysis that would simplify to a CCS whose units are $\mathring{\mathrm{A}}^2$ ($1/k_0$ is in $\mathrm{V\,s\,cm^{-2}}$) ...
BUT it would be a pretty direct equivalence to a scaled version of $\frac{1}{k_0} \cdot \text{charge} \cdot \frac{1}{\sqrt{\text{mass}}}$, so we could try to predict that instead.
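For reference, a hedged sketch of the Mason-Schamp conversion, with the prefactor built from physical constants so the units can be followed step by step. The function name, the default drift gas (N2, 28.0134 Da) and the default temperature (305 K) are assumptions for illustration, not values taken from sage or from the paper above.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19  # elementary charge, C
DALTON = 1.66053906660e-27  # atomic mass unit, kg
N0 = 2.686780111e25         # Loschmidt constant, m^-3 (273.15 K, 101.325 kPa)

# Units: sqrt(1 / (kg * J)) = s / (kg * m), and e / (N0 * K0) = J * s * m,
# so the product is J * s^2 / kg = m^2.
# 1e4 converts 1/K0 from V s cm^-2 to V s m^-2; 1e20 converts m^2 to A^2.
PREFACTOR = (
    (3.0 / 16.0)
    * math.sqrt(2.0 * math.pi / (K_B * DALTON))
    * E_CHARGE / N0
    * 1e4 * 1e20
)


def one_over_k0_to_ccs(one_over_k0, mz, charge, gas_mass=28.0134, temperature_k=305.0):
    """Convert 1/K0 (V s cm^-2) to CCS (A^2) via the Mason-Schamp equation.

    gas_mass (Da) and temperature_k (K) are assumed defaults and should be
    set to match the instrument.
    """
    ion_mass = mz * charge  # mass of the ion in Da
    reduced_mass = ion_mass * gas_mass / (ion_mass + gas_mass)
    return PREFACTOR * charge * one_over_k0 / math.sqrt(reduced_mass * temperature_k)


ccs = one_over_k0_to_ccs(1.0, 500.0, 2)
```

With these defaults the prefactor comes out near 18510, and a 2+ ion at m/z 500 with 1/K0 = 1.0 lands at roughly 406 Å². For peptide-sized ions the reduced mass is close to the gas mass, so the converted value is nearly a charge-scaled 1/K0, which may be part of why predicting CCS instead of 1/K0 changes so little.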
also ... based on this figure I am not sure if our residuals off the prediction of mz + charge will be all that much better.
Interesting that there isn't that much improvement - idk how much CCS/IM prediction should affect rescoring though, so maybe it's OK?
56392af6ff48797295ad8cb38296c2f238420c3d <- tried the CCS conversion and it seems to do a hair worse than the raw 1/k0.
My conclusions atm:
Future direction:
thanks a lot for the feedback!
I think we should keep it - I can mess around with unification of feature generation too.
I want to run a couple more tests, then I will merge!
Hey guys,
I just came across this feature request and think it's a really cool idea! :) I wanted to quickly share some information that might help you further improve it. I’ll also discuss what to expect from re-scoring with CCS (Collision Cross Section) features, as we've put a lot of work into this over the past two years.
While it may seem obvious, it's worth remembering that CCS is indirectly calculated from measured raw 1/K0 values, and for that you always need to know the charge state of the ion. This is just a practical point.
When translating 1/K0 to CCS using the Mason-Schamp equation, consider these limitations. Firstly, this equation is only effective for low-field devices. Secondly, the setup of the ion mobility (IM) device, such as the TIMS (Trapped Ion Mobility Spectrometry), significantly influences the accuracy of converting inverse mobilities to CCS for singly charged ions. The default settings might not be ideal for accurate results, and CCS values are often incorrect in training data available online.
The Mason-Schamp equation also factors in the drift gas mass, which varies between devices and should be an input argument in your translation functions. Gas pressure and temperature of the drift gas, typically not controlled during the experiment, also play a role. The default values from our repository align with our lab conditions, so recalibration of predicted CCS values is necessary for practical application. Here, SAGE could be particularly useful in recalibration using high-confidence identified peptides.
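One way the recalibration could look, as a rough sketch only: fit a linear map from predicted to observed CCS using just the high-confidence identifications, then apply it to every prediction. The dictionary keys (`q_value`, `predicted_ccs`, `observed_ccs`), the 1% cutoff and the toy numbers are all hypothetical and are not sage's actual output fields.

```python
def fit_recalibration(psms, max_q_value=0.01):
    """Least-squares fit of observed ~ slope * predicted + intercept,
    using only PSMs at or below the q-value cutoff."""
    confident = [p for p in psms if p["q_value"] <= max_q_value]
    xs = [p["predicted_ccs"] for p in confident]
    ys = [p["observed_ccs"] for p in confident]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def recalibrate(predicted_ccs, slope, intercept):
    """Map a raw predicted CCS onto this run's observed scale."""
    return slope * predicted_ccs + intercept


# Toy PSMs (made up); the last one is low confidence and is ignored by the fit.
psms = [
    {"q_value": 0.001, "predicted_ccs": 400.0, "observed_ccs": 413.0},
    {"q_value": 0.002, "predicted_ccs": 450.0, "observed_ccs": 464.0},
    {"q_value": 0.005, "predicted_ccs": 500.0, "observed_ccs": 515.0},
    {"q_value": 0.008, "predicted_ccs": 550.0, "observed_ccs": 566.0},
    {"q_value": 0.200, "predicted_ccs": 600.0, "observed_ccs": 300.0},
]
slope, intercept = fit_recalibration(psms)
```

A linear map is the simplest choice here; whether it is enough to absorb differences in gas, pressure and temperature between labs is an open question that would need checking on real data.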
We've also concluded that CCS values don't significantly enhance re-scoring, likely because IM, unlike retention time, is highly correlated with the mass of the ion. However, significant improvements in re-scoring for singly charged ions were observed in immunopeptidomics, indicating that CCS's utility in re-scoring depends on the sample and acquisition context.
I hope this information is helpful.
Best,
David
This PR adds two main things.
Discussed here: https://github.com/lazear/sage/issues/73 LMK what you think! Best -Sebastian