lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Feature: ims model #98

Closed jspaezp closed 5 months ago

jspaezp commented 8 months ago

This PR adds two main things.

  1. Reading 1/k0 values from timsTOF (.d/*.tdf) and .mzML files (when available), and reporting them.
  2. Modeling 1/k0 for rescoring. The model is closely based on the retention time model already in use, with some "minor" feature engineering.
    • Some of the prediction logic has also been changed so that the mobility model can fail independently of the RT alignment (failed IMS predictions don't break alignment).
    • It is worth noting that, at least for tryptic peptides, this rescoring only increases the number of peptides identified at q<0.01 by ~1% for most of my data.

Discussed here: https://github.com/lazear/sage/issues/73 LMK what you think! Best -Sebastian

lazear commented 8 months ago

The prediction model is strikingly good - even for decoys (this is just a small test ddaPASEF file) ... is this expected behavior?

Side note: the features you've engineered seem to do a better job at RT prediction on a few small datasets. It seems worth testing on a larger set to see if we can unify the RT/IM prediction modules.

image

Seems like m/z is driving most of it - should we be predicting CCS instead? For either IM or CCS prediction, I think what we need to predict is the residual vs the m/z trend

image
jspaezp commented 8 months ago
  1. To some degree this is expected. Since (1) the TIMS dimension does not have that high a peak capacity/resolution and (2) there is a very tight correlation ook0 ~ mz + 1/charge, I would assume that target-decoy pairs have very close 1/k0 values, unless the decoys are outrageously different or are assigned to a different charge (which happens very rarely, since the windows allowed for fragmentation are already constrained, and most experiments have constraints on the charge states selected for fragmentation). If the performance gain is not enough to be worth adding to the project, I think it is fine to remove it for the sake of simplicity in maintenance. (But I would really like to keep the reporting of the mobility values, which I think will be required for any good LFQ implementation, which, spoiler alert ... I am also working on.)

This is a plot from the Tenzer lab showing the precursor intensity (top cloud is z=1, second cloud is mostly z=2, and the tail to the right is mostly z=3+): Fig. 3 of https://doi.org/10.1021/acs.jproteome.0c00962

  2. I do like the idea of unifying the models if both can make use of the same features (I am assuming we would ignore charge-related features in the RT model).
    • On the implementation side of things, I was thinking of a trait (AttributeModel ??) that implements embed, _embed and predict, with the default implementation of _embed being what it is right now for the IMS model, and the RT model just popping the charge features out of the vec (although it should not really matter; the model should just set those coefficients to ~0).
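
A minimal sketch of what such a trait could look like. Everything here is hypothetical: the `AttributeModel` name comes from the comment above, but the feature layout (residue counts, length, two charge features), the struct names, and the linear `predict` are illustrative assumptions and not the code in this PR.

```rust
/// Number of trailing charge-derived features in the shared embedding (assumed layout).
const N_CHARGE_FEATURES: usize = 2;

/// Hypothetical trait shared by the RT and ion mobility models.
pub trait AttributeModel {
    /// Fitted linear coefficients, one per feature returned by `embed`.
    fn coefficients(&self) -> &[f64];

    /// Shared featurization: 26 residue counts, length, charge, 1/charge.
    /// Assumes `charge >= 1` and an unmodified uppercase sequence.
    fn _embed(&self, sequence: &str, charge: u8) -> Vec<f64> {
        let mut features = vec![0.0; 26];
        for b in sequence.bytes().filter(|b| b.is_ascii_uppercase()) {
            features[(b - b'A') as usize] += 1.0;
        }
        features.push(sequence.len() as f64);
        features.push(f64::from(charge));
        features.push(1.0 / f64::from(charge));
        features
    }

    /// Features actually used by the model; defaults to the shared embedding.
    fn embed(&self, sequence: &str, charge: u8) -> Vec<f64> {
        self._embed(sequence, charge)
    }

    /// Linear prediction: dot product of features and coefficients.
    fn predict(&self, sequence: &str, charge: u8) -> f64 {
        self.embed(sequence, charge)
            .iter()
            .zip(self.coefficients())
            .map(|(x, w)| x * w)
            .sum()
    }
}

pub struct MobilityModel {
    pub coefficients: Vec<f64>,
}

pub struct RetentionTimeModel {
    pub coefficients: Vec<f64>,
}

impl AttributeModel for MobilityModel {
    fn coefficients(&self) -> &[f64] {
        &self.coefficients
    }
}

impl AttributeModel for RetentionTimeModel {
    fn coefficients(&self) -> &[f64] {
        &self.coefficients
    }

    /// The RT model drops the trailing charge features from the shared embedding.
    fn embed(&self, sequence: &str, charge: u8) -> Vec<f64> {
        let mut features = self._embed(sequence, charge);
        features.truncate(features.len() - N_CHARGE_FEATURES);
        features
    }
}
```

With this layout, `MobilityModel` sees 29 features and `RetentionTimeModel` sees 27. Overriding only `embed` keeps feature generation in one place, which is the unification being discussed.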

LMK what you think! (ps: thanks for the project! this has been amazing for my rust learning (Journey)|(Struggle))

lazear commented 8 months ago

In that case, let's continue reporting IM then - I think we can (and should) try improving the model to see if it works better. Once we've tried that, we can attempt to unify the models (or at the very least, feature generation).

I did some more reading on CCS prediction (I know very little about IM/CCS ☹️) and think it might be worth trying a two pass approach.

Pass one: we fit a linear regression to IM ~ m/z (or CCS ~ m/z).

Pass two: we fit the full-featured model to the residual (the vertical difference between the observed IM and the IM predicted from the ion's m/z). This models the peptide sequence's contribution to IM independent of its m/z, which is what should improve discriminative power.

Inspired by: https://github.com/theGreatHerrLebert/ionmob#getting-insight-into-driving-factors-of-ccs image
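
A rough sketch of pass one, assuming plain ordinary least squares on a single predictor. The function names are made up for illustration; pass two (fitting the full model to the returned residuals) is not shown.

```rust
/// Ordinary least squares fit of `y ≈ slope * x + intercept`.
/// Returns `None` for mismatched lengths, fewer than two points, or constant `x`.
pub fn fit_line(x: &[f64], y: &[f64]) -> Option<(f64, f64)> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let sxx: f64 = x.iter().map(|xi| (xi - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        return None;
    }
    let sxy: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - mean_x) * (yi - mean_y))
        .sum();
    let slope = sxy / sxx;
    Some((slope, mean_y - slope * mean_x))
}

/// Pass one: fit IM ~ m/z and return observed minus predicted IM for each ion.
/// These residuals would be the regression target for the full model in pass two.
pub fn mobility_residuals(mz: &[f64], im: &[f64]) -> Option<Vec<f64>> {
    let (slope, intercept) = fit_line(mz, im)?;
    Some(
        mz.iter()
            .zip(im)
            .map(|(m, i)| i - (slope * m + intercept))
            .collect(),
    )
}
```

Since the IM vs m/z trend differs by charge state, it would probably make sense to fit the line separately per charge, but that is a guess rather than something tested here.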

jspaezp commented 8 months ago

Improving the model:

Just as a reference, it seems like the theoretical upper bound on prediction accuracy would be r2=0.998[ccs] (measurement error), per Fig. 3 in: https://doi.org/10.1038/s41467-021-21352-8

Complex models (a BiLSTM in that paper) achieve r2=0.992[ccs], and LightGBM with VERY SIMPLE features (https://github.com/TalusBio/flimsay) achieves r2=0.9908[ccs] (0.9845[1/k0]; 0.987948[1/k0] with all the features I could think of :P).

I tried the "boosted" model and it does not seem to dramatically improve anything on my data: https://github.com/jspaezp/sage/commit/193bbbdf1fcdc7db230d5e3313041c8310cbfe11

Calculating CCS:

According to doi: 10.1074/mcp.TIR118.000900

image

Which in my attempt to do some math ... image

I still have to figure out the dimensional analysis that would simplify to a CCS whose units are $\text{Å}^2$ (with $1/k_0$ in $\mathrm{V\,s\,cm^{-2}}$) ...

BUT it would be a pretty direct equivalence to a scaled version of $\frac{1}{k_0} \cdot \mathrm{charge} \cdot \frac{1}{\sqrt{\mathrm{mass}}}$, so we could try to predict that instead.
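
For reference, the Mason-Schamp relation in its usual textbook form (written from general knowledge, not transcribed from the images above) is:

$$
\mathrm{CCS} = \frac{3}{16}\sqrt{\frac{2\pi}{\mu\, k_B T}}\;\frac{z e}{N_0 K_0},
\qquad
\mu = \frac{m_\mathrm{ion}\, m_\mathrm{gas}}{m_\mathrm{ion} + m_\mathrm{gas}}
$$

Here $N_0$ is the number density of the drift gas and $\mu$ is the reduced mass. In SI units the dimensions work out as

$$
\frac{\mathrm{s}}{\mathrm{kg\,m}} \cdot \mathrm{C} \cdot \mathrm{m^{3}} \cdot \frac{\mathrm{V\,s}}{\mathrm{m^{2}}}
= \frac{\mathrm{C\,V\,s^{2}}}{\mathrm{kg}}
= \frac{\mathrm{J\,s^{2}}}{\mathrm{kg}}
= \mathrm{m^{2}},
$$

using $\sqrt{1/(\mathrm{kg\,J})} = \mathrm{s\,kg^{-1}\,m^{-1}}$ and $\mathrm{C\,V} = \mathrm{J}$. Converting to $\text{Å}^2$ is then a factor of $10^{20}$, and going from $1/K_0$ in $\mathrm{V\,s\,cm^{-2}}$ to $\mathrm{V\,s\,m^{-2}}$ is a factor of $10^{4}$. For a fixed gas and temperature this is indeed proportional to $z / (K_0 \sqrt{\mu})$, with the reduced mass $\mu$ in place of the ion mass.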

also ... based on this figure I am not sure if our residuals off the prediction of mz + charge will be all that much better.

image
lazear commented 8 months ago

https://github.com/theGreatHerrLebert/ionmob/blob/2af2a270957ab4363567b978b48161b256e86980/ionmob/utilities/chemistry.py#L108 has a simplified version!
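
A Rust sketch of the same kind of conversion, for comparison. This is not a port of the linked function: the prefactor is derived here from CODATA constants, the function names are hypothetical, and taking the ion mass as `mz * charge` is a simplifying assumption.

```rust
/// Prefactor of the Mason-Schamp equation for CCS in Å², 1/K0 in V·s/cm²,
/// masses in Da and temperature in K. Evaluates to roughly 18509.86.
pub fn mason_schamp_constant() -> f64 {
    const BOLTZMANN: f64 = 1.380_649e-23; // J/K
    const DALTON: f64 = 1.660_539_066_60e-27; // kg
    const ELEMENTARY_CHARGE: f64 = 1.602_176_634e-19; // C
    const LOSCHMIDT: f64 = 2.686_780_111e25; // m^-3 at 273.15 K, 101.325 kPa
    let si = 3.0 / 16.0
        * (2.0 * std::f64::consts::PI / (BOLTZMANN * DALTON)).sqrt()
        * ELEMENTARY_CHARGE
        / LOSCHMIDT;
    // 1e4 converts 1/K0 from V·s/cm² to V·s/m², 1e20 converts m² to Å².
    si * 1e4 * 1e20
}

/// Convert a reduced inverse mobility (1/K0) to a CCS estimate in Å².
/// `gas_mass` is the drift gas mass in Da (about 28.013 for N2) and
/// `temperature_k` the gas temperature in kelvin; both depend on the instrument.
pub fn one_over_k0_to_ccs(
    one_over_k0: f64,
    mz: f64,
    charge: u8,
    gas_mass: f64,
    temperature_k: f64,
) -> f64 {
    let z = f64::from(charge);
    // Assumption: ion mass approximated as m/z * z, ignoring the charge carriers.
    let ion_mass = mz * z;
    let reduced_mass = ion_mass * gas_mass / (ion_mass + gas_mass);
    mason_schamp_constant() * z * one_over_k0 / (reduced_mass * temperature_k).sqrt()
}
```

For example, `one_over_k0_to_ccs(1.0, 500.0, 2, 28.013, 305.0)` gives a value a little above 400 Å². Keeping the gas mass and temperature as explicit arguments avoids baking one lab's conditions into the conversion.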

lazear commented 8 months ago

Interesting that there isn't that much improvement - idk how much CCS/IM prediction should affect rescoring though, so maybe it's OK?

jspaezp commented 8 months ago

56392af6ff48797295ad8cb38296c2f238420c3d <- tried the CCS conversion and it seems to do a hair worse than the raw 1/k0.

My conclusions atm:

  1. At least for closed searches of tryptic peptides only, IMS prediction in general has little effect on rescoring.
  2. There is also little difference between predicting CCS and predicting 1/k0.

Future direction:

  1. Decide whether to keep the model or not. It could be kept just in case it is good in some scenario (maybe immunopeptidomics) ... it might also not change the number of IDs much but might shift a few at the edge to be accurate ones (we could try an entrapment experiment, not sure it is worth the time RN ...).
    • If we keep it, is there going to be a unification of the feature generation?
    • If not, do you want a PR adding the "new" features to the RT model? I could also just make a very slim PR adding the storing/reporting of the ion mobilities (which would be critical for the future addition of LFQ).

thanks a lot for the feedback!

lazear commented 8 months ago

I think we should keep it - I can mess around with unification of feature generation too.

I want to run a couple more tests, then I will merge!

theGreatHerrLebert commented 7 months ago

Hey guys,

I just came across this feature request and think it's a really cool idea! :) I wanted to quickly share some information that might help you further improve it. I’ll also discuss what to expect from re-scoring with CCS (Collision Cross Section) features, as we've put a lot of work into this over the past two years.

Understanding CCS vs. 1/K0

While it may seem obvious, it's worth remembering that CCS is calculated indirectly from the measured raw 1/K0 values, and for that you always need to know the charge state of the ion. This is just a practical point.

Limitations of the Mason-Schamp Equation

When translating 1/K0 to CCS using the Mason-Schamp equation, consider these limitations. First, the equation is only valid for low-field devices. Second, the setup of the ion mobility (IM) device, such as a TIMS (Trapped Ion Mobility Spectrometry) device, significantly influences the accuracy of converting inverse mobilities to CCS for singly charged ions. The default settings might not be ideal for accurate results, and CCS values are often incorrect in training data available online.

Factors Influencing CCS Translation with the Mason-Schamp Equation

The Mason-Schamp equation also factors in the drift gas mass, which varies between devices and should be an input argument in your translation functions. Gas pressure and temperature of the drift gas, typically not controlled during the experiment, also play a role. The default values from our repository align with our lab conditions, so recalibration of predicted CCS values is necessary for practical application. Here, SAGE could be particularly useful in recalibration using high-confidence identified peptides.
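
One way such a recalibration could be sketched: take the median ratio of observed to predicted CCS over high-confidence PSMs and use it as a scale factor. The `Psm` struct, its field names, and the choice of a single multiplicative factor are assumptions made for illustration, not Sage's data model or a validated procedure.

```rust
/// Hypothetical PSM record; the fields are illustrative only.
pub struct Psm {
    pub q_value: f64,
    pub observed_ccs: f64,
    pub predicted_ccs: f64,
}

/// Median of a slice (sorted in place); `None` if the slice is empty.
fn median(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    Some(if values.len() % 2 == 0 {
        (values[mid - 1] + values[mid]) / 2.0
    } else {
        values[mid]
    })
}

/// Median observed/predicted CCS ratio over PSMs with `q_value <= max_q`.
/// Multiply predicted CCS values by this factor to recalibrate them.
pub fn ccs_scale_factor(psms: &[Psm], max_q: f64) -> Option<f64> {
    let mut ratios: Vec<f64> = psms
        .iter()
        .filter(|p| p.q_value <= max_q && p.predicted_ccs > 0.0)
        .map(|p| p.observed_ccs / p.predicted_ccs)
        .collect();
    median(&mut ratios)
}
```

A median is used so that a few wrong identifications among the selected PSMs do not skew the factor. A linear or per-charge fit might be more appropriate in practice.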

CCS Values in Re-scoring

We've also concluded that CCS values don't significantly enhance re-scoring, likely because IM, unlike retention time, is highly correlated with the mass of the ion. However, significant improvements in re-scoring for singly charged ions were observed in immunopeptidomics, indicating that the utility of CCS in re-scoring depends on the sample and acquisition context.

I hope this information is helpful.

Best,

David