lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
201 stars 38 forks source link

Decharged precursors and tolerance #132

Open glormph opened 2 months ago

glormph commented 2 months ago

Hi, When configuring the precursor tolerance in ppm, Sage seems to report the precursor as a decharged mass, i.e. an 800 experimental m/z, charge 2 peptide is reported as 1600 in the output. Calculated peptide mass, and precursor error in ppm follow this behaviour as well.

When comparing with my existing workflow using MSGF, this gives very different precursor errors, depending on charge state and size. Should the "precursor_ppm" field instead be reported in m/z? As an example, I found a peptide AAAFEEQENETVVVK:

method expmass calc mass precursor_ppm peptide mass
decharged 1662.8123 1662.8099 1.3948283 1662.8101236
m/z 832.4153 832.4123 3.5195148 1662.8101236

Another question I'm wondering about, is there a reason why the precursor errors are always positive? If they are just absoluted, it could be good to keep the sign instead to show instrument drift in some direction.

lazear commented 2 months ago

is there a reason why the precursor errors are always positive?

Fragment errors are reported as the intensity-weighted average of the absolute mass errors between theoretical and experimental MS2 peaks, so I think I had precursor_ppm also be absoluted (and for use as a feature in LDA). I'm not opposed to reporting precursor_ppm as the non-absoluted value.

If they are just absoluted, it could be good to keep the sign instead to show instrument drift in some direction

For tracking mass errors over time (or absolute mass error in Da, for use in open searches - which is a primary use case for Sage), you can always manually calculate ppm from the provided fields.

I'm not too worried about including too many theoretical peptides to find each solution

It's a valid point - calculating error on the de-charged mass simplifies open-search tolerances (which are calculated in Da, not m/z). There may be some edge cases where the best match is outside of the tolerance when calculated on de-charged precursors, but inside of the tolerance on m/z level. If you're willing to investigate, I can give some pointers on how/where to change some code.

glormph commented 2 months ago

Ah, I have indeed only ran this on a narrow 10ppm tolerance search, and had not really considered how to run open yet.

We've looked at some peptides in an MSGF - Sage comparison, and there's some that are unique to either engine. Except the unique to MSGF ones have a gaussian distribution of the prec. error (tapering off towards the tail end), while the Sage-unique peptides are more "a big peak in the middle and then a flat tail", i.e the tail may be quite low but doesn't taper to the end.

As the a tolerance/error in ppm for a PSM is sometimes larger and sometimes smaller than the respective tolerance or error calculated using m/z (depending on mz, charge, peptide mass etc), my theory is that in a lot of those tail end cases the tolerance in Sage is "smaller", and the better match has not been included. Conversely, if the tolerance is bigger it'd only include some more peptides which are not going to be the top match, especially since there's also the fragment index match. But I'd have to do some more diving to see if that is correct.

Anyway, hats off to this project, and very nice to see a search finish this fast :)

lazear commented 2 months ago

I shouldn't have responded before having coffee - the calculation for ppm is actually charge state independent.

2E-6 (m1 - m2) / (m1 + m2) is the same as 2E-6 (m1/z - m2/z) / (m1/z + m2/z) 😄

glormph commented 2 months ago

I thought the mass error calc was 1E+6 * (observed_mz - theoretical_mz) / theoretical_mz when using m/z, and had assumed when using mass, it would be: 1E+6 * (m/z * ch - theoretical_mass) / theoretical_mass.

Lots of editing here :D. I'm a bit confused, but the two methods (2e6 "avg mass", or 1e6) seem to have negligable difference. It could also be dependent on how the precursor m/z is read from the mzML, from selectedIon or e.g.

<userParam name="[Thermo Trailer Extra]Monoisotopic M/Z:" value="563.27482031738577" type="xsd:float"/>
glormph commented 2 months ago

I can also think of another case for not absoluting the mass error: if the instrument has drifted and all your good peptides sit at +3 ppm, a -2 ppm peptide will be a lot worse than a +2 ppm one, which would affect e.g. downstream rescoring with percolator.

lazear commented 2 months ago

I thought the mass error calc was 1E+6 (observed_mz - theoretical_mz) / theoretical_mz when using m/z, and had assumed when using mass, it would be: 1E+6 (m/z * ch - theoretical_mass) / theoretical_mass.

Same property holds for this one! The two equations should give identical results, since we hold z constant for both the observed and theoretical peptides. Using the "avg mass" equation, ppm calculations are symmetric (i.e. agnostic to which mass is the "theoretical" or "observed" one).

During actual candidate selection, Sage applies tolerances to the observed precursor mass, not the theoretical mass (this makes more sense in an open-search/chemical modification context) - so using the "avg mass"/symmetric equation makes sense in this context.

glormph commented 1 month ago

Yes that is true, identical results for both those, and I think I understand the tolerance/error calculations better now, thank you :).

To me what remains of this issue is