lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Same dataset different results #54

Closed tiagosobreira closed 1 year ago

tiagosobreira commented 1 year ago

Hi Michael,

Thank you very much for this great tool!

I ran Sage twice using the same parameters on the same dataset and it gave me different results each time. Would you be able to explain why this occurs?

Thank you, Tiago

lazear commented 1 year ago

Hi Tiago,

Can you provide some more information?

How are you determining the results are different? How different are they?

tiagosobreira commented 1 year ago

Hi Michael,

Sure!

Version: sage-v0.10.0-x86_64-unknown-linux-gnu

It is one TMTpro sample collected on an Eclipse with FAIMS in three fractions (a total of three files)

Here is the config file: config.txt

I compared the spectrum_fdr from both results files.

[Image: spectrum_fdr_correlaction_between_two_runs]

Here is one example: The spectrum 36109 has FDR 0.36757833 on the first run and 0.0017423321 on the second run.

30% of the spectra have an FDR difference > 0.1 between both runs.

Is this difference expected?

Thank you again, Tiago
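
For readers who want to repeat this kind of comparison, here is a minimal sketch of pairing spectra between two runs by an explicit key instead of by row order. It assumes the results are tab-separated and contain `filename`, `scannr` and `spectrum_fdr` columns (only `spectrum_fdr` is named in this thread; the other two column names, and both function names, are assumptions for illustration). Keeping the lowest FDR per spectrum is also just an illustrative choice for files that report more than one PSM per spectrum.

```python
import csv
import io


def load_fdr(tsv_text, key_cols=("filename", "scannr"), fdr_col="spectrum_fdr"):
    """Map (filename, scannr) -> lowest FDR reported for that spectrum.

    The column names are assumptions; adjust them to the actual header.
    """
    fdr = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = tuple(row[c] for c in key_cols)
        value = float(row[fdr_col])
        # Keep one value per spectrum even if several PSMs are reported.
        fdr[key] = min(value, fdr.get(key, value))
    return fdr


def compare_runs(run_a, run_b, threshold=0.1):
    """Pair spectra by key (never by row position) and summarise differences."""
    shared = run_a.keys() & run_b.keys()
    n_over = sum(abs(run_a[k] - run_b[k]) > threshold for k in shared)
    return {
        "shared": len(shared),
        "only_a": len(run_a) - len(shared),
        "only_b": len(run_b) - len(shared),
        "frac_over": n_over / len(shared) if shared else 0.0,
    }
```

To use it, read each run's results file into a string, pass both through `load_fdr`, and hand the two dictionaries to `compare_runs`. Large `only_a` or `only_b` counts are a hint that the pairing key, not the search, is the problem.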

lazear commented 1 year ago

There is definitely something wonky going on here! There are occasionally very minor differences in FDR/discriminant scores just due to floating point rounding/numerical instability (e.g. maybe a handful of PSMs are re-ranked on very large searches) but this is totally out of whack.
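
As a general illustration of the floating point effect mentioned above (not a description of Sage's internals, which this thread does not detail): floating point addition is not associative, so summing the same numbers in a different order can change the last bits of the result. The helper name `left_to_right` is hypothetical.

```python
import math


def left_to_right(xs):
    """Plain running sum, accumulating in the order the values arrive."""
    total = 0.0
    for x in xs:
        total += x
    return total


xs = [0.1, 0.2, 0.3]
forward = left_to_right(xs)             # ((0.1 + 0.2) + 0.3)
backward = left_to_right(reversed(xs))  # ((0.3 + 0.2) + 0.1)

# The two sums agree to within rounding error but are not bit-identical.
print(forward == backward, math.isclose(forward, backward))

# math.fsum tracks the lost low-order bits, so its result does not depend
# on the order of the inputs.
print(math.fsum(xs) == math.fsum(list(reversed(xs))))
```

Differences of this size can flip the order of two nearly tied scores, which fits the "handful of PSMs re-ranked" scale described above, but not FDR shifts of 0.1 or more.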

I just confirmed again that running sage-v0.10.0-x86_64-unknown-linux-gnu on the PXD003881 dataset (20 files) had identical results across 900k PSMs. I also previously tested for reproducibility on a 250-file TMT16 dataset before releasing v0.10.

Your parameters look good overall, but there is one thing sticking out:

    "precursor_tol": {
        "ppm": [-100,500]
    }

Tolerances in Sage are specified in the reverse order from most other engines: they are applied to the observed/experimental mass, and not the theoretical one. E.g. for an open search you would specify (-500, 100): an experimental mass of 2500 carrying a 500 Da unknown mod matches a theoretical mass of 2500 - 500 = 2000.
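
A rough sketch of that convention, based only on the description above and not on Sage's source code: the window is built around the experimental mass, and a candidate matches if its theoretical mass falls inside it. The function names are hypothetical, and treating (-500, 100) as Da is an assumption drawn from the "500 Da unknown mod" wording.

```python
def precursor_window(exp_mass, lo, hi, unit="da"):
    """Return the (low, high) theoretical-mass window around an experimental mass.

    lo and hi are signed offsets applied to the experimental mass,
    either in Da or in parts per million.
    """
    if unit == "da":
        return exp_mass + lo, exp_mass + hi
    if unit == "ppm":
        return exp_mass * (1 + lo / 1e6), exp_mass * (1 + hi / 1e6)
    raise ValueError(f"unknown unit: {unit!r}")


def matches(theoretical, exp_mass, lo, hi, unit="da"):
    """True if the theoretical mass lies inside the window."""
    low, high = precursor_window(exp_mass, lo, hi, unit)
    return low <= theoretical <= high


# Open-search example from the reply: unmodified peptide at 2000 Da,
# observed at 2500 Da because of an unknown +500 Da modification.
print(matches(2000.0, 2500.0, -500, 100))  # window is 2000..2600
print(matches(2000.0, 2500.0, -100, 500))  # window is 2400..3000
```

Under this reading, swapping the two bounds shifts the window to the wrong side of the experimental mass, so positive mass shifts from modifications are no longer covered.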

Assuming you can't share the files (if you can, I will debug), can we try a couple things?

If this isn't it, let's try:

Finally, please run sage like so:

$ SAGE_LOG=trace ./sage <parameters.json

This will output more information - please paste it below! This should help diagnose what's going on.

tiagosobreira commented 1 year ago

Hi Michael,

Thank you very much for your help.

This is kind of embarrassing, but I made a silly mistake pairing the spectra. It is not as precise as your data, but it seems reasonable.

[Image: spectrum_fdr_correlaction_between_two_runs1]

Reducing the number of modifications improves the correlation even more.

[Image: spectrum_fdr_correlaction_between_two_runs3]

The "predict_rt" setting seems to be fine, and it is not contributing to the variation.

The correction on how to use the "precursor_tol" made a huge difference in my analysis.

Thank you, Tiago

lazear commented 1 year ago

That's still more variation than I would expect, especially for a 3-file search.

Feel free to ask more questions if need be!