Same dataset different results

tiagosobreira commented 1 year ago

Hi Michael,

Thank you very much for this great tool!

I ran Sage twice using the same parameters on the same dataset and it gave me different results each time. Would you be able to explain why this occurs?

Thank you, Tiago

lazear commented 1 year ago

Hi Tiago,

Can you provide some more information?

- Which version of Sage are you using?
- How many files are in the dataset?
- Ideally, the exact parameters file

How are you determining the results are different? How different are they?

tiagosobreira commented 1 year ago

Hi Michael,

Sure!

Version: sage-v0.10.0-x86_64-unknown-linux-gnu

It is one TMTpro sample collected on an Eclipse with FAIMS in three fractions (a total of three files).

Here is the config file: config.txt

I compared the spectrum_fdr from both results files.

[Image: spectrum_fdr_correlaction_between_two_runs]

Here is one example: The spectrum 36109 has FDR 0.36757833 on the first run and 0.0017423321 on the second run.

30% of the spectra have an FDR difference > 0.1 between both runs.

Is this difference expected?

Thank you again, Tiago

lazear commented 1 year ago

There is definitely something wonky going on here! There are occasionally very minor differences in FDR/discriminant scores just due to floating point rounding/numerical instability (e.g. maybe a handful of PSMs are re-ranked on very large searches), but this is totally out of whack.

I just confirmed again that running sage-v0.10.0-x86_64-unknown-linux-gnu on the PXD003881 dataset (20 files) gave identical results across 900k PSMs. I also previously tested for reproducibility on a 250-file TMT16 dataset before releasing v0.10.

Your parameters look good overall, but there is one thing sticking out:

    "precursor_tol": {
        "ppm": [-100,500]
    }

Tolerances in Sage are specified in the reverse order from most other engines: they are applied to the observed/experimental mass, not the theoretical one. E.g. for an open search you would specify (-500, 100), so that an experimental mass of 2500 with a 500 Da unknown mod (2500 - 500 = 2000) still matches its theoretical peptide mass.
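
To make the convention concrete, here is a minimal Python sketch. It is not Sage's actual code, and the function names `observed_window` and `flip_convention` are hypothetical. It assumes only what is described above: the (lo, hi) window is added to the observed mass, and theoretical masses inside the resulting bounds are candidates.

```python
def observed_window(observed_mass, tol, unit="da"):
    """Return (low, high) bounds on theoretical mass for one observed mass.

    Assumption: the tolerance window is applied to the observed mass.
    """
    lo, hi = tol
    if unit == "ppm":
        # Convert relative (ppm) offsets to absolute mass offsets.
        lo = observed_mass * lo / 1e6
        hi = observed_mass * hi / 1e6
    return observed_mass + lo, observed_mass + hi


def flip_convention(tol):
    """Convert a theoretical-centered (lo, hi) window to an observed-centered one.

    Exact for Da windows; only approximate for ppm windows.
    """
    lo, hi = tol
    return -hi, -lo


# An open search written as (-100, +500) around the theoretical mass
# becomes (-500, +100) around the observed mass.
print(flip_convention((-100, 500)))          # (-500, 100)
print(observed_window(2500.0, (-500, 100)))  # (2000.0, 2600.0)
```

With (-500, 100) the peptide at theoretical mass 2000 falls inside the window for an observed mass of 2500; with (-100, 500) it would not.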

Assuming you can't share the files (if you can, I will debug), can we try a couple things?

Try searching with a smaller fasta file or with fewer variable mods. If more than 4294967295 (2^32 - 1) fragments are generated, it could cause catastrophic failures like this. I suspect you may be very close to this number given your parameters.
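
A back-of-envelope check of that limit might look like the sketch below. It is only a rough estimate under stated assumptions, not how Sage counts fragments: each peptide form of length n is assumed to contribute (n - 1) fragments per ion series, and the peptide count, mean length, and number of modified forms per peptide are hypothetical inputs.

```python
U32_MAX = 2**32 - 1  # 4294967295, the largest value a 32-bit unsigned index can hold


def estimate_fragments(n_peptides, mean_length, modforms_per_peptide=1, ion_series=2):
    """Rough estimate of the number of theoretical fragments in an index.

    Assumptions (not taken from Sage): every peptide form of length n yields
    (n - 1) fragments per ion series, e.g. b and y ions give ion_series=2.
    """
    return n_peptides * modforms_per_peptide * ion_series * (mean_length - 1)


# Hypothetical numbers: 5 million peptides, mean length 12,
# 40 modified forms per peptide on average.
total = estimate_fragments(5_000_000, 12, modforms_per_peptide=40)
print(total, "exceeds limit" if total > U32_MAX else "within limit")
```

The number of modified forms grows quickly with each extra variable mod, which is why trimming variable mods or the fasta file is the first thing to try.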

If this isn't it, let's try:

- Try searching with a fragment tolerance of (-100 to +100)
- Turn off RT matching with "predict_rt": false (this might be screwing with the rescoring)

Finally, please run sage like so:

$ SAGE_LOG=trace ./sage parameters.json

This will output more information - please paste it below! This should help diagnose what's going on.

tiagosobreira commented 1 year ago

Hi Michael,

Thank you very much for your help.

This is kind of embarrassing, but I made a silly mistake when pairing the spectra between the two runs. The agreement is not as precise as in your data, but it seems reasonable.
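
For anyone repeating this comparison, a minimal sketch of pairing spectra by key rather than by row order is shown below. The column names `filename`, `scannr`, and `spectrum_fdr` and the file paths are assumptions about the tab-separated results files; adjust them to match the actual output.

```python
import csv


def load_fdr(path, key_cols=("filename", "scannr"), fdr_col="spectrum_fdr"):
    """Map (filename, scannr) -> lowest FDR reported for that spectrum."""
    best = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = tuple(row[col] for col in key_cols)
            fdr = float(row[fdr_col])
            # Keep one value per spectrum even if several PSMs are reported.
            if key not in best or fdr < best[key]:
                best[key] = fdr
    return best


def fraction_differing(run_a, run_b, threshold=0.1):
    """Fraction of spectra present in both runs whose FDR differs by > threshold."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    n_diff = sum(abs(run_a[key] - run_b[key]) > threshold for key in shared)
    return n_diff / len(shared)


# Usage (hypothetical paths):
# run1 = load_fdr("run1/results.sage.tsv")
# run2 = load_fdr("run2/results.sage.tsv")
# print(fraction_differing(run1, run2))
```

Keying on both file name and scan identifier matters for a multi-file search, because the same scan number can appear in every fraction.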

[Image: spectrum_fdr_correlaction_between_two_runs1]

Reducing the number of modifications improves the correlation even more.

[Image: spectrum_fdr_correlaction_between_two_runs3]

The "predict_rt" setting seems to be fine; it is not contributing to the variation.

The correction on how to use the "precursor_tol" made a huge difference in my analysis.

Thank you, Tiago

lazear commented 1 year ago

That's still more variation than I would expect, especially for a 3-file search.

Feel free to ask more questions if need be!

lazear / sage

Same dataset different results #54