lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Interfacing Sage results to protein group FDR package #102

Closed MatthewThe closed 7 months ago

MatthewThe commented 7 months ago

Hi Michael,

Thanks for the great tool! I'm trying to interface the Sage output with my protein group FDR tool (https://github.com/kusterlab/picked_group_fdr), which could also provide quantification for protein groups.

The regular search result output seemed to contain all columns needed. However, if I understand it correctly, the ms1_intensity column does not correspond to the area of the elution peak but rather the MS1 intensity at the corresponding MS1 scan. Would it be somehow possible to add the XIC of the MS1 peak to this file?

Alternatively, I thought about using the LFQ output file. But from your paper, it seems that the score and q-value are purely related to the XIC and do not take PSM identification confidence into account. Is that correct? Also, is it combining intensities from different charge states? I do get a charge column in the LFQ output file, but it's all filled with -1.

Thanks, Matthew

Sage version: 0.14.4

lazear commented 7 months ago

That is correct - the ms1_intensity column is pretty useless (it is the MS1 intensity of the selected ion listed in the mzML file), and should just be removed. Adding the XIC (in isolation, per run) sounds somewhat complicated, since peak boundaries, etc. might differ between runs.

I would recommend using the LFQ output file. The score and q-value (from the LFQ file) are purely related to the XIC. They do not take PSM identification confidence into account, but they do take peptide identification into account. Only peptides with a global q-value < 0.01 (peptide_q from the main results file) will have XICs extracted - but for those peptides, the XICs will be extracted from every run. This could be an issue if you're re-scoring peptides as well - any peptide with a Sage-calculated q-value > 0.01 will have no quantification data.
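To make the consequence for re-scoring concrete, here is a minimal sketch that lists which peptides accepted by an external re-scoring step would have no quantification data. It assumes the main results file is tab-separated with `peptide` and `peptide_q` columns; the function name, the example peptides, and the `rescored_accepted` set are hypothetical, and decoys are not treated specially.

```python
import csv
import io


def quantifiable_peptides(results_tsv, threshold=0.01):
    """Collect peptides whose Sage peptide_q is below the threshold.

    Assumes a tab-separated file with `peptide` and `peptide_q` columns.
    """
    reader = csv.DictReader(results_tsv, delimiter="\t")
    return {row["peptide"] for row in reader if float(row["peptide_q"]) < threshold}


# Hypothetical two-row excerpt standing in for results.sage.tsv.
example_results = io.StringIO(
    "peptide\tpeptide_q\n"
    "PEPTIDEK\t0.001\n"
    "LESSSUREK\t0.03\n"
)
quantified = quantifiable_peptides(example_results)

# Hypothetical set of peptides accepted after external re-scoring.
rescored_accepted = {"PEPTIDEK", "LESSSUREK"}

# Peptides that pass re-scoring but were never quantified by Sage.
missing_quant = rescored_accepted - quantified
print(sorted(missing_quant))
```

Any peptide reported in `missing_quant` would need to be dropped or handled separately downstream, since no XIC was extracted for it.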

Combining intensities from different charge states is performed by default (hence -1 charge) - you can set quant.lfq_settings.combine_charge_state to false to output values for each charge state instead.
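As a minimal sketch, the setting could be switched off by editing the configuration before running Sage. This assumes the configuration is a JSON document in which the dotted path `quant.lfq_settings.combine_charge_state` maps onto nested objects; the helper name and the starting config are hypothetical.

```python
import json


def disable_charge_combining(config):
    """Return a copy of the config with per-charge-state LFQ output enabled."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    lfq_settings = updated.setdefault("quant", {}).setdefault("lfq_settings", {})
    lfq_settings["combine_charge_state"] = False
    return updated


# Hypothetical starting configuration; a real one contains many more keys.
config = {"quant": {"lfq_settings": {"combine_charge_state": True}}}
new_config = disable_charge_combining(config)
print(json.dumps(new_config, indent=2))
```

With the setting disabled, the charge column in the LFQ output should carry the actual charge state rather than -1.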

lazear commented 7 months ago

I'm going to close as completed - please reopen if something needs a fix for integration with your tool!

MatthewThe commented 7 months ago

Apologies for not getting back to you earlier. Your comments were very helpful!

I have now integrated support for Sage into my protein group FDR tool: https://github.com/kusterlab/picked_group_fdr/tree/develop/data/sage_example

As Sage didn't have a protein level output format, I allowed output in both MaxQuant (proteinGroups.txt) and FragPipe (combined_protein.tsv) format.

lazear commented 7 months ago

Excellent! Any issues with FDR control discovered?

I'll also give it a test... I have a lot of datasets to play around with :)

MatthewThe commented 7 months ago

Protein group-level FDR looked fine on a small test dataset, but I haven't done more extensive testing so far. It would be great to see how it looks on one of your datasets.

I did decide to ignore the XIC q-values in the lfq.tsv file. I used the PSM-level q-values from results.sage.tsv instead and mapped them to their corresponding precursor in lfq.tsv.
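A rough sketch of that kind of mapping is below. It is not the picked_group_fdr implementation. It assumes that the PSM-level q-value lives in a `spectrum_q` column, that both files are tab-separated with `peptide` and `charge` columns, and that taking the best (lowest) q-value per precursor is an acceptable way to summarize multiple PSMs. The function names and example rows are hypothetical.

```python
import csv
import io


def best_psm_q(results_tsv):
    """Map each precursor to the lowest PSM-level q-value among its PSMs.

    Each PSM is indexed twice: under its own charge, and under charge "-1",
    which the LFQ file uses when charge states are combined.
    """
    best = {}
    for row in csv.DictReader(results_tsv, delimiter="\t"):
        q = float(row["spectrum_q"])
        for key in ((row["peptide"], row["charge"]), (row["peptide"], "-1")):
            if q < best.get(key, float("inf")):
                best[key] = q
    return best


def annotate_lfq(lfq_tsv, best):
    """Attach the mapped PSM-level q-value to every LFQ row (None if absent)."""
    rows = []
    for row in csv.DictReader(lfq_tsv, delimiter="\t"):
        row["psm_q"] = best.get((row["peptide"], row["charge"]))
        rows.append(row)
    return rows


# Hypothetical excerpts standing in for results.sage.tsv and lfq.tsv.
results = io.StringIO(
    "peptide\tcharge\tspectrum_q\n"
    "PEPTIDEK\t2\t0.002\n"
    "PEPTIDEK\t3\t0.0005\n"
    "OTHERK\t2\t0.02\n"
)
lfq = io.StringIO(
    "peptide\tcharge\n"
    "PEPTIDEK\t-1\n"
    "OTHERK\t-1\n"
)
annotated = annotate_lfq(lfq, best_psm_q(results))
for row in annotated:
    print(row["peptide"], row["charge"], row["psm_q"])
```

Indexing under both the real charge and -1 lets the same lookup work whether or not combine_charge_state is enabled.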