pwilmart / IRS_normalization

An exploration of internal reference scaling (IRS) normalization in isobaric tagging proteomics experiments.
MIT License

IRS normalisation without imputation #5

Closed hosseinvk closed 1 year ago

hosseinvk commented 1 year ago

Hi Phillip,

I wonder if IRS normalisation is still valid on data with NAs (without imputation)? In that case, to get the geometric average intensity for each protein, the modifications below would be required:

# make new data frame with row sums from each frame
irs <- tibble(rowSums(exp1_sl, na.rm = TRUE),
              rowSums(exp2_sl, na.rm = TRUE),
              rowSums(exp3_sl, na.rm = TRUE))

# get the geometric average intensity for each protein
irs$average <- apply(irs, 1, function(x) exp(mean(log(x), na.rm = TRUE)))

Another point I'd like to raise: when a routine IRS normalisation is applied post-imputation (unlike the case above), would it still be appropriate to apply ComBat batch correction on top of the IRS-normalised data?

Thanks for your advice.

pwilmart commented 1 year ago

Hi, That is a good question. Very low abundance peptides can have some missing values across the channels and you can end up with one of the two reference channel values missing (or both missing). It is probably the right thing to average (geometric mean or arithmetic mean does not matter so much) when you have two actual measurements. When you have only one of two measurements, it is probably better to go with just the one measurement (which I think your code would do).
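To make those edge cases concrete, here is a minimal R sketch (toy values only, not the repository's notebook code) of how na.rm = TRUE behaves with two, one, or zero reference measurements:

library(tibble)

# toy reference-channel intensities for three proteins within one plex:
# row 1 has both measurements, row 2 has one, row 3 has none
ref <- tibble(ref_A = c(12000, 8000, NA),
              ref_B = c(11000,   NA, NA))

# rowSums(..., na.rm = TRUE) keeps the single value in row 2,
# but silently returns 0 for the all-NA row 3 (and log(0) is -Inf),
# so all-missing reference rows still need to be filtered or imputed
rowSums(ref, na.rm = TRUE)

# the geometric mean across plexes with na.rm = TRUE behaves the same way:
# with only one value present, it simply returns that value
geo_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
geo_mean(c(20000, NA))    # returns 20000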

In my pipeline, I test the reporter ions in each scan against a minimum value. I do a trimmed average so that an intense interfering ion or a missing value is (mostly) removed, and test that against something sensible. Different instruments in different modes can have different lower limits for reporter ion values. I use peak heights for reporter ions and I mostly process SPS-MS3 data from a Fusion. We see intensities down to about 300, then they disappear. Thermo subtracts noise levels from the FT analyzers, so signals do not go all the way down to zero. I usually test that the trimmed average is 500 or more. If the average is below 500, I zero out the reporter ions. That way these PSMs contribute to protein inference and protein IDs, but contribute nothing to the summed intensity for proteins. After summing to the protein level with a two-peptide-per-protein requirement, there are still a few proteins that end up with a zero intensity total. I do a small-value imputation for any remaining zero channel values. I use 150 with the Fusion data (something that is about half the smallest reporter ion peak height that is measured).
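A rough R sketch of that logic (the actual pipeline is separate code, so the function names, the 20% trim, and the matrix layout here are illustrative assumptions):

# zero out the reporter ions of PSMs whose trimmed average intensity is too low;
# the PSM still counts for protein inference and IDs, but adds nothing to sums
zero_low_psms <- function(reporter_mat, min_trimmed_avg = 500, trim = 0.2) {
  trimmed_avg <- apply(reporter_mat, 1,
                       function(x) mean(x, trim = trim, na.rm = TRUE))
  low <- is.na(trimmed_avg) | trimmed_avg < min_trimmed_avg   # all-missing rows count as low
  reporter_mat[low, ] <- 0
  reporter_mat
}

# after summing PSMs to proteins (two peptides per protein required),
# replace any remaining zero channel values with a small value,
# roughly half of the smallest measured reporter ion peak height
impute_small_values <- function(protein_mat, small_value = 150) {
  protein_mat[protein_mat == 0] <- small_value
  protein_mat
}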

I should do a more careful treatment of missing values in the reference channels but have not had time. Only a few very low abundance proteins would be affected. MaxQuant is not as sensitive as my pipeline and might have more missing values, so some explicit logic for the reference channels might be more important in alternative data processing.

In terms of residual batch effects that might need correction after IRS, that can happen. IRS is very targeted in that it specifically corrects labeling reaction, LC, and MS issues (basically anything downstream of when the reference channels are introduced). There can be upstream experimental steps where batch effects happen that would not be addressed by IRS. I always do Jupyter notebooks with a bunch of QC tests and visualizations for any TMT experiment to try to see how good I think the upstream steps (sample processing) were. I have a pretty good feel for how well the digestion, labeling, and LC-MS work. Every experiment has something unique about its data and that sometimes needs some tweaks to the analysis steps. Cheers, Phil
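If the QC plots do show a residual batch effect from those upstream steps, one common follow-up (not part of this repository's notebooks) is ComBat from the Bioconductor sva package, run on the log2 IRS-normalized table; the object names below are assumptions:

library(sva)     # Bioconductor package providing ComBat()
library(limma)   # plotMDS() for a quick before/after look

# data_irs: proteins x samples table after IRS (hypothetical name)
# batch:    factor giving the upstream processing batch of each sample
log2_irs   <- log2(as.matrix(data_irs))
combat_irs <- ComBat(dat = log2_irs, batch = batch)

plotMDS(log2_irs)    # clustering before batch correction
plotMDS(combat_irs)  # clustering after batch correction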


hosseinvk commented 1 year ago

Hi Phillip, Thanks for your comprehensive explanation. I have a much clearer understanding of the situation. Cheers, Hossein

Monod79 commented 1 year ago

Hi Phil,

This is an interesting question/topic. On my end I’m dealing mostly with phosphoproteomics TMT MS3 data, which involves working with phosphopeptides instead of proteins. Missing values are therefore abundant between MS runs.

In that regard, I was wondering how you address phosphoproteomics data?

Jonathan

Jonathan Boulais, PhD Analyste de données protéomiques Proteomics data analyst

Institut de recherches cliniques de Montreal (IRCM) Montreal Clinical Research Institute 110 avenue des Pins Ouest Montreal (Qc), Canada H2W 1R7 Phone: 514 799-2137


pwilmart commented 1 year ago

Hi Jonathan, TMT-labeled enriched phosphopeptide experiments are quite different from TMT-labeled relative protein abundance studies. The data volume and characteristics change a lot as you go from PSMs (scans) to peptides to proteins. Numbers of data points go from tens to hundreds of thousands (PSMs), to thousands to tens of thousands (peptides), to several hundred to a few thousand (proteins). The scatter (variance) in the data decreases in significant ways as more data is aggregated at each level. I mostly work with protein-level data.

Phosphopeptide data does not really have a protein-level view with respect to the data analysis. One should not do protein inference, protein FDR, or even aggregate data up to protein levels. These experiments are peptide-centric and require different logic for aggregating PSM data into any higher levels. Without protein inference and parsimony logic, you have much messier and more redundant data. The noise filtering associated with protein inference is underappreciated. Enrichment methods are efficient and phosphorylation may be restricted to specific regions of a protein, so any expectation of detecting multiple peptides per protein does not apply.

You can still improve data quality and reduce data volume (which lessens the hit from multiple testing corrections) by aggregating PSMs into phosphopeptides. There are many questions about how to do that, though. What do you want to do with site localization information if you have it available? I do not think site localization algorithms (mainly their assumptions) are very good. If you ignore site localization, charge state, and nuisance modifications like oxidized methionine, and lump together all scans that have the same peptide sequence and total number of phospho groups, you can get data reduction factors of 2 to 3.
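As an illustration, lumping scans by bare sequence and phospho count could look something like the dplyr sketch below (the input table and all column names are assumptions, and the modification string format depends on the search engine):

library(dplyr)
library(stringr)

# psm_table: one row per PSM with a modified-sequence string and TMT reporter
# ion columns named tmt_126 ... tmt_131 (all names here are hypothetical)
phosphopeptides <- psm_table %>%
  mutate(base_seq  = str_remove_all(mod_sequence, "\\[[^\\]]*\\]"),   # drop mod tags
         n_phospho = str_count(mod_sequence, fixed("[Phospho]"))) %>%
  group_by(base_seq, n_phospho) %>%
  summarise(across(starts_with("tmt_"), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")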

One of the secret sauce ingredients in protein-level IRS is that there is no requirement that the same specific peptides map to the proteins across the plexes. As long as there are protein values, and protein values in the reference channels, you can match intensities between plexes. Once you are stuck at more peptide-centric levels, as you point out, there will be less consistent observation of peptides between plexes and much more missing data to deal with. All of that complicates multi-plex phosphopeptide studies. It is better to stay within one plex if possible. Having up to 18 channels now helps.
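For reference, the between-plex matching is only a couple of lines once the reference averages exist. Continuing from the irs data frame in the question above (columns referenced by position, since the tibble() call there leaves them unnamed), the rest of the IRS step looks roughly like this:

# per-plex scaling factors: reference average divided by each plex's reference sum
irs$fac1 <- irs$average / irs[[1]]
irs$fac2 <- irs$average / irs[[2]]
irs$fac3 <- irs$average / irs[[3]]

# apply each plex's factors to its SL-normalized data (factors recycle row-wise)
exp1_irs <- exp1_sl * irs$fac1
exp2_irs <- exp2_sl * irs$fac2
exp3_irs <- exp3_sl * irs$fac3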

I do not have a solution for this problem. I have been trying to make progress on that since the pandemic hit. It is hard to free up enough uninterrupted time to code something complex like that. Or maybe my older brain has lost its ability to do that level of coding anymore.

Site localization always rears its ugly head when thinking about this problem. The biological information for signaling is indexed by specific sites in the proteins. That requires consistent sets of protein sequences where the residues match the site numbers in the biological databases. I do not know if specific phosphosite biological databases provide the human protein sequences (that their site numbering is based on) in FASTA format for use in proteomics data analysis. The site numbers in the biological database have to match the protein residue numbers in the FASTA file.

Assuming you have the right FASTA sequences for the planned downstream data interpretation, how do you aggregate site localizations for a set of PSMs that might have different localizations into something that represents the aggregated peptide intensity? There needs to be some extension of site localization calculations to sets of MS2 scans. That is a big problem to solve. I think there also needs to be some fuzzy mapping of phosphopeptides with sets of possible sites onto specific sites in the biological databases. This would require some serious work and thinking.

Given that the only source of experimental phosphopeptide data is bottom-up proteomics data, I would think that several issues would be critical to understand and communicate for any of the biological tools to work. One is how different enrichment methods enrich different peptides. How consistent is the biological picture that emerges from the same system when using different enrichment methods? What if the interpretation of results has a strong dependence on the enrichment method? Do you have to do multiple methods to get a correct picture? How do you combine data from different enrichment methods? How sensitive and reliable are different search engines at identifying modified peptides? How reliable are site localization algorithms? It is possible to have co-eluting peptides with different site localizations, too. Localizing one mod in a peptide is orders of magnitude easier than localizing 2, 3, or 4 mods in a peptide. Can you really combine STY+80 with other mods?

The biggest problem I see in the field is that all phosphopeptide data comes from tools doing protein-level summarizations (MaxQuant, MSFragger, PD, Mascot, Scaffold, etc.). The peptide reports in those tools are all conditioned on the inferred proteins, the protein FDR, and the protein ID requirements (one peptide per protein or more than one). You get lists of peptides (and only those peptides) that are associated with the final list of proteins. There are no tools doing a proper peptide-centric data summarization (that I am aware of). If you are wise, you can tweak the protein ID/FDR parameters in these tools to produce less biased peptide lists, but I am not sure you can ever get unbiased peptide-centric lists. Cheers, Phil


Monod79 commented 1 year ago

Hi Phil,

Sorry for the delay, but many thanks for this highly detailed answer. It’s great talking to someone who understands this topic!

Thanks!

Joe
