pwilmart / IRS_normalization

An exploration of internal reference scaling (IRS) normalization in isobaric tagging proteomics experiments.
MIT License

Applying SL, TMM, and IRS on log2 data or raw data #4

Open lisiarend opened 1 year ago

lisiarend commented 1 year ago

Hey, I have a question regarding the three normalization methods that you applied. It actually makes a difference whether you apply, for example, SL to raw data and then log2-transform the data for visualization, or apply SL to data that has already been log2-transformed. The same holds for TMM and IRS.

What is best practice for this? And why would you (as in your markdown) use these methods on raw data?

Best, Lis

pwilmart commented 1 year ago

Hi Lis, I don't work with log2 data because mathematical operations on logs are not the same as operations on non-logged values. For example, averages of non-logged values are simple (arithmetic) averages. Averages of log2 values correspond to geometric means: back-transforming the average of the logs gives the geometric mean, not the arithmetic mean. Internally, routines like TMM might be working with logged values, so you do not want to pass in data that is already logged. You are correct that the normalization methods would not give the same results with logged data. All those normalization methods assume the data is on its native (linear) scale.
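The two points above can be illustrated with a small sketch. It is not taken from the repository's notebooks: the intensity values are made up, and `sl_scale` is a hypothetical helper that assumes SL means "scale each channel so its column sum equals the average column sum". Part 1 compares the arithmetic mean with the back-transformed mean of log2 values; Part 2 applies the same scaling recipe to linear data and to already-logged data to show that the results differ.

```python
import math
from statistics import mean, geometric_mean

# --- Part 1: averaging on the linear scale vs. averaging log2 values ---
# Made-up intensities for one protein across four channels.
intensities = [100.0, 200.0, 400.0, 3200.0]

arithmetic_mean = mean(intensities)
mean_of_logs = mean([math.log2(x) for x in intensities])
back_transformed = 2 ** mean_of_logs  # this is the geometric mean

print("arithmetic mean:       ", arithmetic_mean)
print("2 ** mean(log2 values):", back_transformed)
print("geometric mean:        ", geometric_mean(intensities))

# --- Part 2: the same scaling recipe on linear vs. logged data ---
# Made-up table: rows are proteins, columns are two channels (samples).
raw = [
    [100.0, 300.0],
    [200.0, 500.0],
    [700.0, 1200.0],
]

def sl_scale(table):
    """Hypothetical SL-style step: scale each column to the average column sum."""
    col_sums = [sum(col) for col in zip(*table)]
    target = mean(col_sums)
    factors = [target / s for s in col_sums]
    return [[v * f for v, f in zip(row, factors)] for row in table]

def log2_table(table):
    """Element-wise log2 of a table of positive values."""
    return [[math.log2(v) for v in row] for row in table]

# Route A: normalize on the linear scale, then log2 (e.g. for plotting).
route_a = log2_table(sl_scale(raw))
# Route B: log2 first, then apply the same recipe to the logged values.
route_b = sl_scale(log2_table(raw))

for a_row, b_row in zip(route_a, route_b):
    print([round(v, 3) for v in a_row], "vs", [round(v, 3) for v in b_row])
```

In Route A the scaling factor becomes a constant shift per channel after the log2 transform, so differences between proteins within a channel are preserved. In Route B the logged values themselves are multiplied by a factor, which stretches or compresses those differences.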

I may use log scales in plots but I try to keep data in its natural scale as much as possible. I also try to avoid ratios (which usually also need a log2 transform). Both logs and ratios change the mathematical space of the numbers and our brains do not mentally visualize those spaces correctly. Our intuition with numbers really only applies to linear numerical spaces.

I think one reason you see a lot of log transformations in R scripts is parametric statistical modeling. It is often assumed (i.e. not tested) that the data is not normally distributed and that the logged data might be. The argument that the data is not Normal is usually based on the distribution of all data values in a genome or proteome (the full dataset). However, the statistical modeling is applied per gene/protein, and it is the distribution of the values for single proteins/genes that should be normally distributed. The distribution of all proteins or genes is irrelevant.

Cheers, Phil
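The distinction between the pooled distribution and the per-protein distributions can be sketched with a small simulation. Everything here is an assumption made for illustration, not real data: protein abundances are drawn to span four orders of magnitude, and replicate measurements of each protein are drawn from a normal distribution with a 10% CV. Under those assumptions the pooled values are strongly right-skewed even though every individual protein is normally distributed by construction.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: near 0 for symmetric data, > 0 for a long right tail."""
    m = mean(values)
    s = pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

rng = random.Random(0)  # fixed seed so the run is repeatable

n_proteins = 200      # assumed number of proteins
n_replicates = 50     # assumed number of measurements per protein

pooled = []
per_protein_skew = []
for _ in range(n_proteins):
    # Assumption: abundances span four orders of magnitude.
    abundance = 10 ** rng.uniform(3, 7)
    # Assumption: replicates are normal around that abundance with a 10% CV.
    replicates = [rng.gauss(abundance, 0.10 * abundance) for _ in range(n_replicates)]
    pooled.extend(replicates)
    per_protein_skew.append(skewness(replicates))

pooled_skew = skewness(pooled)
mean_protein_skew = mean(per_protein_skew)

print("skewness of all values pooled:", round(pooled_skew, 2))
print("average per-protein skewness: ", round(mean_protein_skew, 2))
```

A skewed histogram of the whole dataset therefore says little about whether the per-protein values that the statistical test actually sees are normally distributed.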

lisiarend commented 1 year ago

Okay, thank you very much.

I am currently working with several normalization methods for proteomics data and evaluating them on multiple datasets. It is therefore important to know which methods require log2-transformed data and which do not. That isn't always easy to find out, but thanks very much for your quick response :)