pwilmart / IRS_normalization

An exploration of internal reference scaling (IRS) normalization in isobaric tagging proteomics experiments.
MIT License

Overfitting data? #2

Closed: bblum9 closed this issue 5 years ago

bblum9 commented 6 years ago

In your statistical_testing analysis, following IRS normalization, the result of the decideTests function looks like you might be overfitting your data.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4679072/ https://www.biostars.org/p/145083/

pwilmart commented 6 years ago

Hi, Thanks for the links and question. IRS is not the same as typical genomics batch corrections. The data used for the corrections (the pooled standard channels in a properly designed experiment) are not used in the statistical testing. I think that removes the potential confounding between batch effects and biological effects. IRS specifically removes the random MS2 sampling effect from the data. That is a little different from the usual set of assumptions made for the types of batch effects seen in genomics data. Cheers, Phil

bblum9 commented 6 years ago

Hello,

Agreed, it is a different source of variation, and in our hands IRS appears to be an effective batch correction technique for TMT data.

From a stats point of view, setting aside sequencing vs. MS: if you perform differential analysis on the post-IRS data, which removes a coefficient (presumably for the MS2 sampling variation between batches), then the relevant statistic would be overestimated (with underestimated p-values) when you are using limma or similar, since you are not removing an additional degree of freedom (unless I'm missing where that occurs). You can see this is different if you take the normalized (but not IRS-batch-corrected) data and use the IRS coefficients as covariates in a linear model of ~disease+batch, versus just ~disease with the IRS-processed data.
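A minimal sketch of that comparison, assuming hypothetical log2 matrices exprs_sl (normalized but not IRS-corrected) and exprs_irs (IRS-corrected), plus disease and batch factors for the columns (none of these objects come from the repository):

```r
library(limma)

# Option 1: keep the batch term in the model and fit the non-IRS data
design_batch <- model.matrix(~ disease + batch)
fit_batch <- eBayes(lmFit(exprs_sl, design_batch))

# Option 2: drop the batch term and fit the IRS-corrected data
design_simple <- model.matrix(~ disease)
fit_irs <- eBayes(lmFit(exprs_irs, design_simple))

# Compare candidate counts at the same FDR; the concern above is that Option 2
# does not pay the degrees-of-freedom cost that Option 1 does.
summary(decideTests(fit_batch))
summary(decideTests(fit_irs))
```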

Separately, if it's only MS2 sampling variation, is the IRS batch correction not needed if you have sufficient technical replicates? Do you have a dataset with even n of 2 or 3 technical replicates to see if the data trend that direction?

Could some of the variation be due to labeling differences in the batches? This might not be completely random if there are subtle differences in sample preparation for the different samples prior to labeling.

Best,

pwilmart commented 6 years ago

Hi, I guess it comes down to how much you trust statistical models. I like to decouple the systematic measurement errors (the normalization parts) from the DE testing. I like to be able to visualize exactly what effect the normalization corrections have on the data and to confirm that the data are appropriate for the statistical model assumptions. I believe the models can handle the basic differential expression case: expression as a function of biological condition plus a random error term. That modeling is usually done independently for each gene. Many of these other factors (normalizations, batch effects, etc.) need information from multiple genes (or all of the genes). I worry whether situations that are not row-by-row are implemented correctly in the general linear modeling frameworks in R.

I have seen some threads where it is debated whether tools like Combat should be used first, or if the batch effects should be in the statistical model. Those arguments can probably be generalized to all normalization steps. I would guess that some datasets work better one way, and others work better the other way. Determining "working better" can be hard. These might be finer points and small differences. From an algorithm viewpoint, IRS is different from other normalization and batch correction methods. If IRS were modeled correctly, I imagine it could be incorporated into statistical models.

Technical replicates could serve the role of pooled internal standards and possibly be used to do a similar correction for the MS2 sampling effect. We have found that 2 out of 10 or 11 channels is a good compromise. Having two versus one channel for determining correction factors is a big improvement and allows for some QC checks as well. An advantage of the pooled sample approach is that some biological samples might not have all of the proteins present. It is better to have valid signals for all of the proteins to avoid correction artifacts.

There are lots of issues when doing bench work in parallel on larger numbers of samples. There are many correction and QC steps in these protocols. We do the best we can with protein assays, digestions, and labeling. We check how well everything worked at the end by taking a little of each labeled sample, mixing one-to-one, and doing a short TMT run. We then check the total intensities (excluding contaminants) per channel to see that they are similar (the same total amount of protein was supposed to have been labeled in each channel). We also do a variable modification search with the TMT tags to check that labeling efficiency is where it is supposed to be and that it is consistent across all channels. If the labeling looks funny, we will redo it.

We use the per-channel intensity totals to adjust how much of each channel we mix for the actual TMT experiment. Those volume corrections are not necessarily minor. We do not understand why they vary as much as they do. We have seen this from the beginning (we have been doing these TMT experiments since 2015). Some types of samples seem more consistent, and others are harder to work with. We also use different digestion protocols depending on the samples. We have tried different protein and peptide assay kits. The effect seems to persist.

These realities of the measurements are why I do not see too much point in arguing over statistical model details. The data are not really as “ideal” as the mathematical models assume. I think a lot of data visualization and sanity checks are required to have confidence in the experiments and in the differential expression candidate calls. Cheers, Phil
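As an aside, a toy illustration (with made-up numbers and channel names) of one way such a volume adjustment could be computed from the check-run totals:

```r
# hypothetical per-channel grand-total intensities from the short 1:1 check run
check_totals <- c(ch126 = 3.1e9, ch127 = 2.4e9, ch128 = 2.9e9, ch129 = 3.6e9)

# channels that gave more signal contribute proportionally less volume to the
# final mix; factors are relative to the weakest channel (factor = 1.0)
vol_factors <- min(check_totals) / check_totals
round(vol_factors, 2)
```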

pwilmart commented 6 years ago

Hi, Here are some reasons that IRS is not like Combat. Combat and many other batch correction algorithms rely on a balanced allocation of biological samples to each batch. Basically, each batch is a subset of the samples, and it is assumed that some fair splitting up of the samples into the subsets was done. The main assumption is that each subset of samples, on average, will be "the same". The most common way for batch corrections to fail is if the allocation of samples was biased (unbalanced), or if the variability of samples within the same condition is large (then the samples of the same condition may not be similar between batches).

These kinds of batch correction methods will be more robust if the number of samples in each batch is larger so that multiple samples from all conditions can be included. TMT is limited in terms of sample numbers per batch (10 or 11 at most) so these methods may not be the best.

The thing these algorithms need is some definition of "the same" for each batch so that they can adjust the data towards the "same" value. The average (per gene) across each batch is what is often used and can be OK if there are sufficient samples and the conditions per batch are balanced. IRS just takes this idea of "the same" to its logical conclusion.
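For concreteness, a sketch (not tied to any particular package) of that per-gene batch-average idea; the objects are hypothetical: a protein-by-sample log2 matrix log2_exprs and a batch factor for its columns. The IRS version follows below.

```r
# shift each batch so its per-protein average matches the overall per-protein average
grand_means <- rowMeans(log2_exprs)
for (b in levels(batch)) {
  cols <- batch == b
  sub <- log2_exprs[, cols, drop = FALSE]
  log2_exprs[, cols] <- sub - rowMeans(sub) + grand_means
}
```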

IRS explicitly puts the same data in each batch; namely, the duplicated pooled internal standard channels. There is no assumption about what constitutes "the same". The key thing with TMT data is that what is measured is the RELATIVE intensities of the channels from the same scan. The actual intensity scale for the scan is arbitrarily set by the instrument (calibration, tune, how clean it is, AGC target, AGC fill time, etc.). The intensity is also affected by when the analyte was selected for MS2 (low or high on the elution profile). MS2 selection is not really a completely random process, but it behaves somewhat like one.

IRS simply aligns the intensity scales, for each protein, between the TMT experiments based on the identical pooled standard channels. If you measure exactly the same thing, then you should get exactly the same number. The same intensity scale correction gets applied to all of the channels in each TMT experiment, so the biological samples are put on an aligned intensity scale between TMT experiments while maintaining the relative intensities. There is no variance reduction aspect like in Combat. The data used for the corrections are independent of the data later used in the statistical testing. I do not think that there is any potential for overfitting because there is not really any fitting being done. It is a deterministic correction factor, not a fitted parameter.
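A minimal sketch of that correction for two TMT experiments (the object names are assumptions, not code from this repository): tmt_A and tmt_B are protein-by-channel intensity matrices with the proteins in the same row order, and ref_A and ref_B name the pooled-standard columns in each plex.

```r
ref_sum_A <- rowSums(tmt_A[, ref_A])   # per-protein pooled-standard signal, plex A
ref_sum_B <- rowSums(tmt_B[, ref_B])   # per-protein pooled-standard signal, plex B

target <- sqrt(ref_sum_A * ref_sum_B)  # geometric mean: the number both plexes should report

# one deterministic factor per protein per plex, applied to every channel in that plex
irs_A <- tmt_A * (target / ref_sum_A)
irs_B <- tmt_B * (target / ref_sum_B)
```

After this, the pooled-standard signals are on the same scale in both plexes by construction, and the biological channels are carried along with them.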

This makes sense to me (n=1), but I could be missing something. Cheers, Phil

bblum9 commented 6 years ago

I think we agree on many points, especially around the technical difficulty of handling data analysis of these workflows. I'll just point out the following:

"I guess it comes down to how much you trust statistical models."

This may not be the most productive way to ask the question. Assuming an understanding of the statistical model, a better question might be: how well does the model describe these data and the analysis?

I agree that decoupling the normalization and the DE testing is good. It would be even better to treat normalization and batch correction as two separate steps, where normalization adjusts global intensity distributions only.

A fundamental problem with your method as put forward is that, in using the IRS correction factor, you may be removing variation that is not simply due to batch variation (see the previous citations), which by definition will result in overfitting your data and overestimating confidence in differences between biological conditions.

I suggest finding a collaborator with a strong background in the math/stats side of things and together coming up with a plan to refine and validate your method: consider what datasets would be needed, how things change with technical replicates (and how to model this change in the case of a small dataset), and how to balance sample requirements across TMT sets.

Your reasoning, while not invalid, is the same reasoning that led to overfitting data in the microarray/sequencing space. Your method does not currently address/model this concern. As such, it should be used with caution and probably not before differential analysis.

pwilmart commented 6 years ago

Hi, I am glad that we have some agreement on these points. I think one area that needs clarification is that random sampling (of MS2 scans) makes different TMT experiments appear similar to batches in genomics datasets (from MDS and PCA plots). However, random sampling is not really a batch effect. Data dependent acquisition is not truly a random sampling situation, but it seems to be similar {Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.}. Batch effects as discussed in {Nygaard, V., Rødland, E.A. and Hovig, E., 2016. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics, 17(1), pp.29-39.} are typically systematic effects.

The developmental lens study that I used in my Github analyses was not a properly designed experiment for IRS. There were no identical pooled internal standards present in each TMT experiment. The pooled standards are very important. The assumption (and calling it an assumption is probably not fair) is that if you measure the exact same thing multiple times, you should get the same number. That is what measuring anything should do. The IRS method just makes that true for multiple TMT datasets. It is an exact correction factor for the specific, well-understood phenomenon of data dependent acquisition used in shotgun proteomics.

In the 2017 MCP paper {Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of TMT labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, pp.mcp-M116.}, we validated IRS by splitting the two pooled standards within each TMT into normalization vectors and validation vectors. We used a single standard channel in each TMT (instead of the average of two) to compute the IRS factors and then tested how “identical” the validation channels were before and after IRS. That work is in the Supplemental materials. I think that is about as robust a validation as you can design.
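A sketch of that validation idea, reusing the hypothetical objects from the IRS sketch above: compute the factors from a single standard channel in each plex, then see how close the held-out standard channels are to each other after correction.

```r
norm_A <- tmt_A[, ref_A[1]]                      # standard channel used for the factors
norm_B <- tmt_B[, ref_B[1]]
target <- sqrt(norm_A * norm_B)

val_A <- tmt_A[, ref_A[2]] * (target / norm_A)   # held-out standard channel, corrected
val_B <- tmt_B[, ref_B[2]] * (target / norm_B)

# per-protein percent difference between the two held-out standards after IRS
summary(200 * abs(val_A - val_B) / (val_A + val_B))
```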

I will mention again that there is no variance reduction aspect to IRS (in contrast to Combat). If you tally characteristics of single TMT experiment datasets, such as CV distributions within biological conditions for different types of samples, you get a baseline for TMT performance for the MS3 SPS acquisition method. Better samples (yeast and cultured cell lysates) achieve CVs of 10%. More difficult samples (like fat deposits) will have higher CVs more like 20%. It is clear that multiple TMT experiments without IRS are not like single TMT experiments (60% CVs are more akin to spectral counting). With IRS, multiple TMT experiments have characteristics much more similar to single TMT experiments. You do not get “better looking” data after IRS than you get from single TMT experiment data. I have never seen any indications of over correction in the numerous datasets we have generated in the nearly 4 years that we have been doing these types of experiments.
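A sketch of the kind of CV tally being described, assuming a hypothetical protein-by-sample intensity matrix exprs and a condition factor for its columns:

```r
# per-protein CV (%) within each biological condition
cv_by_group <- sapply(levels(condition), function(g) {
  grp <- exprs[, condition == g, drop = FALSE]
  100 * apply(grp, 1, sd) / rowMeans(grp)
})

# compare these distributions before IRS, after IRS, and for a single TMT experiment
summary(as.vector(cv_by_group))
boxplot(cv_by_group, ylab = "CV (%)")
```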

I put the notebooks up on GitHub so that any and all sanity checks (that I could think of) on these types of data could be shared, and so that the data could speak for themselves. I am skeptical of statistical modeling because I have extensive experience with modeling of nuclear spectroscopy data and know first-hand the limitations of parameter estimation.

If one were to remove batch effects externally and then apply a statistical model that included batch factors, that would lead to overfitting because there will be too many terms in the model. If you remove batch effects first and then use a model that does not have batch factors, then there is no possibility of overfitting. (It can still be a poor model for the data, but that is a separate issue).

Your initial post suggests that you thought that there might be too many DE candidates after IRS for the developing mouse lens. That is not true. The lens is a very well-known system. In more mature lenses (like P30), the dozen or so crystallins will make up 90% of the wet weight of the lens. Even in newborn mice {Ueda, Y., Duncan, M.K. and David, L.L., 2002. Lens proteomics: the accumulation of crystallin modifications in the mouse lens with age. Investigative ophthalmology & visual science, 43(1), pp.205-215.}, one can only detect crystallins with 2-D gels. The correct DE pattern is a small number of up-regulated lens-specific proteins and EVERYTHING else down-regulated. I used the lens study despite its lack of pooled standards because it is the closest thing to a gold standard for a multiple-TMT DE experiment that is not overly artificial.

It is nice that you do not think that I have much background in mathematics or statistics. I wanted the notebooks to be more accessible to scientists rather than statisticians. I have a Bachelor’s degree in mathematics (along with chemistry and physics) and did my Ph.D. research in experimental nuclear physics. Cheers, Phil

bblum9 commented 6 years ago

First, I’m sorry I wasn’t clearer in my last post. I did not mean to imply that your training is not excellent. Rather, I meant to suggest that the method you are developing may not be appropriately accounting for variances that are VERY difficult to model. I often find such challenges are great opportunities for collaboration.

Second, your commitment to open source is excellent and I hope others will model it.

I think maybe this is key to where we disagree:

“The assumption (and calling it an assumption is probably not fair) is that if you measure the exact same thing multiple times, you should get the same number. That is what measuring anything should do.”

Repeated measurements of the same thing should not yield the same number. They should sample from the same distribution. This distribution will be a function of all the sources of biological and technical variation in a given system.

Let’s suppose a hypothetical experiment:

You have 20 samples that you run on two TMT 10-plex runs: A and B. The 20 samples are well balanced across the two TMT 10-plex runs for biological conditions. You state:

“random sampling (of MS2 scans) makes different TMT experiments appear similar to batches in genomics datasets (from MDS and PCA plots). However, random sampling is not really a batch effect.”

It does appear IRS reflects this assumption, that is, it assumes all the variance between A and B is a result of MS2 sampling. However, I do not think that is the only source of variation between A and B. Different labeling reactions, different biological replicates, etc., will also contribute to variation (but are being removed, hence the overfitting).

Let’s say we can have an extended experimental design. You have sufficient sample to run each of the pools (A and B) with 3 technical replicates (A1, B1, A2, B2, A3, B3).

You then repeat that exact design a month later, and then a third time a month after that. So, you end up with:

T1: A1, B1, A2, B2, A3, B3
T2: A1, B1, A2, B2, A3, B3
T3: A1, B1, A2, B2, A3, B3

I would expect there to be batch effect between the three time points (T1, T2, T3), as well as the different TMT-labeling experiments (A and B). In the extended experimental design, you could model variance as a function of MS2 sampling (by comparing technical replicates: A1, A2, A3), you could model variance as a function of TMT reactions and biological replicates (A, B), and you could possibly estimate batch (T1, T2, T3).
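One way to sketch that decomposition for a single protein, assuming a hypothetical long-format data frame d with columns intensity, pool (A/B), timepoint (T1/T2/T3), and a technical-replicate index:

```r
# partition the variability among the design factors with a simple ANOVA
fit <- aov(log2(intensity) ~ pool + timepoint, data = d)
summary(fit)
# timepoint -> batch-like variation among the three runs of the design
# pool      -> labeling reaction and biological composition differences (A vs. B)
# residuals -> replicate-to-replicate scatter, i.e. the MS2 sampling component
```

A mixed model with random terms for plex and time point would be a fuller treatment, but the idea is the same.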

It is excellent you are accumulating data sets. One of the difficulties I think we agree on with TMT/MS proteomics is the issue of cost and bandwidth for generating even modestly sized data sets. As you continue to accumulate datasets, I hope you will consider the above as you consider ways to refine and validate your method.

bblum9 commented 6 years ago

"I will mention again that there is no variance reduction aspect to IRS (in contrast to Combat)."

I'll add that the variance reduction part of Combat is important. It compensates for the fact that not all of the variation being removed is a result of the "batch", and it limits overfitting of the data. This estimation is particularly important for relatively small datasets.

pwilmart commented 6 years ago

Hi, You are certainly correct that no two measurements will ever be exactly the same. That does not mean that what is being measured is not exactly the same. You are also correct that there are multiple sources of variation associated with any measurement and that any normalization likely corrects more than one source of variation.

Consider Western blot analysis, where a housekeeping gene is used to normalize the measurements. You can think of the pooled standard channels in IRS as customized housekeeping genes for each protein. Different TMT experiments are like different blots, and the pooled standard channels are the loading controls. Uncertainties in the housekeeping gene measurements propagate into the measurements of the genes of interest, so things are not perfect. That is the main reason we advocate for duplicate pooled standards in each TMT experiment (to balance accuracy versus throughput).

A SILAC analogy is also relevant. SILAC has high precision because everything that affects the light channel equally affects the heavy channel. The TMT channels are like that. Whatever affects the pooled standard channels also affects the biological sample channels (all measurements are within the same scan). Combining PSMs into aggregated protein totals preserves all of the relative precision, and the summing reduces the variance of the reporter ion signals.
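For reference, the roll-up being described is a per-protein sum of the PSM reporter ion intensities; a sketch with an assumed psms data frame (one row per PSM, a protein column, and one tmt_* column per channel):

```r
channel_cols <- grep("^tmt_", names(psms), value = TRUE)

# sum the reporter ion intensities of all PSMs mapping to each protein;
# summing preserves the relative channel pattern while reducing noise
protein_totals <- aggregate(psms[, channel_cols],
                            by = list(protein = psms$protein),
                            FUN = sum)
```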

My discussions of IRS do simplify some things to emphasize the major points about the key concepts. The notebooks are already pretty long, and more detail would probably not improve them for the intended audience (graduate students and post docs doing the data analysis). I think that better science is doing as little to the data as possible to get a convincing answer. Overly complicated analyses and models require too many assumptions to be justified. I feel that you have to prove that a simple approach is inadequate before you can justify doing something more complicated. That is just the way that I was trained to think about science many, many years ago. Cheers, Phil

bblum9 commented 6 years ago

I agree simplicity is often more elegant. However, when proposing a new method, one has the responsibility to validate the method; often, in my experience, this is the most difficult step.

Considering this discussion, your publication, and the first reference in my initial comment, it seems most likely that your technique (1) effectively removes batch effects for visualizing TMT datasets and (2) overfits the data by underestimating error, leading to inflated confidence estimates in downstream differential analysis. This doesn't mean it's wrong; it just means it should be used with caution, and using FDR thresholds for DE (e.g., claiming X proteins differential at FDR < 0.05) is not appropriate. If we are not in agreement on these points, then I would encourage you to look closely at the math discussed in my first reference (ignore for a second the type of data) and focus on the discussion of batch and error estimation.

I look forward to watching your method evolve - there is a clear need in the field.

pwilmart commented 6 years ago

Hi, The Nygaard et al. paper was interesting but mostly not relevant to IRS. As I have tried to explain several times, IRS with pooled internal standards is not the same as genomic batch correction. IRS would take the data in Figure 1B and get back to something nearly the same as Figure 1A. You do not get situations like Figures 1C-D.

In fact, IRS is not affected by whether the study design is balanced or unbalanced. The data used for the correction are separate from the biological replicate data. That is a fundamental difference between IRS and typical batch correction.

The conclusions in the aforementioned paper are that batch corrections for balanced study designs generally work fine, but there can be problems if the study designs are unbalanced. The developing lens data that I used as an example is a perfectly balanced study design. The issues you are worrying about would not happen in that data. It is clear from the CV distributions comparing Combat-corrected data to IRS-corrected data that IRS works much better than Combat, even though this should be a nearly ideal case for Combat. After Combat, the CV distributions have a clear trend with developmental time, and the MDS plots do not segregate by time point as well as they do after IRS.

I have read every proteomics TMT/iTRAQ paper on statistical methods and data normalization that I have been able to find, and I have been searching high and low since the fall of 2014. I have also read many batch correction papers (nearly all in genomics, as I cannot find much in proteomics). I have done far more validation, explanation, and application to a wider variety of datasets than I have seen in any other publication. The pooled standard channels are technical replicates, and the validation exercise we did in the 2017 MCP paper is far better than artificial mixtures with duplicated channels.

There is no evidence that IRS over-corrects data (over-fitting is not being used properly here; it is defined as having more adjustable parameters in a model than data points, and there is no modeling associated with IRS and no fitting). The decideTests result after IRS of 2629 down and 121 up candidates is the more correct answer for the developing mouse lens. IRS also has nothing to do with data visualization. If visualizations look better after IRS, it is because the underlying data have been properly corrected. Cheers, Phil

pwilmart commented 6 years ago

Hi, We just completed a large experiment that had some technical replicates to play with. I added a new repository with an analysis of the technical replicates: https://github.com/pwilmart/IRS_validation.git

There is an HTML rendering of the notebook that can be downloaded. The client has not seen the data yet so it cannot be shared. The analysis does not really depend on the experiment or give away any potential results so sharing this analysis should be OK.

I hope this data addresses some of your issues and concerns. If you have any questions, post here and I will try and answer them. Cheers, Phil