Question about handling NAs

james-cs commented 3 years ago

We are trying to implement proBatch into our data analysis pipelines, but our data has a lot of NAs. We've found that batch correction only works when we filter out peptides containing NAs, which leaves us with a small fraction of our data. If we keep peptides with at most 20% NAs or at most 50% NAs, then batch effects are still seen in our data after batch correction. What is the best way to handle NAs in our data? Should they be removed before doing batch correction?

We also found that if we impute NAs with row means before running the PCA and PVCA (but after batch correction), the batch effects seem to disappear. With the default -1 imputation for NAs in PCA and PVCA, clear batch effects are still visible in the data after batch correction. Is -1 imputation the best method for handling NAs when trying to view batch effects in PCA and PVCA?

james-cs commented 3 years ago

We've also encountered this situation where PCA appears to be batch corrected, but in the hierarchical clustering there still appear to be batch effects. Would you interpret batch correction as successful in these situations?

Here are the pre and post digestion batch corrected PCA:

And the post digestion batch corrected hierarchical clustering:

ppedrioli commented 3 years ago

> We are trying to implement proBatch into our data analysis pipelines, but our data has a lot of NAs. We've found that batch correction only works when we filter out peptides containing NAs, which leaves us with a small fraction of our data. If we keep peptides with at most 20% NAs or at most 50% NAs, then batch effects are still seen in our data after batch correction. What is the best way to handle NAs in our data? Should they be removed before doing batch correction?
>
> We also found that if we impute NAs with row means before running the PCA and PVCA (but after batch correction), the batch effects seem to disappear. With the default -1 imputation for NAs in PCA and PVCA, clear batch effects are still visible in the data after batch correction. Is -1 imputation the best method for handling NAs when trying to view batch effects in PCA and PVCA?

Missing values are tricky, and I am afraid I don't have a one-size-fits-all answer for you. However, here are some suggestions:

- Check the extent to which the missing values are driving your batch effect. Fill them with zeros and do a few diagnostic plots (e.g. hierarchical clustering, PCA, or PVCA) with matrices that contain varying fractions of missing values. This might also help you identify an acceptable threshold for filtering your initial matrix, so that you don't lose too many peptides while minimizing the batch effect driven by missing values.
- Give the NAguideR server from Yansheng Liu's lab a go. It might help you find an imputation strategy that best suits your data set (Wang, S. et al. NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses. Nucleic Acids Research 48, e83–e83 (2020)). Generally speaking, we found that it might be better to do the imputation after the batch correction. However, I would try it both before and after batch correction and compare the diagnostic plots to the ones you obtained in the previous point.
- Sometimes a few bad samples can really increase the missingness. If this is the case, you might want to consider dropping those samples instead of dropping peptides.
- Depending on the final analysis you need to do, you might be able to use a methodology that is compatible with missing values. In that case, it might be better not to impute at all.
- If you are going to do your final analysis at the protein level, you could also try to reduce the sparsity of your matrix by initially keeping only the top N peptides per protein (sorted by fewest missing values and perhaps highest median intensity).
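To make the filtering and sample-dropping suggestions above a bit more concrete, here is a minimal sketch in plain Python. It is not proBatch code: the function names (`filter_peptides`, `sample_missingness`, `drop_samples`), the toy peptide names, and the intensity values are all made up for illustration, and NAs are represented as `None`.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: peptide -> one intensity per sample, None marks an NA.
Matrix = Dict[str, List[Optional[float]]]


def missing_fraction(values: List[Optional[float]]) -> float:
    """Fraction of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def filter_peptides(matrix: Matrix, max_missing: float) -> Matrix:
    """Keep peptides whose fraction of NAs is at most max_missing."""
    return {pep: vals for pep, vals in matrix.items()
            if missing_fraction(vals) <= max_missing}


def sample_missingness(matrix: Matrix, samples: List[str]) -> Dict[str, float]:
    """Fraction of peptides that are missing in each sample (column)."""
    n_peptides = len(matrix)
    return {s: sum(vals[i] is None for vals in matrix.values()) / n_peptides
            for i, s in enumerate(samples)}


def drop_samples(matrix: Matrix, samples: List[str],
                 to_drop: List[str]) -> Tuple[Matrix, List[str]]:
    """Remove the given samples (columns) from the matrix."""
    keep = [i for i, s in enumerate(samples) if s not in to_drop]
    new_matrix = {pep: [vals[i] for i in keep] for pep, vals in matrix.items()}
    return new_matrix, [samples[i] for i in keep]


# Toy data, invented for this example.
samples = ["s1", "s2", "s3", "s4"]
toy: Matrix = {
    "PEPTIDE_A": [10.1, 10.3, 9.8, 10.0],
    "PEPTIDE_B": [8.2, None, 8.0, None],
    "PEPTIDE_C": [None, None, 7.5, None],
    "PEPTIDE_D": [12.0, 11.7, 11.9, None],
}

# How many peptides survive at different NA thresholds?
for threshold in (0.0, 0.2, 0.5):
    print(threshold, sorted(filter_peptides(toy, threshold)))

# Which samples contribute most of the missingness?
per_sample = sample_missingness(toy, samples)
print(per_sample)

# Drop the worst sample, then filter peptides again at the same threshold.
worst = max(per_sample, key=per_sample.get)
reduced, reduced_samples = drop_samples(toy, samples, [worst])
print(worst, sorted(filter_peptides(reduced, 0.2)))
```

In this toy case a single sample carries most of the NAs, so dropping it lets an extra peptide pass the 20% threshold. On real data you would run the diagnostic plots on each filtered matrix rather than just counting peptides.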

> We've also encountered this situation where PCA appears to be batch corrected, but in the hierarchical clustering there still appear to be batch effects. Would you interpret batch correction as successful in these situations?
>
> Here are the pre and post digestion batch corrected PCA:
>
> And the post digestion batch corrected hierarchical clustering:

It would be nice not to see clusters in either of the two. You could also look at the correlations of samples before and after correction. Ideally, you want to see that intra-batch sample correlations are not higher than inter-batch correlations after correction. Also, what happens if you look at some peptide-level diagnostics rather than proteome-wide ones? Do you still see a batch-dependent trend in individual peptide intensities? You can also try to "quantify" the improvement by plotting the correlation of peptides from the same or from different proteins before and after correction. The idea is that the correlations of peptides from the same protein stay similar, while those of peptides from different proteins drop closer to zero. In the end, you might never be able to completely remove the batch effect, but you might still be able to push the contribution from relevant factors higher. What does a PVCA show you for this case? Did you succeed in pushing the relevance of the "biologically interesting" factors higher?
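As a rough illustration of the correlation check described above, here is a small plain-Python sketch. It is not part of proBatch: `pearson`, `grouped_correlation_summary`, the sample names, and the toy values are all hypothetical. It assumes NAs are stored as `None` and uses only pairwise-complete observations.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Dict, List, Optional, Tuple

Profile = List[Optional[float]]


def pearson(x: Profile, y: Profile, min_pairs: int = 3) -> Optional[float]:
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < min_pairs:
        return None  # too few shared observations to say anything
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    var_x = sum((a - mx) ** 2 for a, _ in pairs)
    var_y = sum((b - my) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # constant profile, correlation undefined
    return cov / sqrt(var_x * var_y)


def grouped_correlation_summary(profiles: Dict[str, Profile],
                                group_of: Dict[str, str]) -> Tuple[float, float]:
    """Mean correlation of pairs in the same group vs. in different groups."""
    same, different = [], []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r is None:
            continue
        (same if group_of[a] == group_of[b] else different).append(r)
    return mean(same), mean(different)


# Toy sample profiles (one value per peptide), invented for this example.
sample_profiles = {
    "a1": [1.0, 2.0, 3.0, 4.0],
    "a2": [1.0, 2.0, 4.0, 3.0],
    "b1": [4.0, 3.0, 2.0, 1.0],
    "b2": [4.0, 3.0, 1.0, 2.0],
}
batch_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

within, between = grouped_correlation_summary(sample_profiles, batch_of)
print(f"intra-batch: {within:.2f}, inter-batch: {between:.2f}")
```

Running this on the matrix before and after correction gives two pairs of numbers to compare; a shrinking gap between the intra-batch and inter-batch means is the behaviour you are hoping for. The same function can be pointed at peptide profiles with a peptide-to-protein mapping as `group_of` to compare same-protein and different-protein peptide correlations.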

symbioticMe / proBatch

Question about handling NAs #6