symbioticMe / proBatch

Tools for Batch Effects Diagnostics and Correction

Question about handling NAs #6

Open james-cs opened 3 years ago

james-cs commented 3 years ago

We are trying to implement proBatch into our data analysis pipelines, but our data has a lot of NAs. We've found that batch correction only works when we filter out peptides containing NAs, which leaves us with only a small fraction of our data. If we keep peptides with at most 20% NAs or at most 50% NAs, batch effects are still visible in our data after batch correction. What is the best way to handle NAs in our data? Should they be removed before doing batch correction?

We also found that if we impute NAs with row means before running the PCA and PVCA (but after batch correction), batch effects seem to disappear. Using the default -1 imputation for NAs in PCA and PVCA still shows clear batch effects in the data after batch correction. Is -1 imputation the best method for handling NAs when trying to view batch effects in PCA and PVCA?
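For readers following along, the three NA strategies mentioned above (filtering peptides by NA fraction, a constant -1 fill, and a row-mean fill) can be sketched in a few lines. This is a minimal NumPy illustration, not proBatch code: the function names and the toy matrix are hypothetical, and it assumes a peptides x samples matrix of log intensities with NAs stored as NaN.

```python
import numpy as np

def filter_by_na_fraction(matrix, max_na_fraction):
    """Keep rows (peptides) whose fraction of NaN values is <= max_na_fraction."""
    na_fraction = np.isnan(matrix).mean(axis=1)
    return matrix[na_fraction <= max_na_fraction]

def impute_constant(matrix, fill_value=-1.0):
    """Replace every NaN with a single constant (the -1 style fill)."""
    return np.where(np.isnan(matrix), fill_value, matrix)

def impute_row_means(matrix):
    """Replace each NaN with the mean of the observed values in its row.

    Assumes every row has at least one observed value.
    """
    row_means = np.nanmean(matrix, axis=1, keepdims=True)
    return np.where(np.isnan(matrix), row_means, matrix)

# Hypothetical toy matrix: rows are peptides, columns are samples.
toy = np.array([
    [20.0, 21.0, np.nan, 22.0],   # 25% NA
    [18.0, np.nan, np.nan, 19.0], # 50% NA
    [25.0, 25.5, 24.5, 25.0],     # complete
])

kept = filter_by_na_fraction(toy, max_na_fraction=0.25)
constant_filled = impute_constant(kept)
mean_filled = impute_row_means(kept)
print(constant_filled)
print(mean_filled)
```

One possible reading of the difference, offered as an assumption rather than a diagnosis: a constant fill far from the observed intensity range turns the missingness pattern itself into a large source of variance, so if missingness differs between batches it can show up as a batch effect in PCA. A row-mean fill places missing values at the centre of each peptide and hides that pattern, which is not the same as removing it.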

james-cs commented 3 years ago

We've also encountered this situation where PCA appears to be batch corrected, but in the hierarchical clustering there still appear to be batch effects. Would you interpret batch correction as successful in these situations?

Here are the pre and post digestion batch corrected PCA: image

And the post digestion batch corrected hierarchical clustering: image

ppedrioli commented 3 years ago

> We are trying to implement proBatch into our data analysis pipelines, but our data has a lot of NAs. We've found that batch correction only works when we filter out peptides containing NAs, which leaves us with only a small fraction of our data. If we keep peptides with at most 20% NAs or at most 50% NAs, batch effects are still visible in our data after batch correction. What is the best way to handle NAs in our data? Should they be removed before doing batch correction?
>
> We also found that if we impute NAs with row means before running the PCA and PVCA (but after batch correction), batch effects seem to disappear. Using the default -1 imputation for NAs in PCA and PVCA still shows clear batch effects in the data after batch correction. Is -1 imputation the best method for handling NAs when trying to view batch effects in PCA and PVCA?

Missing values are tricky, and I am afraid that I don't have a one-size-fits-all answer for you. However, here are some suggestions:

> We've also encountered this situation where PCA appears to be batch corrected, but in the hierarchical clustering there still appear to be batch effects. Would you interpret batch correction as successful in these situations?
>
> Here are the pre and post digestion batch corrected PCA: image
>
> And the post digestion batch corrected hierarchical clustering: image

It would be nice not to see clusters in either of the two. You could also look at the correlations of samples before and after correction. Ideally, after correction, intra-batch sample correlations should not be higher than inter-batch correlations. Also, what happens if you look at peptide-level diagnostics rather than proteome-wide ones? Do you still see a batch-dependent trend in individual peptide intensities? You can also try to "quantify" the improvement by plotting the correlation of peptides from the same protein and from different proteins, before and after correction. The idea is that the correlations of peptides from the same protein stay similar, while those from different proteins drop closer to zero. In the end, you might never be able to remove the batch effect completely, but you might still be able to push the contribution from relevant factors higher. What does a PVCA show you for this case? Did you succeed in pushing the relevance of the "biologically interesting" factors higher?
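As a rough illustration of the intra-batch versus inter-batch correlation check described above, here is a minimal NumPy sketch. It is not proBatch code: the helper name, the simulated data, and the stand-in "correction" (centring each peptide within each batch) are all assumptions made for the example, and it expects a complete matrix with no NAs.

```python
import numpy as np

def within_between_correlation(matrix, labels):
    """Mean Pearson correlation of column pairs sharing a label vs. not.

    matrix: complete (no NaN) array; columns are the items being compared.
    labels: one group label per column (for example, batch per sample).
    """
    labels = np.asarray(labels)
    corr = np.corrcoef(matrix, rowvar=False)            # columns x columns
    pair = np.triu(np.ones(corr.shape, dtype=bool), 1)  # count each pair once
    same = labels[:, None] == labels[None, :]
    return corr[pair & same].mean(), corr[pair & ~same].mean()

# Simulated data, for illustration only: 200 peptides, 6 samples, 2 batches.
rng = np.random.default_rng(0)
batches = np.array(["A", "A", "A", "B", "B", "B"])
batch_index = np.array([0, 0, 0, 1, 1, 1])
peptide_level = rng.normal(20.0, 2.0, size=(200, 1))
batch_shift = rng.normal(0.0, 1.0, size=(200, 2))[:, batch_index]
before = peptide_level + batch_shift + rng.normal(0.0, 0.5, size=(200, 6))

# Stand-in correction (not proBatch): centre each peptide within each batch,
# then restore the peptide's overall mean.
after = before.copy()
for batch in np.unique(batches):
    cols = batches == batch
    after[:, cols] -= after[:, cols].mean(axis=1, keepdims=True)
after += before.mean(axis=1, keepdims=True)

intra_before, inter_before = within_between_correlation(before, batches)
intra_after, inter_after = within_between_correlation(after, batches)
print(f"before: intra={intra_before:.3f} inter={inter_before:.3f}")
print(f"after:  intra={intra_after:.3f} inter={inter_after:.3f}")
```

The same helper can be reused for the peptide-level check by passing the transposed matrix together with one protein label per peptide, which gives the mean correlation of peptide pairs from the same protein versus from different proteins.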