immunogenomics / harmony

Fast, sensitive and accurate integration of single-cell data with Harmony
https://portals.broadinstitute.org/harmony/

Lambda selection #232

Closed erflynn closed 8 months ago

erflynn commented 9 months ago

Re issue #157 @pati-ni - what are the updates to Harmony that mitigate the overcorrection? Also, could you describe how automated lambda selection works (e.g. when lambda=NULL is set)? I've tried to work this out from the code, but am finding it difficult. harmony_options() mentions the use of tau, but I cannot find where it is used in the code. Is it used in lambda selection, or is it an alternative method? Thank you!

pati-ni commented 8 months ago

Hi @erflynn,

lambda is estimated at runtime, during the regression step. When lambda=NULL, lambda (which sets the ridge-regression shrinkage) is estimated for each covariate and differs for each cluster k and batch b. Harmony uses an expectation E: the expected number of cells from a batch in a cluster, given the current size of the cluster (in cells) and the number of cells in that batch. The smaller E is, the larger lambda gets, which shrinks the correction (that batch is corrected less in that cluster). This tends to protect against overcorrection in some cases. For simple datasets, I would not set that parameter, because we have tweaked other parts of our formula to avoid overcorrection. But if you notice the cost of the objective function increasing over harmony iterations (visible by setting plot_convergence=TRUE), then this automatic lambda estimation would be something to try out.
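To make the role of E concrete, here is a minimal sketch in Python rather than Harmony's own code. It assumes E for cluster k and batch b is the cluster size times the batch's share of all cells, which is one natural reading of the description above; the thread does not state the exact formula, nor how lambda is derived from E, so the sketch stops at E. The function name and the example sizes are hypothetical.

```python
def expected_counts(cluster_sizes, batch_sizes):
    """Expected cells per (cluster, batch) if batches were spread evenly.

    Assumption (not confirmed by the thread): E[k][b] = N_k * N_b / N,
    where N_k is the cluster size, N_b the batch size, N the total cells.
    """
    n_total = sum(batch_sizes)
    return [
        [n_k * n_b / n_total for n_b in batch_sizes]
        for n_k in cluster_sizes
    ]


# Hypothetical dataset: two clusters, one large batch and one small batch.
cluster_sizes = [600, 400]
batch_sizes = [900, 100]

E = expected_counts(cluster_sizes, batch_sizes)
for k, row in enumerate(E):
    for b, e_kb in enumerate(row):
        print(f"cluster {k}, batch {b}: E = {e_kb:.1f}")
```

Under this assumption the small batch has the smallest E in every cluster, so, per the explanation above, those are the cluster/batch pairs that would receive the largest lambda and the least correction.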

tau is used here to set theta and scale it according to the number of batches per covariate: theta <- theta * (1 - exp(-(N_b / (nclust * tau))^2))
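The effect of that expression is easier to see numerically. The sketch below is a direct Python transcription of the R expression quoted above, not Harmony's code. It treats N_b as a single per-batch size and assumes tau > 0; both are assumptions, since the thread does not define N_b or say what happens when tau is zero. The function name and example values are hypothetical.

```python
import math


def scale_theta(theta, n_b, nclust, tau):
    """Transcription of: theta * (1 - exp(-(N_b / (nclust * tau))^2)).

    n_b is assumed to be a per-batch size; tau must be positive here
    because the expression divides by nclust * tau.
    """
    if tau <= 0:
        raise ValueError("this sketch only covers tau > 0")
    ratio = n_b / (nclust * tau)
    return theta * (1.0 - math.exp(-(ratio ** 2)))


# Hypothetical values: theta = 2, 50 clusters, tau = 5.
for n_b in (50, 500, 5000):
    scaled = scale_theta(2.0, n_b, nclust=50, tau=5)
    print(f"N_b = {n_b:5d} -> scaled theta = {scaled:.4f}")
```

When N_b is small relative to nclust * tau, the scaled theta falls toward zero; when N_b is much larger, theta is left essentially unchanged.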

erflynn commented 8 months ago

thank you for the explanation! this is very helpful for understanding lambda estimation and tau!

what was updated in the latest harmony versions to mitigate the overcorrection?

pati-ni commented 8 months ago

Re "tweaked other parts of our formula to avoid overcorrection":

As described in the manuscript, the diversity penalty is now calculated as log(O+E/O) instead of log(O/E), and the optimization process is updated accordingly. If O gets small then the diversity penalty becomes very large. Small O values are especially likely when the batches contain non-overlapping cell types.

Re "lambda estimation":

The same logic is applied at the regression step: for small values of E, more shrinkage is applied to a given batch in a cluster.

erflynn commented 8 months ago

that's helpful - thanks for the info! which version includes this update?

pati-ni commented 8 months ago

Both version 1.2, distributed on CRAN, and the GitHub master branch implement these changes.

erflynn commented 8 months ago

great - thank you!