Thanks for raising this issue @jgallowa07.
As you know from our previous discussions about this, I agree that this is an important problem, though I am reluctant to set a default non-zero ridge weight on the betas. My concern is that there isn't a single value that would be one-size-fits-all. We expect many betas to have large negative values, and the ridge penalty needs to be weak enough not to penalize those too strongly. How weak it needs to be could depend a lot on the dataset in question. The default value we choose could be too strong in some cases, and if users don't know the ridge is there, it could impact their results without them knowing it.
What about instead including a section in the documentation about ways to troubleshoot model fitting or improve convergence where we talk about this?
That sounds fine to me. Moving this to a documentation issue and will close once included.
In that spirit, noting that the ridge seems to modestly stabilize fitting and improve model performance, at least with these simulations:
Without ridge penalty:
With ridge penalty:
Closing this as #148 encapsulates the action we decided to take here.
This issue outlines how and why we should change the model initialization defaults for the latent offset $\beta_0$, and maybe more importantly, the ridge coefficient parameter for regularizing the set of mutation effect parameters $\beta_m$.
Problem
By default, we initialize the latent offset parameter at $\beta_0 = 0$. As seen in our simulation work, this can lead to the model getting stuck in a poor local minimum when fitting, as shown below.
Here, we're fitting to a simulation where the true latent wildtype phenotype has a value of $5$. The problem seems to be that the betas blow up (take very large values) in an attempt to fit a (usually positive) wildtype latent phenotype; the model then uses the latent offset to try to correct for this behavior and gets stuck. The table below shows a collection of these same models fit at different initialization values for $\beta_0$.
We can see that there is some threshold of the $\beta_0$ initial value ("init_beta_naught") above which the model begins to fit correctly (avoiding that local minimum) -- 0.6 in this case.
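For concreteness, here is a minimal, self-contained sketch of the kind of sweep over `init_beta_naught` described above. This is not the package's actual API; it's just plain numpy/scipy on a toy sigmoid global-epistasis model, and the data, parameter names, and values swept are all made up for illustration. Whether the bad basin shows up will depend on the simulated data.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
n_variants, n_muts = 500, 50

# Binary genotype matrix (which mutations each variant carries), mostly
# negative true mutation effects, and a true wildtype latent phenotype of 5.
X = rng.integers(0, 2, size=(n_variants, n_muts)).astype(float)
true_betas = rng.normal(-1.0, 1.0, size=n_muts)
true_beta0 = 5.0
y = expit(true_beta0 + X @ true_betas)  # observed functional scores

def loss(params, ridge=0.0):
    """MSE of sigmoid(latent) predictions, plus an optional L2 (ridge) term on the betas."""
    beta0, betas = params[0], params[1:]
    pred = expit(beta0 + X @ betas)
    return np.mean((y - pred) ** 2) + ridge * np.sum(betas ** 2)

# Sweep the initial latent offset (analogous to `init_beta_naught` above).
for init_beta_naught in [0.0, 0.3, 0.6, 1.0, 5.0]:
    x0 = np.concatenate([[init_beta_naught], np.zeros(n_muts)])
    fit = minimize(loss, x0, method="L-BFGS-B")
    print(f"init beta_0 = {init_beta_naught:4.1f} -> "
          f"fit beta_0 = {fit.x[0]:6.2f}, loss = {fit.fun:.3e}")
```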
Proposed solution
Because our sigmoid is centered at 0, we often expect the latent phenotype of the wildtype to be greater than $0$, so it seems reasonable to set the initial latent offset value to something greater than $0$ (5?). However, this is a lazy fix, and we would like our model fitting to be more robust to the initial parameter values. As it turns out, a ridge ($L_2$) penalty on the set of mutation effect parameters $\beta_m$ also helps.
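In terms of the toy sketch above, the ridge is just an $L_2$ term added to the data loss. Continuing that sketch (reusing `loss`, `np`, `minimize`, and `n_muts` from it), re-running the sweep with a small ridge weight illustrates the idea; the exact behavior will of course depend on the data.

```python
# Same sweep as before, but now with a small ridge weight on the betas
# (1e-6 here, matching the scaling coefficient used in the table below).
for init_beta_naught in [0.0, 0.3, 0.6, 1.0, 5.0]:
    x0 = np.concatenate([[init_beta_naught], np.zeros(n_muts)])
    fit = minimize(loss, x0, args=(1e-6,), method="L-BFGS-B")
    print(f"init beta_0 = {init_beta_naught:4.1f} -> "
          f"fit beta_0 = {fit.x[0]:6.2f}, loss = {fit.fun:.3e}")
```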
Here's the same table, but with models fit to include a non-zero ridge penalty (scaling coefficient $= 1 \times 10^{-6}$), as opposed to the default of effectively no ridge penalty (scaling coefficient $= 0$).
Here, we see the model correctly infers the latent offset regardless of the initial value we choose. Since there are no other adverse effects AFAICT, I propose we set a non-zero ridge coefficient by default.
( cc @jbloom @WSDeWitt @Haddox )