Thanks for raising this issue @jgallowa07.
As you know from our previous discussions about this, I agree that this is an important problem, though I am reluctant to set a default non-zero ridge weight on the betas. My concern is that there isn't a single value that would be one-size-fits-all. We expect many betas to have large negative values, and the ridge penalty needs to be weak enough not to penalize those too strongly. How weak it needs to be could depend a lot on the dataset in question. The default value we choose could be too strong in some cases, and if users don't know the ridge is there, it could impact their results without them knowing it.
What about instead including a section in the documentation about ways to troubleshoot model fitting or improve convergence where we talk about this?
That sounds fine to me. Moving this to a documentation issue and will close once included.
In that spirit, noting that the ridge seems to modestly stabilize fitting and improve model performance, at least with these simulations:
Without ridge penalty:
With ridge penalty:
Closing this as #148 encapsulates the action we decided to take here.
This issue outlines how and why we should change the model initialization defaults for the latent offset $\beta_0$, and maybe more importantly, the ridge coefficient parameter for regularizing the set of mutation effect parameters $\beta_m$.
Problem
By default, we initialize the latent offset parameter at $\beta_0 = 0$. As seen in our simulation work, this can lead to the model getting stuck in a poor local minimum when fitting, as shown below.
Here, we're fitting to a simulation where the true latent wildtype phenotype has a value of $5$. The problem seems to be that the betas blow up (take very large values) in an attempt to fit a (usually positive) wildtype latent phenotype; the model then uses the latent offset to try to correct for this behavior and gets stuck. The table below shows a collection of these same models fit at different initialization values for $\beta_0$.
We can see that there is some threshold of the $\beta_0$ initial value ("init_beta_naught") above which the model begins to fit correctly (avoiding that local minimum) -- 0.6 in this case.
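For concreteness, here is a minimal, self-contained sketch of the kind of sweep over `init_beta_naught` described above. This is not the package's actual API; it's just plain numpy/scipy on a toy sigmoid global-epistasis model, and the data, parameter names, and values swept are all made up for illustration. Whether the bad basin shows up will depend on the simulated data.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
n_variants, n_muts = 500, 50

# Binary genotype matrix (which mutations each variant carries), mostly
# negative true mutation effects, and a true wildtype latent phenotype of 5.
X = rng.integers(0, 2, size=(n_variants, n_muts)).astype(float)
true_betas = rng.normal(-1.0, 1.0, size=n_muts)
true_beta0 = 5.0
y = expit(true_beta0 + X @ true_betas)  # observed functional scores

def loss(params, ridge=0.0):
    """MSE of sigmoid(latent) predictions, plus an optional L2 (ridge) term on the betas."""
    beta0, betas = params[0], params[1:]
    pred = expit(beta0 + X @ betas)
    return np.mean((y - pred) ** 2) + ridge * np.sum(betas ** 2)

# Sweep the initial latent offset (analogous to `init_beta_naught` above).
for init_beta_naught in [0.0, 0.3, 0.6, 1.0, 5.0]:
    x0 = np.concatenate([[init_beta_naught], np.zeros(n_muts)])
    fit = minimize(loss, x0, method="L-BFGS-B")
    print(f"init beta_0 = {init_beta_naught:4.1f} -> "
          f"fit beta_0 = {fit.x[0]:6.2f}, loss = {fit.fun:.3e}")
```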
Proposed solution
Because our sigmoid is centered at 0, we often expect the latent phenotype of the wildtype to be greater than $0$, so it seems reasonable to set the initial latent offset value to something greater than $0$ (5?). However, this is a lazy fix, and we would like our model fitting to be more robust to the initial parameter values. As it turns out, a ridge ($L_2$) penalty on the set of mutation effect parameters $\beta_m$ also helps.
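In terms of the toy sketch above, the ridge is just an $L_2$ term added to the data loss. Continuing that sketch (reusing `loss`, `np`, `minimize`, and `n_muts` from it), re-running the sweep with a small ridge weight illustrates the idea; the exact behavior will of course depend on the data.

```python
# Same sweep as before, but now with a small ridge weight on the betas
# (1e-6 here, matching the scaling coefficient used in the table below).
for init_beta_naught in [0.0, 0.3, 0.6, 1.0, 5.0]:
    x0 = np.concatenate([[init_beta_naught], np.zeros(n_muts)])
    fit = minimize(loss, x0, args=(1e-6,), method="L-BFGS-B")
    print(f"init beta_0 = {init_beta_naught:4.1f} -> "
          f"fit beta_0 = {fit.x[0]:6.2f}, loss = {fit.fun:.3e}")
```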
Here's the same table, but with models fit to include a non-zero ridge penalty (scaling coefficient $= 1 \times 10^{-6}$), as opposed to the default of effectively no ridge penalty (scaling coefficient $= 0$).
Here, we see the model correctly infers the latent offset regardless of the initial value we choose. Since there are no other adverse effects AFAICT, I propose we set a non-zero ridge coefficient by default.
( cc @jbloom @WSDeWitt @Haddox )