Optionally enforce zero observed phenotype for WT in global epistasis model

wsdewitt commented 2 years ago

Observed phenotype data is sometimes in the form of a delta from WT, e.g. _delta log10(Ka) for binding affinity of variants with respect to the WT sequence. In these cases we'd like to enforce the global epistasis function g() to map the WT latent phenotype phi(0) to zero. Is this possible currently?

Since things are rescaled in the end such that phi(0) = 0, this could be interpreted as enforcing zero y-intercept on g(). Without this option, the model has slightly too much freedom when data are referenced to WT, and we end up predicting nonzero observed WT phenotype when it should be exactly zero by definition.

One easy approach to dealing with this could be to have an optional flag like wt0, and compute the model as

p(v) = g(phi(v)) - g(phi(0)),

revising gradients accordingly. One could complain that this model is now invariant to translating g() (adding any constant to g gets lost in the subtraction above). I guess this wouldn't confound optimization, but one could also add a small penalty on g(phi(0))^2 to remove any possible issue.

cc @jgallowa07

jbloom commented 2 years ago

I actually don't remember the detailed implementation enough to really comment intelligently on this off the top of my head. But I think you are right that the current approach simply re-scales at the end as described in the last paragraph of the Inference of Global Epistasis subsection of the Methods of Otwinoski et al.

I guess I don't fully understand how enforcing the wildtype during optimization is different than just re-scaling afterwards? Or is the difference that you want to enforce both the observed and latent wildtype phenotype to zero?

If the latter, the suggestions you make sound good. I do think that the Kinney / Otwinoski labs may have published more software in this area too, so you might also check if it's just possible to already do what you suggest using their software.

Also, if you end up making large-scale modifications, you might at least consider splitting the global epistasis fitting entirely out of the dms_variants. Initially I built it into dms_variants because I saw it all part of the same experiments. But in reality dms_variants is now an uncomfortable mix of the code for processing actual raw data and then fitting global epistasis models to processed data. This is unwieldy, and it would really be better to move the global epistasis to its own package. However, that would be quite a bit of effort, so is probably worth it if you are modifying the code a lot but not if you are just modifying it a little.

wsdewitt commented 2 years ago

To answer your clarification question, we weren't needing to enforce the WT during optimization, but we do want to enforce the observed WT phenotype to be zero when data is in the form of delta wrt WT. @jgallowa07 and I just discussed this, and I think we have two workarounds that don't require modifying dms-variants (so I will close issue):

Provide repeated dummy data entries for the WT with observed phenotype at 0. This lets the model learn that the WT should be at 0 observed phenotype. It's not perfect, since the prediction won't be exactly zero, but with more dummy entries, it will get arbitrarily close.
Fit the model on the raw scores (i.e. logKa instead of delta logKa) including the WT entires, then to make predictions for delta logKa we can compute: prediction(v) = p(v) - p(WT).

jbloomlab / dms_variants

Optionally enforce zero observed phenotype for WT in global epistasis model #80