jbloomlab / polyclonal

Model mutational escape from polyclonal antibodies.

implement epitope similarity regularization #128

Closed timcyu closed 1 year ago

timcyu commented 1 year ago

This PR implements the epitope similarity regularization idea in #124. It is not ready to merge yet.

I've included a test for epitope similarity regularization in test_gradients.py. Things match up well for the most part, but I noticed that the numerical and analytical gradients sometimes differ by more than the allowed threshold. I relaxed the allowed difference to 1e-6 (previously 1e-7), but I'm unsure whether this is appropriate. The gradients differ by as much as 2.4e-07 (e.g., numerical gradient 2.4e-07 vs. analytical gradient 0.0).
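For context, this kind of gradient test compares a finite-difference estimate against the analytical gradient and checks the maximum difference against a tolerance. A generic sketch of such a check (the function names here are hypothetical, not the actual test_gradients.py code):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp.flat[i] += eps
        xm = x.copy()
        xm.flat[i] -= eps
        g.flat[i] = (f(xp) - f(xm)) / (2 * eps)
    return g

# Example: for f(x) = sum(x**2) the analytical gradient is 2*x,
# and the finite-difference estimate should agree to within ~1e-6.
x = np.array([0.5, -1.0, 2.0])
num = numerical_grad(lambda v: np.sum(v ** 2), x)
assert np.max(np.abs(num - 2 * x)) < 1e-6
```

Note that central differences themselves carry error on the order of eps**2 plus floating-point noise, so a tolerance tighter than ~1e-7 can fail even when the analytical gradient is correct.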

I've also included a notebook showing the regularization helps deconvolve Frances' H3 HA mAb cocktail DMS data. This is a cocktail containing two mAbs that bind two distinct epitopes on HA. When you fit an n_epitopes=2 model with default parameters (no epitope similarity regularization), you get two identical escape profiles.

(image: escape profiles without epitope similarity regularization)

With the regularization, it's a different story! The "polyclonal serum" is successfully deconvolved.

(image: escape profiles with epitope similarity regularization)

Additionally, there's a notebook (which shouldn't be merged) that fits the same model as in fit_RBD.ipynb but with a small amount of epitope similarity regularization. Interestingly, this ruins the result by driving all the escape values for epitope 1 to ~0. My guess is that this happens because many of the epitope 1 escape sites have overlapping escape values at other epitopes. This suggests to me that the default penalty weight should be 0 unless it's clearly needed (as in the cocktail dataset above).

Lastly, the equations for the penalty and its gradient are in optimization.ipynb. I think the notation could probably be improved to make things more accurate and clear.

timcyu commented 1 year ago

@jbloom I agree that $S_k$ should not depend on $i$, $j$. For the derivatives, I agree we can drop the sum over all $k$, but I don't think we can completely drop $i$, $j$ together and replace them with $e$.

In the two-epitope case, I believe the derivative looks like this: $$\frac{\partial R_{\rm similarity}}{\partial \beta_{m,e}} = \lambda_{\rm similarity} \left[\frac{\partial \left(\sum_{m' \in S_k} \beta_{m',e}^2\right)}{\partial \beta_{m,e}} \left(\sum_{m' \in S_k} \beta_{m',j}^2\right) + \frac{\partial \left(\sum_{m' \in S_k} \beta_{m',j}^2\right)}{\partial \beta_{m,e}} \left(\sum_{m' \in S_k} \beta_{m',e}^2\right)\right]$$

$$= \lambda_{\rm similarity} \left(2\beta_{m,e} \sum_{m' \in S_k} \beta_{m',j}^2 + 0 \cdot \sum_{m' \in S_k} \beta_{m',e}^2\right) = 2\lambda_{\rm similarity}\, \beta_{m,e} \sum_{m' \in S_k} \beta_{m',j}^2$$ where $j$ is the other epitope and $m'$ ranges over all mutations (including $m$) at site $S_k$.

However, when there are more than 2 epitopes, you have to sum over the $\beta_{m,e}$ being multiplied by the norm of the $\beta_{m',j}$ at all other epitopes $j$. For example, if there are 3 epitopes, $R_{\rm similarity}$ sums over the dot products of 3 comparisons: $(1,2), (1,3), (2,3)$. So, in the general case, perhaps the best way to write the derivative is this: $$\frac{\partial R_{\rm similarity}}{\partial \beta_{m,e}} = 2\lambda_{\rm similarity} \sum_{j \neq e} \beta_{m,e} \sum_{m' \in S_k} \beta_{m',j}^2$$

This is what is implemented in the code. Let me know if you agree/disagree!
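For concreteness, here is a minimal NumPy sketch of a penalty and gradient of this form. This is an illustration only, not the actual polyclonal implementation: the function name, the `(n_mutations, n_epitopes)` layout of `beta`, and the `sites` array are assumptions.

```python
import numpy as np

def similarity_penalty_and_grad(beta, sites, lam):
    """Illustrative epitope-similarity penalty and its gradient.

    beta  : (n_mutations, n_epitopes) array of escape values beta_{m,e}
    sites : (n_mutations,) array giving the site of each mutation
    lam   : regularization weight lambda_similarity
    """
    site_ids = np.unique(sites)
    # Per-site sums of squared escape values: shape (n_sites, n_epitopes).
    sq = np.array([(beta[sites == s] ** 2).sum(axis=0) for s in site_ids])
    tot = sq.sum(axis=1)  # sum of the per-site squared norms over epitopes
    # Penalty sums products over distinct epitope pairs e < j at each site,
    # using sum_{e<j} sq_e*sq_j = ((sum_e sq_e)^2 - sum_e sq_e^2) / 2.
    penalty = lam * ((tot ** 2 - (sq ** 2).sum(axis=1)) / 2).sum()
    # Gradient: dR/dbeta_{m,e} = 2*lam*beta_{m,e} * sum_{j != e} sq[site(m), j].
    site_index = np.searchsorted(site_ids, sites)
    other = tot[site_index][:, None] - sq[site_index]
    grad = 2 * lam * beta * other
    return penalty, grad

# Small worked example: one site, two epitopes.
beta = np.array([[1.0, 2.0], [3.0, 0.0]])
sites = np.array([0, 0])
penalty, grad = similarity_penalty_and_grad(beta, sites, lam=1.0)
# penalty = (1 + 9) * (4 + 0) = 40; grad follows the derivative above.
assert np.isclose(penalty, 40.0)
assert np.allclose(grad, [[8.0, 40.0], [24.0, 0.0]])
```

In the two-epitope example, `grad[0, 0] = 2 * 1.0 * (4 + 0) = 8`, matching the closed-form derivative above.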

jbloom commented 1 year ago

@timcyu, looks good. My only comment is in the derivative, move the sum over $j \ne e$ to the right of $\beta_{m,e}$ since the $\beta_{m,e}$ term doesn't depend on $j$.

timcyu commented 1 year ago

@jbloom it's ready for your review again!

timcyu commented 1 year ago

@jbloom addressed your comments. I decided to rewrite the code using some of the matrix tricks in Will's Colab notebook, so it is far more efficient now. Let me know what you think.
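Will's notebook isn't shown in this thread, but one common "matrix trick" for this kind of computation is replacing a per-site Python loop with a vectorized scatter-add. A hypothetical illustration (the arrays and layout here are assumptions, not the PR's actual code):

```python
import numpy as np

beta = np.array([[1.0, 2.0], [3.0, 0.0], [0.5, 1.5]])  # (n_mutations, n_epitopes)
site_index = np.array([0, 0, 1])                        # site of each mutation

# Vectorized per-site sums of squared betas via scatter-add.
n_sites = site_index.max() + 1
sq = np.zeros((n_sites, beta.shape[1]))
np.add.at(sq, site_index, beta ** 2)

# Loop-based equivalent, for comparison.
sq_loop = np.array([(beta[site_index == s] ** 2).sum(axis=0)
                    for s in range(n_sites)])
assert np.allclose(sq, sq_loop)
```

The scatter-add form avoids one Python-level pass per site, which matters when there are hundreds of sites.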

jbloom commented 1 year ago

@timcyu, I made a few small tweaks in commit 51fcdb2 (see the full commit message for details). Otherwise this looks good, but let's implement the other approach (absolute value approximation: https://github.com/jbloomlab/polyclonal/issues/124#issuecomment-1272055558) and see which we prefer before deciding which pull request to merge.

timcyu commented 1 year ago

@jbloom I would suggest merging this regularization instead of the absolute value approximation. The reason is that both strategies can successfully deconvolve the mAb cocktail dataset, but the absolute value approximation is much more sensitive to the regularization weight, whereas I could reproduce the same result (shown above) across a larger range of weights with this squared Euclidean penalty.

We discussed this last week and I've included the relevant slides here for reference. Let me know if you have any final thoughts!

jbloom commented 1 year ago

@timcyu, can you experiment with what happens if you drop reg_activity_weight a lot, maybe even pushing it to 0? Particularly on the simulated RBD dataset. A lot of the issue seems to be that the typical non-epitope site is being fit to something more like ~0.5 than 0, which may be why you lose the third epitope. I wonder if that is driven by the regularization on activity, which is moving some of that value into the escape values.

timcyu commented 1 year ago

@jbloom dropping reg_activity_weight has no effect on either the simulated RBD or the mAb cocktail dataset.