jbloomlab / polyclonal

Model mutational escape from polyclonal antibodies.

regularizing 3D spread of epitopes #24

Closed jbloom closed 1 year ago

jbloom commented 2 years ago

This issue is just bringing up a point that @matsen @WSDeWitt @zorian15 @timcyu and I discussed yesterday: adding the option for a regularization term that enforces the idea that sites in an epitope should somehow be in 3D proximity.

Representing structural information

We imagine two ways the structural information could be passed to the model:

  1. A distance matrix of pairwise distances.
  2. A 3D coordinate for each site (maybe this one is better if we take the centroid approach)?

In either case, we have to keep in mind that some sites of a protein often lack structural information because they are missing from the crystal or cryo-EM structure.
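
For concreteness, a minimal sketch of what these two inputs might look like (the array names and shapes are hypothetical; NaN marks residues missing from the structure):

```python
import numpy as np

n_sites = 4  # toy example

# Option 1: pairwise distance matrix in angstroms; NaN where either site is unresolved
dist_matrix = np.full((n_sites, n_sites), np.nan)
dist_matrix[0, 1] = dist_matrix[1, 0] = 5.2

# Option 2: one 3D coordinate per site (e.g., a residue centroid);
# NaN rows for residues missing from the crystal or cryo-EM structure
coords = np.full((n_sites, 3), np.nan)
coords[0] = [10.1, 4.7, -2.3]
coords[1] = [12.5, 6.0, -1.1]
```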

What is the regularization term?

There are two ideas:

  1. Somehow penalize the overall average pairwise distance (or something like that) of the sites that are key players in the epitope.
  2. Define a centroid for the epitope, and somehow penalize the spread around it (@WSDeWitt suggested the second moment).

Maybe [2] is better as it would be order L rather than L^2 if there are L sites.

In terms of how this is done, maybe we weight each site in an epitope by the average of the squares of its betas to identify sites that are big players in the epitope. Then we use this to compute the centroid of the epitope and the spread around it. One point @jbloom thinks might be important is that this spread have some L1-like property in the sense that it doesn't become catastrophic for regularization if there are two spatially distant sites for some unexpected reason, although we do think this is unlikely.
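
As a rough sketch of that weighting idea (the function and variable names are hypothetical; `betas` is a sites × mutations array for one epitope, `coords` is sites × 3):

```python
import numpy as np

def epitope_spread(betas, coords, l1_like=True):
    """Spread of an epitope about its beta-weighted centroid."""
    weights = (betas ** 2).mean(axis=1)   # big players in the epitope get big weight
    weights = weights / weights.sum()
    centroid = weights @ coords           # weighted centroid of the epitope
    sq_dists = ((coords - centroid) ** 2).sum(axis=1)
    if l1_like:
        # penalize distance rather than squared distance, so a single
        # unexpectedly distant site is not catastrophic
        return weights @ np.sqrt(sq_dists)
    return weights @ sq_dists             # weighted second moment
```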

wsdewitt commented 2 years ago

For Representing structural information, another approach is to create a neighbor graph based on a distance threshold of a few Å. Sites are represented as vertices in the graph, and edges exist between vertices that represent sites that are within the threshold distance in the structure (unresolved residues would be isolated vertices, or we could try using only primary sequence neighbors for those).

There are a number of graph-based regularization approaches that we may be able to adapt for the graph representation. I think we would want to regularize such that each epitope corresponds to a subgraph with a small number of connected components/cliques (or just one) that have nonzero coefficients.

matsen commented 2 years ago

Staring at the Barnes et al 2020 paper that set up the "Bjorkman nomenclature", I agree that a graph-based approach seems like a good idea. We are interested in contiguous regions of the 3D structure.

On the intuitive level, a simple penalty on pairs of sites connected by graph edges seems like a good idea. If the penalty is L1, this corresponds to k=0 trend filtering and has been called the edge lasso. If the penalty is L2 then it's the network lasso. (@WSDeWitt -- what's not to love about that?)

I note that we are thinking site-wise, not coefficient-wise, right now. Jesse, you were thinking about using the max of the betas at each site as the site-wise escape value. Although max isn't differentiable, we could easily use LogSumExp instead.
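
For example, a smooth stand-in for the site-wise max could look like this (the temperature `tau` is my addition; larger values approach the hard max):

```python
import numpy as np
from scipy.special import logsumexp

def smooth_max(betas_at_site, tau=10.0):
    """Differentiable approximation to max(betas_at_site) via LogSumExp."""
    return logsumexp(tau * np.asarray(betas_at_site)) / tau

smooth_max([0.1, 0.9, 1.2])  # ~1.20 for tau=10; approaches the true max 1.2 as tau grows
```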

jbloom commented 2 years ago

I didn't know about LogSumExp, that is cool!

Is there some equivalent of LogSumExp which would correspond to absolute values? Although it is probably not that important as I think we will almost always measure beta ~ 0 or beta > 0, there are a few cases where we can think of important mutations having beta < 0. This would mostly be if the sera were elicited against an earlier version of the virus that then got a mutation that escaped some antibodies in the sera. In that case, reversion of that mutation would have a negative beta. Whether we'd have experimental power to measure this would depend on the concentrations used in the experiment, but in some cases it could be measurable.

So although not crucial, ideally we'd somehow have a version of LogSumExp that corresponded to the max absolute value.

As far as whether to actually use a max-like function (e.g., LogSumExp) versus just the mean absolute value or mean-square beta for this 3D spread, I don't feel strongly. Probably max might be a bit better, but I doubt it would make much difference. In our DMS so far, we find that using max and mean metrics at a site tend to be pretty similar in terms of how things look. But maybe max is slightly better in principle?

As far as the whole idea of a graph-based approach, I have to admit I don't really understand it from the links above. Is there a more intuitive explanation? Or maybe I just need a better conceptual understanding of how this relates to our problem. I guess I don't really get how penalizing pairs of sites connected by graph edges penalizes the spatial spread of the epitope, probably because I don't understand the graph-based approach well.

matsen commented 2 years ago

The basics of the graph approach are:

  1. take a 3D structure
  2. compute pairwise distances between the atoms
  3. build a graph in which the atoms are the nodes and the edges are the pairs of atoms that are closer than some threshold
  4. penalize (in some way) differences between coefficients that are connected by an edge

Does that make sense? It's just thresholding distances. Note that the penalty weights could be determined by the proximity of the atoms.
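
A minimal sketch of these steps (residue-level coordinates and an 8 Å cutoff are placeholder assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def neighbor_edges(coords, cutoff=8.0):
    """Steps 2-3: pairwise distances, then edges between sites within the cutoff."""
    dists = squareform(pdist(coords))
    ii, jj = np.where((dists > 0) & (dists < cutoff))
    return [(i, j) for i, j in zip(ii, jj) if i < j]

def edge_penalty(site_escape, edges):
    """Step 4: L1 penalty on differences across edges (k=0 trend filtering /
    edge lasso); an L2 version on vector-valued coefficients would be the
    network lasso."""
    return sum(abs(site_escape[i] - site_escape[j]) for i, j in edges)
```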

jbloom commented 2 years ago

That makes sense.

matsen commented 2 years ago

We could use solvent accessibility or some such feature to encourage epitopes to be on the surface of the protein. Is this a good idea or not?

jbloom commented 2 years ago

The solvent accessibility would make some biological sense as that is usually where escape mutations are, but I'm not sure it is that needed. I feel like the experimental data will pretty directly separate which sites have actual escape and which don't. I think the main inferential complexity comes in grouping those sites with escape into epitopes. So if a site is buried and just never has any escape, I think that will be obvious.

This does flag the point that we don't want the 3D-based regularization to be so strong relative to the regularization on the beta values that it artificially drags up the beta values of sites that are near escape sites but aren't escape sites themselves. In other words, we maybe somehow want the regularization of beta to outweigh the proximity regularization. For example, if site 484 clearly escapes, we don't want the proximity regularization to drag up the escape of site 485 too strongly just because they are in proximity.

wsdewitt commented 2 years ago

> On the intuitive level, a simple penalty on pairs of sites connected by graph edges seems like a good idea. If the penalty is L1, this corresponds to k=0 trend filtering and has been called the edge lasso. If the penalty is L2 then it's the network lasso. (@WSDeWitt -- what's not to love about that?)

I was thinking what we want is site-wise group sparsity structure that respects the graph (non-zero sites form a subgraph with one or a few connected components). The edge lasso methods don't quite do that, right? They find piecewise-constant solutions with pieces that respect the graph structure (but aren't encouraged to be sparse). The notion of "sparsistency" is on first differences, so pieces aren't encouraged to drop out. It seems like we mostly care about sparsity patterns, and not making coefficients similar between neighboring sites. I'm not sure that penalizing LogSumExp will be straightforward in terms of the modification needed in the prox (but maybe it is).

In fused lasso, there are two penalties: an L1 lasso term (to encourage sparsity) and an L1 lasso on first differences (to encourage the sparsity patterns to be clumpy). It seems like we want to take this idea and extend it in two directions: 1) we want L2 penalties (not squared L2) over sites to get site-wise sparsity, and 2) we want to use a graph (representing 3D neighbors) to induce spatially clumpy site-wise fusion. A subtlety with point 2 is that we don't want to fuse individual coefficients, since sites will have various amino acid preferences, but instead we can fuse their L2 norms as a weaker notion of fusion. I think this can be done with augmented Lagrangian methods (see attached scribbles: graph-fused group lasso).
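
In symbols, one plausible reading of this (my notation, which may not match the attached notes), with E the set of 3D-neighbor edges and beta_r the coefficient vector at site r:

```math
\min_{\beta}\; \mathcal{L}(\beta)
  \;+\; \lambda_1 \sum_r \lVert \beta_r \rVert_2
  \;+\; \lambda_2 \sum_{(r,s) \in E} \Bigl|\, \lVert \beta_r \rVert_2 - \lVert \beta_s \rVert_2 \,\Bigr|
```

The first penalty gives site-wise (group) sparsity, and the second fuses the site-level norms, rather than individual coefficients, across 3D-neighbor edges.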

wsdewitt commented 2 years ago

I noticed the above can be simplified by bound constraining theta to the non-negative orthant. Then the equality constraint that theta matches the 2-norms of the site-wise betas can be written as their squares matching, and this gives the Lagrangian as a smooth function of beta (and the only non-smoothness is the fused lasso terms for theta).
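
One plausible reading of that simplification (my notation; the actual notes may differ): with theta_r >= 0 constrained so that theta_r^2 = ||beta_r||_2^2, the augmented Lagrangian could look like

```math
\mathcal{L}_\rho(\beta, \theta, \mu)
  = \mathcal{L}(\beta)
  + \lambda_1 \sum_r \theta_r
  + \lambda_2 \sum_{(r,s) \in E} \lvert \theta_r - \theta_s \rvert
  + \sum_r \mu_r \bigl( \theta_r^2 - \lVert \beta_r \rVert_2^2 \bigr)
  + \frac{\rho}{2} \sum_r \bigl( \theta_r^2 - \lVert \beta_r \rVert_2^2 \bigr)^2 ,
```

which is smooth in beta; with the non-negativity bound the lasso term on theta is linear, so the fused-lasso terms in theta are the only non-smooth pieces.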

matsen commented 2 years ago

This is great, Will!

I do think it's worth discussing this statement:

> It seems like we mostly care about sparsity patterns, and not making coefficients similar between neighboring sites.

because I'd like to make sure that Jesse agrees here.

Along these lines, I can imagine a world in which we want to penalize the L2 difference between the site-wise escape values of neighboring sites, to get a smoother transition from epitope to non-epitope escape values. For those not familiar with the fused lasso: in one dimension, the idea is that it encourages coefficients to be piecewise constant, as shown in black in the attached figure.
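
For reference, the standard 1D fused lasso objective is

```math
\min_{\beta \in \mathbb{R}^n} \; \tfrac{1}{2} \sum_{i=1}^{n} (y_i - \beta_i)^2
  \;+\; \lambda_1 \sum_{i=1}^{n} \lvert \beta_i \rvert
  \;+\; \lambda_2 \sum_{i=1}^{n-1} \lvert \beta_{i+1} - \beta_i \rvert ,
```

where the penalty on first differences is what drives neighboring coefficients to be exactly equal, producing the piecewise-constant fits.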

jbloom commented 2 years ago

@matsen @WSDeWitt, maybe what Will is saying makes sense. I feel like this may be best to discuss in person (i.e., Zoom).

The main thing I don't understand in Will's formulation (which could just be not understanding the notation) is where the penalty on large distances comes in. Keep in mind that the thing we really want to penalize is two residues that are very far apart in the 3D structure both making major contributions to the same epitope. I think I maybe understand (if two sites are very distant and share no neighbors, then you get a penalty from each to all its neighbors), but I'm not exactly sure.

We do want to make sure we are maintaining some notion of penalizing epitope sharing of residues that are very far apart more than sharing of ones that aren't in direct proximity but are still in the same vicinity. In other words, residues are often defined as being in contact if they are within 4.5 to 6 angstroms of each other. But it's not too surprising if an epitope sometimes shares residues that are, say, 15 angstroms apart if the antibody is contacting two distinct regions. But it's very unexpected for an epitope to share two residues that are 80 angstroms apart.

So given the above, we should just get clear in our heads (and it isn't clear, at least to me) whether the graph approach is better than Will's original centroid second-moment idea.

As far as @matsen's question in Slack: yes, for any protein with a PDB structure we can calculate for input either:

  1. a coordinate for each residue
  2. a distance between each pair of residues

These coordinates / distances would presumably be per amino-acid residue rather than per atom, and there are various established methods for aggregating atomic coordinates to define amino-acid-residue-level coordinates. This processing should definitely be done outside of polyclonal on a per-protein basis, as there are all sorts of potential complications relating to how to handle inter-monomer distances in homo-oligomeric proteins, missing residues, PDB residue numbering schemes, etc.

This last practical point about homo-oligomeric proteins makes me think that maybe a residue-distance-based approach could have some advantages over a coordinate-based approach. Some antibodies span multiple monomers in a homo-oligomer (most viral entry proteins are homo-trimers). So if we just define coordinates for residues in a monomer, we can't capture the oligomer symmetry properties that might make residues X and Y distant in the monomer but nearby in the oligomer, and so potentially bound by an oligomer-spanning antibody. But if we define distances we can do this, for instance by defining the distance as the closest occurrence of the other residue in either the same monomer or another monomer. I'm not sure how important this point is, but it would be relevant for at least some antibodies.
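
As a sketch of the kind of per-protein preprocessing this implies (done outside of polyclonal; this assumes a Biopython-parsed structure, uses only CA atoms, and glosses over missing residues and numbering schemes):

```python
import numpy as np
from Bio.PDB import PDBParser  # assumes Biopython is available

def min_inter_residue_distances(pdb_path, chain_ids):
    """CA coordinates for each residue number in each monomer chain, then the
    minimum distance for each residue pair over all monomer copies (so sites
    that are distant within a monomer but close across the oligomer interface
    get a small distance). Assumes the chains share a numbering scheme."""
    model = PDBParser(QUIET=True).get_structure("prot", pdb_path)[0]
    coords = {}  # residue number -> list of CA coords, one per monomer copy
    for chain_id in chain_ids:
        for res in model[chain_id]:
            if "CA" in res:
                coords.setdefault(res.id[1], []).append(res["CA"].coord)
    sites = sorted(coords)
    dists = np.zeros((len(sites), len(sites)))
    for a, r in enumerate(sites):
        for b, s in enumerate(sites):
            dists[a, b] = min(np.linalg.norm(x - y)
                              for x in coords[r] for y in coords[s])
    return sites, dists
```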

wsdewitt commented 2 years ago

Thanks for the comments. I’ve added a 2nd section to my notes titled “Graph Laplacian-smoothed group lasso”. This formulates a regularization approach addressing @matsen‘s point that we probably want smoothness of the site-level norm within an epitope (but not necessarily piecewise-constant norm), and also @jbloom’s point that we should incorporate explicit distance information, not simply neighbor-ness. Combining these ideas motivates a method that encourages spatial smoothness in the site-level norms that are nonzero (see the attached notes).

matsen commented 2 years ago

Very nice, Will! I look forward to discussing on Weds.

It seems to me that we can express Jesse's desired objective with regularization as wanting to have a single connected subgraph for the epitope sites. I love that you've brought in the graph Laplacian, because we can express a smooth level of connectedness using the second-smallest eigenvalue of the graph Laplacian. That doesn't mean that it's obvious how to perform optimization using such a regularization! (I poked around for other papers using the graph Laplacian for regularization and found this, which has some related ideas but is about a discrete-classification problem.)

We'll discuss in person, but I'd like to make sure that we're capturing Jesse's desire for distant residues to not contribute to the same epitope. It's not obvious to me that the above graph Laplacian penalty will really mind having two smooth bumps. Like he says, it seems like a simpler regularization that just penalizes based on distances could be useful here.
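
For example, the connectedness of the subgraph of candidate epitope sites can be summarized by the second-smallest Laplacian eigenvalue (a sketch; `adjacency` is a symmetric binary or weighted matrix restricted to those sites):

```python
import numpy as np

def algebraic_connectivity(adjacency):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A;
    it is positive exactly when the (sub)graph is connected."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals = np.linalg.eigvalsh(laplacian)  # ascending order; eigvals[0] is ~0
    return eigvals[1]
```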

I will also add to the agenda that we continue to think about how we will benchmark these various approaches. Right now if I understand correctly the code is doing a reasonable job on simulation, so how do we need to make the problem harder?

wsdewitt commented 2 years ago

Cool, thanks for the Laplacian links! Some more thoughts from today:

The above got me thinking about the initial "2nd moment from the centroid" idea we discussed. This could be better, since it doesn't impose local smoothness, but does globally bound the topology of the epitope to a single region.

To experiment with these ideas, I have a colab notebook for an analogous 1D least squares problem. In this toy example I've implemented all the penalties we've discussed. My goal in this experiment is to find a penalization approach that recovers the big bump well, but ignores the small bump that is far away. The only method that works for this so far is combining lasso penalty (for sparsity) and centroid moment penalty (for spatially bounding the non-sparse component). I can get e.g. trend penalties to ignore the small bump, but this comes at the cost of losing the complex features of the big bump (figure attached).

matsen commented 2 years ago

Nice, Will! I love that you're trying these out in a tractable setting.

By the way, I tried setting lambda_lasso to zero in your notebook and got results I couldn't distinguish from the nonzero version (figure attached).

This was after rerunning from scratch.

jbloom commented 2 years ago

I am sufficiently behind in my conceptual understanding of the algorithms here that I just look forward to chatting in person tomorrow to ask questions and hear more!

wsdewitt commented 2 years ago

@matsen

> By the way, I tried setting lambda_lasso to zero in your notebook and got results I couldn't distinguish from the nonzero version

Yeah that looks pretty good too. I guess there are no true zeros though.

matsen commented 2 years ago

I point it out because if we don't use lasso, then we don't need any prox with the latest formulation. We'll discuss.

wsdewitt commented 2 years ago

I’ve updated my penalty demo notebook to work on a 3D structure. It doesn’t work in Google Colab, so I’ve pushed it to a folder in a new branch. The protein viz widgets don't render in the GitHub view of the notebook, so you'll have to spin it up (as described in the README) to see how the epitope inference looks on the structure. Here are a few highlights:

Another thing I'd like to try is replacing the centroid-based spread penalty with a different approach that uses the adjacency matrix, and doesn't require computing a centroid at each iterate. For each site pair (i, j), we could have a penalty theta_i * theta_j / A_ij, where A is the adjacency depicted above. It seems like this would impose local clumping in a similar way as the 2nd-moment-from-centroid idea, since it will encourage either theta_i or theta_j to be zero when A_ij is small?

jbloom commented 2 years ago

@WSDeWitt, this looks pretty good to me. I must admit I don't fully understand the adjacency matrix A_ij; this is somehow different from a distance matrix, right? Is it defined over all pairs i and j?

wsdewitt commented 2 years ago

@jbloom The adjacency matrix has the same shape as the distance matrix, but the elements take values between 0 and 1 (instead of distance in Å) that represent neighborness. If we used a distance threshold, this would be a binary matrix where the elements in the distance matrix that are less than the threshold would give a 1 in the adjacency matrix (indicating they are adjacent), and the distances larger than the threshold would give a 0 (indicating they are not adjacent). Instead of a threshold, I’m using a smooth kernel exp(-d_ij / bandwidth) to get values in the range [0, 1], where the bandwidth parameter plays a similar role to the threshold as it defines the characteristic length scale for identifying neighbors.
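
A sketch of the kernel together with the pairwise penalty from the previous comment (the bandwidth value and the small epsilon guarding against division by zero are my assumptions):

```python
import numpy as np

def adjacency_from_distances(dist_matrix, bandwidth=8.0):
    """Smooth neighborness in (0, 1]: exp(-d_ij / bandwidth)."""
    return np.exp(-dist_matrix / bandwidth)

def locality_penalty(theta, adjacency, eps=1e-8):
    """Sum over site pairs of theta_i * theta_j / A_ij, which is large whenever
    two sites with nonzero theta have small adjacency (i.e., are far apart)."""
    i, j = np.triu_indices_from(adjacency, k=1)
    return np.sum(theta[i] * theta[j] / (adjacency[i, j] + eps))
```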

jbloom commented 2 years ago

Makes sense, thanks.

matsen commented 2 years ago

Nice!

Would it be worth looking at the error in terms of distance to the centroid so we can see if the regularization is over-shrinking the outside of the epitope?

wsdewitt commented 2 years ago

Maybe. But I think in practice we would tune this regularization to give epitopes that look, by eye, reasonable in shape/extent. I pushed an implementation of this adjacency-based idea for the locality penalty, and it seems to perform similarly to the centroid approach, but is more stable (the centroid has a singularity that causes instability when the parameters have high sparsity). Here is a resulting inferred epitope using the lasso penalty and this new locality penalty (noisy signal on left, inference on right; figures attached).

wsdewitt commented 2 years ago

Oh maybe I misunderstood your suggestion. It could be interesting to look at residuals stratified by distance to the true epitope centroid. Is that what you meant?

matsen commented 2 years ago

Yes, that's what I meant.

timcyu commented 2 years ago

@matsen @WSDeWitt I think this graph Laplacian-smoothed group lasso is super cool, but I'm having trouble fully understanding it (mostly due to lack of background knowledge on the subject). I'm wondering:

  1. Why is the augmented Lagrangian necessary for optimizing this objective? I'm confused about why we need to split it into theta and beta. The way I'm interpreting it is you're taking an unconstrained objective (Eq. 1) and turning it into a constrained problem where we want theta = beta (Eq. 3), and the augmented Lagrangian then converts it back into an unconstrained objective that encourages this equality because of the mu and rho terms that multiply the constraint function. But why can't the original unconstrained problem just be optimized on its own?
  2. Why does the site-wise sparsity penalty go from being an L2-norm of the betas to an L1-norm of theta?

Sorry if these are dumb questions! This is all new to me but I'm curious.

wsdewitt commented 2 years ago

Thanks for the good questions @timcyu!

  1. The problem with optimizing directly is that the L1 and L2 norms aren't smooth functions (they're not differentiable at 0), which breaks the usual guarantees of standard gradient descent. To deal with this, there are so-called proximal gradient methods, which only use the gradient of the smooth terms in the objective (e.g. the loss and the squared L2/ridge penalty) to take a step, then use the proximity operator for the non-smooth terms to project that step back toward the constraint set corresponding to the non-smooth penalty. The attached illustration shows a gradient step on a smooth surface followed by projection to a square constraint region; the pair of steps (gradient step, then projection) defines one iteration of a proximal gradient method. It is well known how to do this with L1 penalties (lasso) or L2 penalties on groups (group lasso), and each has its own proximity operator to do the projection step. But in our problem, we impose site-level norm constraints, and end up with terms that are non-smooth functions of non-smooth functions of the beta parameters (e.g. Eqn 4 in my notes is a smooth function of the thetas, but if you substitute the thetas with betas, it is a non-smooth function of the betas). So we don't have standard proximity operators for these penalties. The augmented Lagrangian is a dirty trick to transform the problem into one that looks, at first, needlessly more complicated, but allows us to use the standard proximity operators on the augmented variables.
  2. The group lasso penalty can be written as an L1 norm of the vector of L2 (Euclidean) norms for each group (note the L2 norm is nonnegative, so the outer-level L1 norm can just add them over sites, without worrying about taking absolute values). This is intuitive, since it's like taking a standard lasso of the group-level L2 (Euclidean) norms, so it encourages parameter groups to collapse to the zero vector, giving group-level sparsity. Since we require that theta equal the site-wise L2 (Euclidean) norms, we are just left with the outer L1 norm.

Happy to discuss further if the above doesn't make sense!
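
For reference, the proximity operator for the group (L2-norm) penalty mentioned in point 1 is block soft-thresholding; a minimal sketch of the generic operator (not polyclonal's implementation):

```python
import numpy as np

def group_soft_threshold(beta_group, lam):
    """Proximity operator of lam * ||.||_2: scale the whole group toward zero,
    and set it exactly to the zero vector if its norm is below lam."""
    norm = np.linalg.norm(beta_group)
    if norm <= lam:
        return np.zeros_like(beta_group)
    return (1.0 - lam / norm) * beta_group
```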

timcyu commented 2 years ago

Thanks @WSDeWitt, I think this makes sense to me now. If I understand correctly, the idea is that the objective in Eq. 1 has non-smooth terms, so we need to use a proximal gradient method. However, those non-smooth terms don't have standard proximity operators. Substituting theta for beta and using the augmented Lagrangian gives us a seemingly more complicated objective, but one where all the non-smooth terms have standard proximity operators.

wsdewitt commented 2 years ago

Exactly right @timcyu!

@matsen suggested that the next stage for prototyping these spatially structured sparsity methods could be to use a site-wise model (where there is just one parameter for each site, representing any non-WT state, rather than a parameter for each non-WT AA). Although this is overly simple, it would allow us to proceed without the need for augmented Lagrangians for now, and I think @jbloom mentioned there is already a site-wise model being used as an initialization step. We could first assess performance of these regularizations in the setting of the site-wise model, then, if we like what we see, move to the full AA-wise model and the augmented Lagrangian.

timcyu commented 2 years ago

@WSDeWitt that sounds like a good idea to me!

jbloom commented 1 year ago

Addressed in #134