Choosing which antibodies to make from non-seeded clusters

psathyrella commented 7 years ago

Kristian just explained what's going on in this paper, and it turns out to be precisely what we need to do with unseeded clusters to help @lauranoges figure out which sequences to synthesize.

Basically, we have a big clonal family and we want to know which sequences are likely to actually be good antibodies. There's a bunch of literature on the structural side showing that this maximum entropy approach is an effective and reasonably well-motivated way to do it.

He knows one of the authors (and is working on improving the method in a few ways) so will try to get a hold of something we can run when we're ready.

krdav commented 7 years ago

The paper is based on sparse validation data, but it seems legit: they don't fit on the validation data, the method is unsupervised, and other papers report similar correlations with protein stability and function.

The model is called a Potts model, and @dunleavy005 and I have been distracted from our aammp work recently because of it. Here is our paperpile on the topic: https://paperpile.com/shared/SebAmk

Among lots of junk: Ekeberg's "Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models" is the best in terms of describing the method, and is a very good paper on the methods and ideas used in nearly all the other models. "Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness" is a nice, short review. "Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1" is an example where the sequence likelihood is correlated with thermostability and enzyme function. "Mutation effects predicted from sequence co-variation" applies the method from the original Ekeberg paper to predict the deleteriousness of point mutations.
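For anyone who hasn't seen a Potts model before, here is a minimal sketch of how a sequence gets scored once the parameters are in hand. It assumes the standard form E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), with lower energy meaning higher probability. The function names, the dict-of-matrices layout for `J`, and the two-state toy parameters are all made up for illustration; real parameters would come from fitting an alignment (e.g. by pseudolikelihood as in the Ekeberg paper), which this sketch does not do.

```python
def potts_energy(seq, h, J):
    """Potts energy of one sequence (lower = more probable).

    seq: sequence encoded as integer states, one per position
    h:   h[i][a] is the field for state a at position i
    J:   J[(i, j)][a][b] is the coupling for states a, b at positions i < j
    (This data layout is an assumption for illustration.)
    """
    length = len(seq)
    energy = 0.0
    for i in range(length):
        energy -= h[i][seq[i]]
        for j in range(i + 1, length):
            energy -= J[(i, j)][seq[i]][seq[j]]
    return energy


def rank_by_energy(seqs, h, J):
    """Return sequences ordered from lowest to highest energy."""
    return sorted(seqs, key=lambda s: potts_energy(s, h, J))


# Toy parameters: 2 positions, 2 states each (hypothetical values).
h = [[0.0, 1.0], [0.0, 2.0]]
J = {(0, 1): [[0.0, 0.0], [0.0, 3.0]]}
candidates = [(0, 0), (0, 1), (1, 0), (1, 1)]
ranked = rank_by_energy(candidates, h, J)
```

In our setting the candidates would be the members of a clonal family, and the lowest-energy ones would be the first to consider for synthesis.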

metasoarous commented 7 years ago

This is relevant to the work on #178.

psathyrella commented 6 years ago

https://www.biorxiv.org/content/biorxiv/early/2017/10/19/145052.full.pdf

Among other things, this paper makes the point that we should really be looking at selection signatures, in addition to family size and overall SHM rates. That'd be easy, right?

matsen commented 6 years ago

In general, it's quite a hard problem.

But it is "easy" in the sense that what they use is implemented as open-source software that I think works for the most part.

https://paperpile.com/shared/skuK25 is a pretty cool paper that Horns didn't apply. Simple idea, and I haven't looked at their software.

@lauranoges also had some thoughts about the Horns paper and whether the same logic will transfer to our setting. @krdav thought that some of their sequence analysis was, well, clearly problematic. We'll have a good lab meeting about the paper!

matsen commented 6 years ago

@lauranoges asked if I could write out my list of components I think would be useful features in a classifier for finding interesting lineages. I'm not giving credit where it's due here, but the idea is to just put everything in a list. 📯 means that it's detailed in the Horns paper. Please comment/add.

gene usage
clonal family size
CDR3 length
pattern of hydrophobicity
amount of mutation
LONR score (imbalanced tree structure + amino acid change)
skewed site frequency spectrum 📯
evidence of selective sweeps (Fay-Wu H) 📯
persistence of clonal families through time 📯
local branching rate 📯
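
To make the Fay-Wu H item concrete, here is a rough sketch using the standard definition H = theta_pi - theta_H computed from the unfolded site frequency spectrum. It assumes the inferred naive sequence serves as the ancestral state and that any non-naive base at a site counts as "derived"; both are simplifying assumptions, and the function names are made up here. This is not the implementation used in the Horns paper.

```python
def derived_sfs(seqs, naive):
    """Unfolded site frequency spectrum for aligned, equal-length sequences.

    Returns a list where entry k-1 is the number of sites at which exactly
    k of the n sequences differ from the naive (ancestral) base.
    Fixed sites (k == n) and unmutated sites (k == 0) are skipped.
    """
    n = len(seqs)
    sfs = [0] * (n - 1)
    for pos, ancestral in enumerate(naive):
        k = sum(1 for s in seqs if s[pos] != ancestral)
        if 0 < k < n:
            sfs[k - 1] += 1
    return sfs


def fay_wu_h(sfs):
    """Fay and Wu's H = theta_pi - theta_H from an unfolded SFS."""
    n = len(sfs) + 1
    denom = n * (n - 1)
    theta_pi = sum(2 * i * (n - i) * xi for i, xi in enumerate(sfs, start=1)) / denom
    theta_h = sum(2 * i * i * xi for i, xi in enumerate(sfs, start=1)) / denom
    return theta_pi - theta_h


# Toy alignment (hypothetical): four sequences against a four-base naive.
naive = "AAAA"
family = ["AAAT", "AATT", "ATTT", "AAAA"]
h_stat = fay_wu_h(derived_sfs(family, naive))
```

A strongly negative H means an excess of high-frequency derived mutations, which is the signature of a selective sweep that the list item refers to.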

lauradoepker commented 6 years ago

Update: @psathyrella and I are working on/thinking about this. His plan is to create a few more plots and then run all the non-seeded analyses on the Overbaugh data sets. I'll help page through them all by eye and get a sense for the sample-to-sample variability.

@metasoarous and I decided to have CFT use cluster indexes that are assigned numerically by cluster size, i.e. the biggest cluster's index is 0 (or 1?). This way, @psathyrella and I can ask @metasoarous to run an unseeded ML analysis on our clusters of interest, specified by index.
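
A minimal sketch of that indexing scheme, just to pin down what we mean. This is not CFT's actual code; the function name and the `start` parameter (to cover the open "0 or 1?" question) are made up for illustration, and clusters are assumed to be plain lists of sequence IDs.

```python
def index_clusters_by_size(clusters, start=0):
    """Map index -> cluster, with the largest cluster getting the first index.

    clusters: list of clusters, each a list of sequence IDs
    start:    first index to assign (0 or 1, still undecided)
    Ties keep their input order because Python's sort is stable.
    """
    ordered = sorted(clusters, key=len, reverse=True)
    return {i: cluster for i, cluster in enumerate(ordered, start=start)}


# Toy clusters (hypothetical sequence IDs).
clusters = [["a"], ["b", "c", "d"], ["e", "f"]]
by_index = index_clusters_by_size(clusters)
```

One thing to keep in mind: an index assigned this way is only stable for a given partition, so the same index can point at a different family if the clustering is rerun.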

lauradoepker commented 6 years ago

Clonal family metrics that I ultimately used to score families for their "interest level":

size (number of clusters)
mean SHM
SHM Q1 (higher = whole family has mutated, while lower = some family members are still naive-ish)
SHM variance
Fay-Wu H score (measure of SFS/positive selection)
bnAb VH gene usage. I could/should have extended this to VH + VJ combination, but I didn't...
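
For reference, the SHM summaries above can be computed along these lines. This is only a sketch, not the code I ran: the function name is made up, "size" is taken here to be the number of sequences passed in (an assumption), each family needs at least two sequences, and Q1 uses the default "exclusive" method of Python's `statistics.quantiles`, so other tools may give slightly different quartiles. How the metrics get combined into an interest score is not shown.

```python
import statistics


def family_shm_metrics(shm_freqs):
    """Summarize per-sequence SHM frequencies for one clonal family.

    shm_freqs: list of mutation frequencies, one per sequence (needs >= 2)
    """
    return {
        "size": len(shm_freqs),
        "mean_shm": statistics.mean(shm_freqs),
        # First quartile: high means the whole family has mutated,
        # low means some members are still close to naive.
        "shm_q1": statistics.quantiles(shm_freqs, n=4)[0],
        "shm_variance": statistics.variance(shm_freqs),  # sample variance
    }


# Toy family (hypothetical values).
metrics = family_shm_metrics([0.02, 0.04, 0.06, 0.08])
```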

matsengrp / cft

Choosing which antibodies to make from non-seeded clusters #188