Epitopes are questionably separated despite the epitope similarity penalty

caelanradford commented 1 year ago

After trying a range of values for the fitting parameters while using the epitope similarity regularization penalty, I am still having trouble separating escape effects into believable epitopes. The fitted epitopes no longer mirror each other's effects, but they still do not make much intuitive sense. For example, mutations at the first residue of a glycosylation site are put in one epitope, while mutations at the third residue of the same glycosylation site (which are essentially the same mutation, since both knock out the glycan) are put in a second epitope. I also see cases where 5 or so consecutive sites in linear sequence cause escape, but each site is assigned seemingly at random to a different epitope depending on the parameters chosen, when we would expect most of them to belong to the same epitope.

This is likely an artifact of how the mutagenesis was done: the mutagenic primers overwrite adjacent mutations in subsequent rounds of PCR, so few variants carry mutations that are close together in linear sequence. This should make it difficult for polyclonal to determine whether these mutations are in the same epitope or in separate ones.

This could be improved by penalizing residues that are close together, either in linear sequence or in tertiary structure, for being assigned to different epitopes. Both approaches have drawbacks. Proximity in linear sequence obviously differs greatly from proximity in the tertiary structure, but for some proteins, portions of the protein are often not resolved in structures. This could be especially problematic for HIV Env, where the variable loops are often not resolved but are often very important for antibody escape. A linear sequence proximity penalty would probably be easier to implement and would probably alleviate at least some of what I am seeing.
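
A minimal sketch of what a linear sequence proximity penalty could look like is below. This is not part of polyclonal: the function name, the `window` and `weight` parameters, the 1/distance weighting, and the assumption that sites are plain integers (no insertion codes such as those in HXB2 numbering) are all illustrative assumptions.

```python
from itertools import combinations


def linear_proximity_penalty(site_escape, window=3, weight=1.0):
    """Penalize nearby sites that have escape in *different* epitopes.

    site_escape: dict mapping integer site number to a list of site-level
        escape values, one per epitope (hypothetical input format).
    window: maximum distance in linear sequence at which two sites are
        treated as neighbors.
    weight: overall strength of the penalty.
    """
    penalty = 0.0
    for i, j in combinations(sorted(site_escape), 2):
        dist = j - i
        if dist > window:
            continue
        esc_i, esc_j = site_escape[i], site_escape[j]
        # Sum products of escape magnitudes over all pairs of distinct epitopes.
        cross = sum(
            abs(esc_i[a]) * abs(esc_j[b])
            for a in range(len(esc_i))
            for b in range(len(esc_j))
            if a != b
        )
        # Closer sites contribute more strongly.
        penalty += cross / dist
    return weight * penalty
```

With this form, two neighboring sites whose escape sits in the same epitope contribute nothing, while the same two sites split across epitopes are penalized. Sites farther apart than `window` are ignored, so distinct epitopes elsewhere in the protein are unaffected.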

jbloom commented 1 year ago

Can you add the specific details?

caelanradford commented 1 year ago

For this I am mainly looking at the IDC508 serum. In the current version of my analysis, I have raised reg_escape_weight to 0.5 and reg_similarity_weight to 2, which makes the epitopes separate more cleanly, but results in only a couple of mutations meaningfully contributing to one of the epitopes. If these parameters are lowered much at all, sites 276 and 278 are often put into separate epitopes, which does not make any sense because they are essentially the same mutation (deleting the glycan). Sites 202 and 203 then often end up in separate epitopes as well, which also does not make much sense.
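
One quick way to track this across parameter choices is to check whether site pairs that are expected to share an epitope end up with the same dominant epitope. The sketch below is only an illustration: the helper names and input format are hypothetical, and the numbers are made up rather than taken from the IDC508 fit.

```python
def dominant_epitope(escape_by_epitope):
    """Return the index of the epitope with the largest absolute escape."""
    return max(
        range(len(escape_by_epitope)),
        key=lambda e: abs(escape_by_epitope[e]),
    )


def split_pairs(site_escape, expected_pairs):
    """Return the expected-together site pairs whose dominant epitopes differ."""
    return [
        (a, b)
        for a, b in expected_pairs
        if dominant_epitope(site_escape[a]) != dominant_epitope(site_escape[b])
    ]


# Made-up site-level escape values (one per epitope), for illustration only.
example = {
    276: [2.1, 0.1],
    278: [0.2, 1.9],
    202: [0.8, 0.1],
    203: [0.7, 0.2],
}
print(split_pairs(example, [(276, 278), (202, 203)]))  # prints [(276, 278)]
```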

Although the current parameters do make the epitopes separate more cleanly, the result is still not very believable to me, because most escape mutations are still grouped with the N276 glycan knockout. For an antibody recognizing that glycan, I would expect the knockout to be the strongest escape signal, rather than seeing comparable effects across all escape sites. The most believable result would be one epitope where the N276 knockouts have high effects along with a few other sites with low effects, and a second epitope that looks more 1-18-like, with low levels of escape across the protein. Of course, what I think is going on might not be going on at all. It will be hard to tell until I do some validations.

jbloom commented 1 year ago

I'm working on this. #134 should be the first step, and I will extend it to this use case soon.

matsen commented 1 year ago

💥 I can't wait to hear how this works out.

jbloomlab / polyclonal

Epitopes are questionably separated despite the epitope similarity penalty #132