Open EvaFlower opened 11 months ago
Hi, LaMBO-2 works fine for me when I replicate their sasa experiments. You first need to train the diffusion model with guidance (see the last 3 parameters below); change "OBJECTIVE NAME 1" to "sasa" for their example:
PYTHONPATH="." python scripts/train_seq_model.py \
  model=[MODEL TYPE] \
  model.optimizer.lr=[MODEL LR] \
  data_dir=[DATASET DIRECTORY] \
  train_fn=[TRAINING CSV FILE] \
  val_fn=[VALIDATION CSV FILE] \
  vocab_file=[VOCAB FILE IN THIS REPO'S BASE DIR] \
  log_dir=[LOGGING DIRECTORY] \
  'target_cols=[[OBJECTIVE NAME 1], ..., [OBJECTIVE NAME K]]' \
  model.network.target_channels=[K] \
  discr_batch_ratio=[RATIO OF GENERATIVE LOSS UPDATES TO DISCRIMINATIVE]
Then you can select one of your model weight checkpoints and run the LaMBO-2 sampling, which optimises for your model's objectives:
PYTHONPATH="." python scripts/control/sample_diffusion.py \
  model=mlm \
  model.network.target_channels=1 \
  ckpt_path=[CKPT PATH, MODEL TRAINED FOR BETA SHEETS] \
  +guidance_kwargs.step_size=1.0 \
  +guidance_kwargs.stability_coef=0.01 \
  +guidance_kwargs.num_steps=10 \
  +seeds_fn=[PATH TO poas_seeds.csv] \
  +results_dir=[RESULTS DIR]
@EvaFlower We are planning to release updated LaMBO-2 code through the Prescient design repository before the paper's presentation at NeurIPS. This new codebase will include the full functionality for partial deep ensembles and the multi-objective acquisition functions described in the paper. Open source moves a bit slower in biotech because of intellectual property concerns. Thank you for your patience!
@ngruver Is LaMBO-2 not fully functional at the moment if one uses train_seq_model.py and subsequently sample_diffusion.py on one's own objectives? If not, what parts are missing?
Yep, it is fully functional to run on your own objectives, and you could simply work with your own implementation of multi-objective acquisition functions, or optimize a simple linear scalarization in order to optimize multiple objectives. You should be able to reproduce the experiments with the in silico objectives. The update will provide the additional primitives for multi-objective Bayesian optimization described in the paper, which were used in the in vitro antibody design experiments.
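For readers who want to try the linear-scalarization route mentioned above, here is a minimal sketch in plain Python. The function name `scalarize`, the weights, and the toy objective values are illustrative assumptions and are not part of the NOS or LaMBO-2 code; the sketch also assumes every objective is to be maximized.

```python
from typing import Sequence


def scalarize(objectives: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse several objective values into one weighted sum (all maximized)."""
    if len(objectives) != len(weights):
        raise ValueError("need exactly one weight per objective")
    return sum(w * y for w, y in zip(weights, objectives))


# Toy predictions for three candidates on two objectives (hypothetical numbers).
predictions = [
    [0.2, 0.9],
    [0.6, 0.6],
    [0.9, 0.1],
]
weights = [0.5, 0.5]  # equal preference; the weighting is a user choice

# One scalar score per candidate, then pick the highest-scoring candidate.
scores = [scalarize(p, weights) for p in predictions]
best_index = max(range(len(scores)), key=scores.__getitem__)
```

A fixed weighting can only recover solutions on the convex part of the Pareto front, which is one reason dedicated multi-objective acquisition functions exist.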
Thanks a lot for your reply. Let's say one has only a small dataset of antibodies with objective scores. Is it possible to train the unguided infilling model using your antibody sequences, then expand the model with the regression head and fine-tune on one's own antibody sequences with objective scores?
Yes that should work!
About the coming update: Is the independent discriminative model, which you used to filter the 30K LaMBO-2 suggestions down to 68, going to be disclosed in the NeurIPS update? Also, can you say how many sequences you had in the original seed pool you used in the Antibody Lead Optimization? (Or was it only 3, as signified by the 3 "x" signs in the plot?)
Sadly the ranking models were trained on data internal to Genentech, and therefore we will not be able to release the weights. We will try to add a more in depth description of how these models were trained in our next arxiv update. Re: seeds, that is correct. You can also take a look at Figure 15 in the current version on arxiv: https://arxiv.org/pdf/2305.20009.pdf
I'm sorry for the repeated question, but has the complete version of the LaMBO-2 code already been released? If it has been released, could you please provide the URL to the repository?
@samuelstanton
My earlier question was unclear. I'm very interested in the complete LaMBO-2 code mentioned in this comment (especially the combination of the diffusion model and partial deep ensembles). Are you suggesting that we can supplement the missing parts of the current code by referencing LaMBO?
Apologies. I tagged Sam because he is actively working on the release of the LaMBO-2 code and could give you an update. I believe he is on track to release it in the upcoming week or two, but I'm not really sure, as the repo will be the intellectual property of Prescient Design, and I am not an employee.
I apologize for the misunderstanding. I appreciate your generous response.
No worries at all, and thank you for your interest!
@ngruver Hi, I wonder whether the code and data to replicate the results in Section 5.3 (Antibody lead optimization: in silico evaluation) have been released? Thanks for your attention!
Hi @tnkyj and @EvaFlower I'm happy to share that the open-source alpha of the LaMBO-2 code has just been released!
https://github.com/prescient-design/cortex
In particular you will likely be interested in this tutorial. The tutorial is a bit simplified, but extending to the full multi-objective setting under uncertainty is relatively straightforward.
We are still in discussions with legal regarding the release of our internal antibody data. I can't make any promises but there have been some encouraging developments. Releasing this kind of dataset represents a substantial culture shift for biotech and pharma, and it will take some time to make that happen.
@samuelstanton. Your efforts in creating this package are greatly appreciated.
@samuelstanton Really appreciate your efforts!!!
Hi guys, thanks a lot for releasing the cortex code. I can't seem to find the full acquisition function implementation anywhere.
As you write in the code, there is a simple implementation in the GraphNEI class in _graph_nei.py. The way I see this implementation, it does not use any of the ensemble model's uncertainty in its evaluation of the candidate points, as the model is not given as input to the acq_functions, and it therefore must consider all point estimates to be equally certain. Is there a reason, apart from compute cost, for using this approach? And can the full implementation be found in your code somewhere?
self.acq_functions = [
    qLogExpectedHypervolumeImprovement(
        model=None,
        ref_point=f_ref,
        partitioning=FastNondominatedPartitioning(f_ref, f),
    )
    for f in f_non_dom
]
...
acq_vals = torch.stack(
    [fn._compute_log_qehvi(vals.unsqueeze(0)) for fn, vals in zip(self.acq_functions, obj_val_samples)]
)
return acq_vals.mean(0)
Hi Oliver, thanks for the question.
You found the right implementation. It's much simpler than what you might be used to because it's meant to work directly with samples from the model posterior, no need to pass the model around at all. To make things more explicit, if you have num_candidates points you want to evaluate, obj_val_samples will have shape (num_f_draws, num_candidates, num_objectives). When working with deep ensembles, num_f_draws is the number of ensemble components, but you can actually pass in samples from any posterior as long as they are "coherent", meaning every element of obj_val_samples[i] was computed from the same function draw $f^{(i)} \sim p(f | \mathcal{D})$.
pseudo-code
posterior = model(candidates)
f_samples = posterior.rsample(num_f_draws)
acq_vals = acq_fn(f_samples)
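The pseudo-code above can be made concrete with a toy ensemble in plain Python. Everything here is a hypothetical stand-in: `make_member`, `sample_posterior`, and `improvement_utility` are invented for illustration and are not the cortex or BoTorch API, and the utility is a deliberately simple placeholder rather than hypervolume improvement. The point of the sketch is the indexing convention (num_f_draws, num_candidates, num_objectives) and what "coherent" means in practice.

```python
import random
from typing import Callable, List

Vector = List[float]
Member = Callable[[float], Vector]


def make_member(seed: int) -> Member:
    """Build one toy ensemble member: a fixed random function x -> 2 objectives."""
    rng = random.Random(seed)
    slope, offset = rng.uniform(0.5, 1.5), rng.uniform(-0.1, 0.1)
    return lambda x: [slope * x + offset, 1.0 - x * x + offset]


def sample_posterior(ensemble: List[Member], candidates: List[float]) -> List[List[Vector]]:
    """Return samples indexed as [f_draw][candidate][objective].

    Each outer entry evaluates ONE member on ALL candidates, so every draw
    is coherent: all of its values come from the same function.
    """
    return [[member(x) for x in candidates] for member in ensemble]


def improvement_utility(values: List[Vector], baseline: float) -> float:
    """Toy utility for one function draw: best summed-objective gain over a baseline."""
    return max(0.0, max(sum(v) for v in values) - baseline)


ensemble = [make_member(seed) for seed in range(4)]  # num_f_draws = 4
candidates = [0.1, 0.5, 0.9]                         # num_candidates = 3
obj_val_samples = sample_posterior(ensemble, candidates)

# Utility per function draw, then the average across draws (the expectation).
per_draw = [improvement_utility(draw, baseline=1.0) for draw in obj_val_samples]
acq_val = sum(per_draw) / len(per_draw)
```

Mixing predictions from different members inside a single draw would break coherence, because the utility would then be evaluated on values that no single function could have produced.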
Thanks Sam
I am still a bit confused. Normally I would expect an acquisition function to take a candidate as input in the form of a num_objectives-dimensional distribution over the objective space, alongside the non-dominated points. It would then return a single acq-value based on this distribution.
In BoTorch the distribution is approximated with Monte Carlo (MC) sampling from the model (for example a GP). I expected that LaMBO would approximate the distribution of a candidate point with a Gaussian based on the mean and std from the ensemble, or use MC on the ensemble via BoTorch. However, when I look at the GraphNEI code, it seems to me that LaMBO does not input a distribution for a candidate to the acq-function, but rather a point estimate from each ensemble component. The acq function then returns num_f_draws acq-values. I guess these acq-values are computed as if each point estimate had no uncertainty. These num_f_draws acq-values are then averaged across the f_draws dimension.
Sorry for the very long comment. I am just trying to understand the model. Obviously, it seems to work very well. I guess the diffusion process trained on the training data may make the model select candidates that are closer to the training data and whose objective scores are therefore more certain anyway.
The key idea here is to think of each ensemble component as a function draw from the posterior. What I'm doing here is essentially exactly how BoTorch Monte Carlo estimation works. You have some routine to sample from $p(f | \mathcal{D})$, you pass those samples to a utility function (e.g. hypervolume improvement), then you average across the sample dimension to get an expectation. So yes, each ensemble component prediction is treated as a point prediction because it is the variance of those point predictions across the ensemble components that represents our epistemic uncertainty. In the same way you would treat the samples from a GP posterior as "certain" because the variance across different samples is how you are representing your uncertainty.
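A small numerical sketch may make this argument easier to follow. It assumes two maximized objectives and uses made-up numbers; the helpers `hypervolume_2d` and `hv_improvement` are hypothetical pure-Python stand-ins, not the BoTorch or cortex implementation. It compares the Monte Carlo estimate (average the improvement over ensemble members) with the improvement of the ensemble-mean prediction.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by `ref` (both maximized)."""
    # Only points that beat the reference in both objectives contribute.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area, best_y = 0.0, ref[1]
    for x, y in pts:  # sweep from the largest first objective downwards
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hv_improvement(candidate: Point, front: List[Point], ref: Point) -> float:
    """Hypervolume gained by adding one candidate to the current front."""
    return hypervolume_2d(front + [candidate], ref) - hypervolume_2d(front, ref)


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]  # current non-dominated points
ref = (0.0, 0.0)

# One prediction per ensemble member for the SAME candidate (toy values).
member_predictions = [(1.5, 1.5), (2.5, 2.5), (2.0, 2.0), (3.0, 3.0)]

# Monte Carlo estimate: utility of each function draw, averaged over draws.
per_member = [hv_improvement(p, front, ref) for p in member_predictions]
expected_hvi = sum(per_member) / len(per_member)

# Plug-in alternative: utility of the mean prediction, ignoring the spread.
n = len(member_predictions)
mean_prediction = (
    sum(p[0] for p in member_predictions) / n,
    sum(p[1] for p in member_predictions) / n,
)
plug_in_hvi = hv_improvement(mean_prediction, front, ref)
```

With these toy numbers the per-member improvements are 0, 1.25, 0, and 3, so the averaged estimate is 1.0625, while the plug-in value for the mean prediction (2.25, 2.25) is 0.5625. The difference comes entirely from the disagreement between members, which is how the ensemble's epistemic uncertainty enters the acquisition value even though each member is treated as a point prediction.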
Hi,
Thanks for solving my former issue #3.
I notice that the uploaded code does not seem to include the LaMBO-2 part. I am really interested in combining NOS with multi-objective optimization. Would you consider uploading the relevant code?
Thanks!