ngruver / NOS

Protein Design with Guided Discrete Diffusion
https://arxiv.org/abs/2305.20009
MIT License

Cannot find the code of guided diffusion with multiple objectives #4

Open EvaFlower opened 11 months ago

EvaFlower commented 11 months ago

Hi,

Thanks for resolving my previous issue #3.

I notice that the uploaded code does not seem to include the LaMBO-2 part. I am really interested in combining NOS with multi-objective optimization. Would you consider uploading the relevant code?

Thanks!

Morell1123 commented 10 months ago

Hi, LaMBO-2 works fine for me when I replicate their SASA experiments. You first need to train the diffusion model with guidance (see the last three parameters below); for their example, change "OBJECTIVE NAME 1" to "sasa":

PYTHONPATH="." python scripts/train_seq_model.py \ model=[MODEL TYPE] \ model.optimizer.lr=[MODEL LR] \ data_dir=[DATASET DIRECTORY] \ train_fn=[TRAINING CSV FILE] \ val_fn=[VALIDATION CSV FILE] \ vocab_file=[VOCAB FILE IN THIS REPO'S BASE DIR] \ log_dir=[LOGGING DIRECTORY] 'target_cols=[[OBJECTIVE NAME 1], ..., [OBJECTIVE NAME K]]' \ model.network.target_channels=[K] \ discr_batch_ratio=[RATIO OF GENERATIVE LOSS UPDATES TO DISCRIMINATIVE] \

Then, you can select one of your model weight checkpoints and do the LaMBO-2 sampling, which optimises for your model's objectives:

PYTHONPATH="." python scripts/control/sample_diffusion.py \
    model=mlm \
    model.network.target_channels=1 \
    ckpt_path=[CKPT PATH, MODEL TRAINED FOR BETA SHEETS] \
    +guidance_kwargs.step_size=1.0 \
    +guidance_kwargs.stability_coef=0.01 \
    +guidance_kwargs.num_steps=10 \
    +seeds_fn=[PATH TO poas_seeds.csv] \
    +results_dir=[RESULTS DIR]

ngruver commented 10 months ago

@EvaFlower We are planning to release updated LaMBO-2 code through the Prescient Design repository before the paper's presentation at NeurIPS. This new codebase will include the full functionality for partial deep ensembles and the multi-objective acquisition functions described in the paper. Open source moves a bit slower in biotech because of intellectual property concerns. Thank you for your patience!

Morell1123 commented 10 months ago

@ngruver Is LaMBO-2 not fully functional at the moment if one uses train_seq_model.py and subsequently sample_diffusion.py on one's own objectives? If not, what parts are missing?

ngruver commented 10 months ago

Yep, it is fully functional to run on your own objectives; you could simply work with your own implementation of multi-objective acquisition functions, or optimize a simple linear scalarization in order to handle multiple objectives. You should be able to reproduce the experiments with the in silico objectives. The update will provide the additional primitives for multi-objective Bayesian optimization described in the paper, which were used in the in vitro antibody design experiments.
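
For instance, a minimal sketch of such a linear scalarization as a standalone helper (the function name, the weights, and the (batch_size, K) shape of the regression-head output are assumptions for illustration, not part of this repo's API):

    import torch

    def scalarized_value(predicted_objectives: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        """Collapse K objective predictions into one guidance value per sequence.

        predicted_objectives: (batch_size, K) outputs of the model's regression head (assumed shape)
        weights: (K,) importance weights; flip the sign of any objective you want to minimize
        """
        return (predicted_objectives * weights).sum(dim=-1)  # (batch_size,)

Guidance would then ascend this single scalar instead of an individual objective.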

Morell1123 commented 10 months ago

Thanks a lot for your reply. Let's say one has only a small dataset of antibodies with objective scores. Is it possible to train the unguided infilling on your antibody sequences, and then subsequently expand the model with the regression head and fine-tune on one's own antibody sequences with objective scores?

ngruver commented 10 months ago

Yes that should work!
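
A rough sketch of what that two-stage setup could look like in PyTorch (the class and attribute names here are placeholders for illustration, not the actual modules in this repo; the real regression head is configured via target_cols and model.network.target_channels as in the command above):

    import torch
    import torch.nn as nn

    class GuidedSeqModel(nn.Module):
        """Stage 1: encoder pretrained on unlabeled antibody sequences (unguided infilling).
        Stage 2: add a regression head and fine-tune on the small labeled set."""

        def __init__(self, pretrained_encoder: nn.Module, hidden_dim: int, num_objectives: int):
            super().__init__()
            self.encoder = pretrained_encoder
            self.regression_head = nn.Linear(hidden_dim, num_objectives)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            h = self.encoder(tokens)                    # (batch, seq_len, hidden_dim)
            return self.regression_head(h.mean(dim=1))  # pooled objective predictions (batch, num_objectives)

In stage 2 you would fine-tune with a regression loss (e.g. MSE) on the labeled sequences, optionally keeping some generative-loss updates in the mix, analogous to the discr_batch_ratio setting above.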

Morell1123 commented 10 months ago

About the coming update: is the independent discriminative model, which you used to filter the 30K LaMBO-2 suggestions down to 68, going to be disclosed in the NeurIPS update? Also, can you say how many sequences you had in the original seed pool used in the Antibody Lead Optimization? (Or was it only 3, as indicated by the three "x" marks in the plot?)

ngruver commented 10 months ago

Sadly, the ranking models were trained on data internal to Genentech, so we will not be able to release the weights. We will try to add a more in-depth description of how these models were trained in our next arXiv update. Re: seeds, that is correct. You can also take a look at Figure 15 in the current version on arXiv: https://arxiv.org/pdf/2305.20009.pdf

tnkyj commented 7 months ago

I'm sorry for the repeated question, but has the complete version of the LaMBO-2 code been released yet? If so, could you please provide the URL of the repository?

ngruver commented 7 months ago

@samuelstanton

tnkyj commented 7 months ago

My previous question was incomplete. I'm very interested in the complete code of LaMBO-2 as mentioned in this comment (especially the combination of the diffusion model and partial deep ensembles). Are you suggesting that we can supplement the missing parts of the current code by referencing LaMBO?

ngruver commented 7 months ago

Apologies. I tagged Sam because he is actively working on the release of the LaMBO-2 code and could give you an update. I believe he is on track to release it in the upcoming week or two, but I'm not really sure, as the repo will be the intellectual property of Prescient Design and I am not an employee.

tnkyj commented 7 months ago

I apologize for the misunderstanding. I appreciate your generous response.

ngruver commented 7 months ago

No worries at all, and thank you for your interest!

EvaFlower commented 6 months ago

@ngruver Hi, I wonder whether the code and data to replicate the results in Section 5.3 (Antibody lead optimization: in silico evaluation) have been released? Thanks for your attention!

samuelstanton commented 6 months ago

Hi @tnkyj and @EvaFlower I'm happy to share that the open-source alpha of the LaMBO-2 code has just been released!

https://github.com/prescient-design/cortex

In particular you will likely be interested in this tutorial. The tutorial is a bit simplified, but extending to the full multi-objective setting under uncertainty is relatively straightforward.

We are still in discussions with legal regarding the release of our internal antibody data. I can't make any promises but there have been some encouraging developments. Releasing this kind of dataset represents a substantial culture shift for biotech and pharma, and it will take some time to make that happen.

tnkyj commented 6 months ago

@samuelstanton. Your efforts in creating this package are greatly appreciated.

EvaFlower commented 6 months ago

@samuelstanton Really appreciate your efforts!!!

Morell1123 commented 6 months ago

Hi guys, thanks a lot for releasing the cortex code. I can't seem to find the full acquisition function implementation anywhere.

As you note in the code, there is a simple implementation in the GraphNEI class in _graph_nei.py. The way I read this implementation, it does not use any of the ensemble model's uncertainty in its evaluation of the candidate points, as the model is not given as input to the acq_functions, so it must consider all point estimates to be equally certain. Is there a reason, apart from compute cost, for using this approach? And can the full implementation be found somewhere in your code?

        self.acq_functions = [
            qLogExpectedHypervolumeImprovement(
                model=None,
                ref_point=f_ref,
                partitioning=FastNondominatedPartitioning(f_ref, f),
            )
            for f in f_non_dom
        ]

...

        acq_vals = torch.stack(
            [fn._compute_log_qehvi(vals.unsqueeze(0)) for fn, vals in zip(self.acq_functions, obj_val_samples)]
        )
        return acq_vals.mean(0)

samuelstanton commented 6 months ago

Hi Oliver, thanks for the question.

You found the right implementation. It's much simpler than what you might be used to because it's meant to work directly with samples from the model posterior; there's no need to pass the model around at all. To make things more explicit: if you have num_candidates points you want to evaluate, obj_val_samples will have shape (num_f_draws, num_candidates, num_objectives). When working with deep ensembles, num_f_draws is the number of ensemble components, but you can actually pass in samples from any posterior as long as they are "coherent", meaning every element of obj_val_samples[i] was computed from the same function draw $f^{(i)} \sim p(f | \mathcal{D})$.

pseudo-code

posterior = model(candidates)               # posterior over objective values for each candidate
f_samples = posterior.rsample(num_f_draws)  # (num_f_draws, num_candidates, num_objectives)
acq_vals = acq_fn(f_samples)                # utility per draw, averaged over the draw dimension
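
A slightly more concrete, self-contained toy version of the same flow (the ensemble, sizes, and names are placeholders for illustration, not the cortex API):

    import torch
    import torch.nn as nn

    # Three ensemble components mapping a 16-dim candidate encoding to 2 objectives.
    ensemble = [nn.Linear(16, 2) for _ in range(3)]
    candidates = torch.randn(8, 16)  # 8 candidate points

    with torch.no_grad():
        # One point prediction per component -> "coherent" samples from the posterior.
        obj_val_samples = torch.stack(
            [component(candidates) for component in ensemble]  # each (num_candidates, num_objectives)
        )

    print(obj_val_samples.shape)  # torch.Size([3, 8, 2]) = (num_f_draws, num_candidates, num_objectives)

A tensor of this shape is exactly what the acquisition snippet above consumes before averaging over the draw dimension.
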
Morell1123 commented 6 months ago

Thanks Sam, I am still a bit confused. Normally I would expect an acquisition function to take a candidate as input in the form of a num_objectives-dimensional distribution over the objective space, alongside the non-dominated points. It would then return a single acq-value based on this distribution. In BoTorch the distribution is approximated with Monte Carlo (MC) sampling from the model (for example a GP). I expected that LaMBO would approximate the distribution of a candidate point with a Gaussian based on the mean and standard deviation from the ensembles, or use MC on the ensembles via BoTorch. However, when I look at the GraphNEI code, it seems to me that LaMBO does not input a distribution for a candidate to the acq-function, but rather a point estimate from each ensemble component. The acq function then returns num_f_draws acq-values. I guess these acq-values are computed as if each point estimate had no uncertainty. These num_f_draws acq-values are then averaged across the f_draws dimension.

Sorry for a very long comment. I am just trying to understand the model. Obviously, it seems to work very well. I guess maybe the diffusion process trained on the training data makes the model select candidates that are closer to the training data and whose objective scores are therefore more certain anyway.

samuelstanton commented 6 months ago

The key idea here is to think of each ensemble component as a function draw from the posterior. What I'm doing here is essentially how BoTorch Monte Carlo estimation works: you have some routine to sample from $p(f | \mathcal{D})$, you pass those samples to a utility function (e.g. hypervolume improvement), then you average across the sample dimension to get an expectation. So yes, each ensemble component prediction is treated as a point prediction, because it is the variance of those point predictions across the ensemble components that represents our epistemic uncertainty. In the same way, you would treat individual samples from a GP posterior as "certain", because the variance across different samples is how you represent your uncertainty.
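
To write the averaging out explicitly (my paraphrase of the above, using the same notation): with $S$ ensemble components standing in for posterior draws and $u$ the utility (e.g. log hypervolume improvement over the non-dominated front), the acquisition value is the Monte Carlo estimate

$$\alpha(x) \approx \frac{1}{S} \sum_{i=1}^{S} u\big(f^{(i)}(x)\big), \qquad f^{(i)} \sim p(f \mid \mathcal{D}),$$

which corresponds to the acq_vals.mean(0) in the GraphNEI snippet above.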