gitter-lab / active-learning-drug-discovery

End-to-end active learning pipeline for virtual screening and drug discovery
MIT License

Incorporating cost #2

Open Malnammi opened 5 years ago

Malnammi commented 5 years ago

See strategy at #1.

Currently, the code enforces budget constraints only through batch_size, with current values of [96, 384, 1536] that correspond to microplate sizes used in practice. The problem is that we don't consider molecule costs when selecting clusters/instances in our strategy; we simply fill the batch up to batch_size.

An alternative would be to use a combination of budget and batch_size, where we want to exhaust the batch_size but not go over the budget.

I see two methods of doing this:

  1. Incorporate cost into the exploitation and exploration weight equations via the average cluster cost, so that the strategy also favors clusters with low average cost.
  2. The cost comes into play when selecting instances from a cluster. If the cost of the sampled instance exceeds a certain percentage of the overall budget, then that instance is dropped.

I am leaning towards method 2.
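To make method 2 concrete, here is a minimal sketch, not code from the repo. The function name `select_from_cluster`, the `costs` dict, and the `max_cost_fraction` threshold are all hypothetical, and random sampling stands in for whatever instance ranking the strategy actually uses. It drops any instance whose cost exceeds a fixed fraction of the budget and stops once batch_size is reached, never going over the budget.

```python
import random


def select_from_cluster(instances, costs, batch_size, budget,
                        max_cost_fraction=0.05, seed=0):
    """Sample instances from one cluster under a batch_size and a budget.

    instances: list of instance ids in the cluster (hypothetical input).
    costs: dict mapping instance id -> compound cost (hypothetical input).
    max_cost_fraction: assumed threshold; an instance costing more than
        this fraction of the overall budget is dropped.
    """
    rng = random.Random(seed)
    candidates = list(instances)
    rng.shuffle(candidates)  # stand-in for the strategy's own sampling

    per_instance_cap = max_cost_fraction * budget
    selected, spent = [], 0.0
    for inst in candidates:
        if len(selected) >= batch_size:
            break  # batch is full
        cost = costs[inst]
        if cost > per_instance_cap:
            continue  # method 2: too expensive relative to the budget
        if spent + cost > budget:
            continue  # would go over budget; try a cheaper instance
        selected.append(inst)
        spent += cost
    return selected, spent
```

With this shape the batch may end up smaller than batch_size when the budget runs out first, so the caller would need to decide whether to top it up from another cluster.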

Please discuss or propose any other solutions.

agitter commented 5 years ago

One important consideration is whether compounds all have uniform cost. Scott is exploring this by obtaining quotes from different vendors.

Malnammi commented 5 years ago

Additional feedback from the group: we now have multiple costs to consider when comparing an iterative screening effort against a single large screen.

  1. Compound Cost: This cost is associated with purchasing molecules. A molecule might already be procured and thus have a cost of 0.
  2. Labor Cost: This generally includes the time-cost of procuring the molecule, setting up and running the physical experiment, and getting back the digital results. We will have a ballpark number for these.

For simulation purposes, we record various evaluation metrics at each iteration of the active learning pipeline. We should also record these cost metrics for later analysis.
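One possible shape for the per-iteration cost record is sketched below. It is not the pipeline's actual data structure: `IterationCosts`, `record_iteration_costs`, and the flat `labor_cost_per_iteration` ballpark are assumptions for illustration. Already-procured molecules simply carry a compound cost of 0 in the lookup.

```python
from dataclasses import dataclass


@dataclass
class IterationCosts:
    """Cost metrics for one active learning iteration (hypothetical)."""
    iteration: int
    n_selected: int
    compound_cost: float  # total purchase cost of the selected molecules
    labor_cost: float     # ballpark labor cost for this iteration

    @property
    def total_cost(self):
        return self.compound_cost + self.labor_cost


def record_iteration_costs(iteration, selected, compound_costs,
                           labor_cost_per_iteration):
    """Build the cost record for the molecules selected in one iteration.

    compound_costs: dict mapping molecule id -> purchase cost, where an
        already-procured molecule has cost 0.
    labor_cost_per_iteration: assumed flat ballpark figure per iteration.
    """
    compound_cost = sum(compound_costs[m] for m in selected)
    return IterationCosts(
        iteration=iteration,
        n_selected=len(selected),
        compound_cost=compound_cost,
        labor_cost=labor_cost_per_iteration,
    )
```

Storing one such record per iteration next to the evaluation metrics would let us compare the cumulative cost of iterative screening with that of a single large screen afterwards.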

agitter commented 5 years ago

There are at least three modes for iterative screening and cherry picking.

Mode 1: cherry pick from compounds at SMSF (LC and MLPCN libraries). The compound cost is low, but the labor cost is high because it may involve selecting a different plate for each compound in the batch.

Mode 2: purchasing compounds from a vendor like ChemDiv. The vendor would have a library of more than 1 million compounds. They would likely prepare fixed-size plates for us, so the labor of cherry picking would be incorporated in the compound cost. At least for some vendors, we can get a quote for a constant cost per compound.

Mode 3: a virtual library, like ZINC, that aggregates compounds from multiple vendors. There would be a high labor cost if it takes a lot of time to assess which prioritized compounds can even be purchased. The compound cost would be variable.
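If the simulation needs to switch between these modes, they could be captured as configuration. The sketch below is only an assumption about how that might look: the `ScreeningMode` class and its field names are hypothetical, and it uses qualitative labels taken from the descriptions above rather than dollar figures, since we don't have quotes yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningMode:
    """Qualitative cost profile of one screening mode (hypothetical)."""
    name: str
    source: str
    compound_cost: str  # "low", "constant", or "variable"
    labor_cost: str     # "high" or "included in compound cost"


SCREENING_MODES = [
    ScreeningMode("mode_1", "SMSF (LC and MLPCN libraries)",
                  compound_cost="low", labor_cost="high"),
    ScreeningMode("mode_2", "single vendor, e.g. ChemDiv",
                  compound_cost="constant",
                  labor_cost="included in compound cost"),
    ScreeningMode("mode_3", "multi-vendor virtual library, e.g. ZINC",
                  compound_cost="variable", labor_cost="high"),
]


def modes_with_high_labor(modes):
    """Return the names of modes whose labor cost is labeled high."""
    return [m.name for m in modes if m.labor_cost == "high"]
```

Once real quotes are available, the string labels could be replaced by numeric costs or by per-compound cost lookups.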