Remarks/issues with current weighted exploitation-exploration strategy

Malnammi commented 5 years ago

The current strategy assigns exploitation and exploration weights to clusters in the following manner:

The exploitation weight favors clusters with high activity and high density of labeled data.

The exploration weight favors clusters with low coverage and high uncertainty. We also have the option of selecting exploration clusters randomly or as a set of dissimilar clusters.

1. Consider only exploitation clusters with weights >= exploitation-threshold. Count the total number of unlabeled molecules in these clusters, M1 (all are predicted as highly active).
2. Similarly, consider only exploration clusters with weights >= exploration-threshold. Count the total number of unlabeled molecules in these clusters, M2.
3. Based on the ratio of M1 to M2, allocate percentages of the batch size to exploitation and exploration.
4. Now we sample clusters. The first cluster selected is the highest-weighted one. Then we sort the remaining clusters by λ * disim_i + (1 − λ) * W_ij, where disim_i denotes the average dissimilarity of cluster i to the already-selected clusters. That is, each time we select a cluster, we select one that is dissimilar to what was already selected. Alternatively, we select a cluster that is dissimilar to ALL other clusters in our data. The two options can result in very different space-coverage biases.
5. For each sampled cluster, we then sample instances within that cluster, either randomly or by selecting a set of dissimilar instances within some vicinity.
6. Note that we might not be able to sample from all qualifying clusters because of budget constraints. We estimate a per-cluster budget from instance counts, and then assign either an equal or a proportional budget to each cluster.
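
To make the threshold, allocation, and greedy selection steps above concrete, here is a minimal sketch in plain Python. It is not the repository code: the function names, the dict-based inputs, the rounding rule, and reading W_ij as a single per-cluster weight are all assumptions made for illustration.

```python
def allocate_batch(exploit_w, explore_w, unlabeled_counts,
                   exploit_thr, explore_thr, batch_size):
    """Split batch_size between exploitation and exploration by M1 : M2.

    All argument names are hypothetical. exploit_w / explore_w map a
    cluster id to its weight; unlabeled_counts maps a cluster id to its
    number of unlabeled molecules.
    """
    exploit_ids = [c for c, w in exploit_w.items() if w >= exploit_thr]
    explore_ids = [c for c, w in explore_w.items() if w >= explore_thr]
    m1 = sum(unlabeled_counts[c] for c in exploit_ids)
    m2 = sum(unlabeled_counts[c] for c in explore_ids)
    if m1 + m2 == 0:
        # No qualifying clusters at all: this case is an open question
        # in the thread, so the sketch allocates nothing.
        return exploit_ids, explore_ids, 0, 0
    n_exploit = round(batch_size * m1 / (m1 + m2))  # rounding is an assumption
    return exploit_ids, explore_ids, n_exploit, batch_size - n_exploit


def select_clusters(weights, dissim, n_select, lam=0.5):
    """Greedy selection scored by lam * disim_i + (1 - lam) * weight.

    dissim[a][b] is the dissimilarity between clusters a and b. disim_i
    is the average dissimilarity of candidate i to the clusters selected
    so far (the first of the two variants described above).
    """
    if n_select <= 0 or not weights:
        return []
    remaining = sorted(weights)  # sorted ids keep tie-breaking deterministic
    first = max(remaining, key=lambda c: weights[c])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < n_select:
        def score(c):
            avg = sum(dissim[c][s] for s in selected) / len(selected)
            return lam * avg + (1 - lam) * weights[c]
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam = 0 the selection reduces to ranking by weight alone, and with lam = 1 it ignores the weights after the first pick, so λ controls how strongly the batch is pushed toward diverse regions.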

The current code for this method is here: link. Hyperparameter configs are here: link.

Here are some pending issues with this:

What do we do if we have no qualifying exploitation clusters, no qualifying exploration clusters, or neither in steps 1 and 2? Should we just select the top 50% of clusters based on weights?
In step 2, should we exclude qualifying exploitation clusters from the exploration clusters? That is, once a cluster becomes a candidate exploitation cluster, it is no longer considered as an exploration cluster. The alternative would be to still allow a very uncovered cluster with a single highly predicted active molecule to be considered for both exploitation and exploration.
How do we incorporate costs? See issue #2.
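
The first two questions can be made concrete with a small sketch. It only illustrates the options being asked about, not a decision: the function names, the `fallback_frac` and `exclusive` parameters, and the choice to always keep at least one cluster in the fallback are hypothetical.

```python
def qualifying(weights, threshold, fallback_frac=0.5):
    """Cluster ids with weight >= threshold.

    If none qualify, fall back to the top fallback_frac of clusters by
    weight (the "top 50%" idea raised above; keeping at least one
    cluster is an extra assumption).
    """
    ids = [c for c, w in weights.items() if w >= threshold]
    if ids:
        return ids
    ranked = sorted(weights, key=weights.get, reverse=True)
    k = max(1, int(len(ranked) * fallback_frac))
    return ranked[:k]


def split_candidates(exploit_w, explore_w, exploit_thr, explore_thr,
                     exclusive=True):
    """Return (exploitation ids, exploration ids).

    exclusive=True drops every exploitation candidate from the
    exploration candidates; exclusive=False lets a cluster appear in
    both lists.
    """
    exploit = qualifying(exploit_w, exploit_thr)
    explore = qualifying(explore_w, explore_thr)
    if exclusive:
        taken = set(exploit)
        explore = [c for c in explore if c not in taken]
    return exploit, explore
```

Running both settings of `exclusive` on the same weights shows how many clusters would be double-counted, which may help decide between the two options.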

agitter commented 5 years ago

@Malnammi I have a question about the exploitation weight. The weight increases as the cluster coverage increases. At some point, wouldn't we want there to be diminishing returns for an active cluster?

Malnammi commented 5 years ago

@agitter my idea for the exploitation weight was:

Activity_i: This is the mean of highly active predictions within that cluster. High activity predictions are defined as those exceeding some threshold.
Coverage_i: This is the fraction of labeled/unlabeled molecules in the cluster. For clusters with more coverage (more labeled molecules), the model might be more confident/robust in that part of the space.

During the computation of exploitation weights for the clusters, if the cluster has no highly active predictions (exceeding the threshold), then its default Activity_i will be zero. In other words, it will be completely weighted by its coverage (i.e. W_i_exploit <= 0.5). It will be outranked by any cluster with Activity_i > 0.
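
As a rough illustration of this weighting, the sketch below combines the two terms with equal weights. The exact formula is not spelled out in this thread, so the equal-weight form (chosen only because it reproduces the W_i_exploit <= 0.5 bound when Activity_i = 0), the reading of Coverage_i as labeled count over total count, and all function and parameter names are assumptions.

```python
def exploitation_weight(unlabeled_preds, n_labeled, n_total,
                        activity_thr, alpha=0.5):
    """Hypothetical W_i_exploit = alpha * Activity_i + (1 - alpha) * Coverage_i.

    unlabeled_preds: predicted activities of the cluster's unlabeled molecules.
    n_labeled, n_total: labeled and total molecule counts in the cluster.
    """
    # Activity_i: mean of the highly active predictions, or 0 if there are none.
    high = [p for p in unlabeled_preds if p >= activity_thr]
    activity = sum(high) / len(high) if high else 0.0
    # Coverage_i: assumed here to be labeled / total, so it lies in [0, 1].
    coverage = n_labeled / n_total if n_total else 0.0
    return alpha * activity + (1 - alpha) * coverage
```

For example, a cluster with no predictions above the threshold and half of its molecules labeled gets 0.5 * 0 + 0.5 * 0.5 = 0.25 under these assumptions.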

This also raises another issue: what do we do if all the clusters have Activity_i = 0? Do we want to weigh based on Coverage_i alone? Or stop exploiting and focus more on exploration until our model becomes more confident?

We discussed that activity prediction ranges are model dependent; e.g. with small datasets, random forest typically gives a low range of predictions such as [0, 0.4]. In the current implementation we have a temporary remedy for this, where we set the thresholding parameter as a quantile rather than an absolute value. Specifically, with a quantile of 0.5, the highly active unlabeled molecules are those with predictions >= the median of the unlabeled predictions.
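
A minimal sketch of the quantile-based threshold, in plain Python so it stands alone. The helper names and the linear-interpolation quantile convention are assumptions, not the repository's implementation.

```python
def quantile(values, q):
    """q-th quantile (0 <= q <= 1), interpolating linearly between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("quantile of an empty sequence")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def highly_active(unlabeled_preds, q=0.5):
    """Return the threshold and the indices of predictions >= that threshold."""
    thr = quantile(unlabeled_preds, q)
    return thr, [i for i, p in enumerate(unlabeled_preds) if p >= thr]


# Predictions confined to [0, 0.4]: an absolute threshold of 0.5 would
# select nothing, while the 0.5 quantile selects the upper half.
preds = [0.05, 0.10, 0.20, 0.30, 0.40]
threshold, selected = highly_active(preds, q=0.5)
```

Because the threshold moves with the prediction distribution, some molecules always count as highly active, which is the reason the all-zero Activity_i case needs separate handling.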

agitter commented 5 years ago

I see, so the coverage is used to estimate confidence, not diminishing returns.

> what do we do if all the clusters have Activity_i = 0?

My initial thought is that it would make sense to focus on exploration, as you suggested.

For the activity prediction ranges, the temperature scaling method that Jay tested is described here: https://arxiv.org/pdf/1706.04599.pdf (I'm not certain that it is relevant for us.)

gitter-lab / active-learning-drug-discovery