gitter-lab / active-learning-drug-discovery

End-to-end active learning pipeline for virtual screening and drug discovery

Strategy Evaluation #3

Malnammi opened 5 years ago

Malnammi commented 5 years ago

See strategy at #1.

Currently, strategy evaluation computes the following metrics for each selected batch:

  1. n_hits: how many hits.
  2. norm_hits_ratio: the ratio of the number of hits to the max number of hits possible for this batch. Max hits is a function of the remaining unlabeled instances and the batch_size; i.e. if we have 100 remaining unlabeled hits and batch_size=96, then max_hits=min(96, 100).
  3. n_cluster_hits: how many unique clusters contain actives. Note this does not necessarily look at new clusters found (clusters not in the training set); it merely counts the unique clusters with hits in the selected batch.
  4. norm_cluster_hits_ratio: similar to norm_hits_ratio but for cluster hits.
  5. novelty_n_hits: This was a metric @agitter and I discussed in the past. The formula is novelty_n_hits = w * norm_hits_ratio + (1 - w) * norm_cluster_hits_ratio (see the sketch after this list).

These metrics can then be used to compare different strategies.
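
For concreteness, here is a minimal sketch of how these batch metrics could be computed. The column names (`is_hit`, `cluster_id`), the weight `w`, and the cluster-level normalization are assumptions for illustration, not necessarily what the pipeline does:

```python
# Minimal sketch, not the pipeline's actual implementation. Column names
# ("is_hit", "cluster_id"), the weight w, and the cluster normalization
# are assumptions. batch_df is expected to be a pandas DataFrame.
def batch_metrics(batch_df, n_unlabeled_hits, n_unlabeled_hit_clusters,
                  batch_size, w=0.5):
    hits = batch_df[batch_df["is_hit"] == 1]

    n_hits = len(hits)
    max_hits = min(batch_size, n_unlabeled_hits)
    norm_hits_ratio = n_hits / max_hits if max_hits > 0 else 0.0

    n_cluster_hits = hits["cluster_id"].nunique()
    max_cluster_hits = min(batch_size, n_unlabeled_hit_clusters)
    norm_cluster_hits_ratio = (n_cluster_hits / max_cluster_hits
                               if max_cluster_hits > 0 else 0.0)

    # Item 5: weighted combination of the two normalized ratios.
    novelty_n_hits = w * norm_hits_ratio + (1 - w) * norm_cluster_hits_ratio

    return {
        "n_hits": n_hits,
        "norm_hits_ratio": norm_hits_ratio,
        "n_cluster_hits": n_cluster_hits,
        "norm_cluster_hits_ratio": norm_cluster_hits_ratio,
        "novelty_n_hits": novelty_n_hits,
    }
```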

Malnammi commented 5 years ago

Update to the batch metrics calculated for each iteration:

  1. n_hits: how many hits.
  2. norm_hits_ratio: the ratio of the number of hits to the max number of hits possible for this batch. Max hits is a function of the remaining unlabeled instances and the batch_size; i.e. if we have 100 remaining unlabeled hits and batch_size=96, then max_hits=min(96, 100).
  3. n_cluster_hits: how many unique clusters contain actives. Note this does not necessarily look at new clusters found (clusters not in the training set); it merely counts the unique clusters with hits in the selected batch.
  4. norm_cluster_hits_ratio: similar to norm_hits_ratio but for cluster hits.
  5. novelty_n_hits: This takes into account two cluster ID sets, training_hit_clusters and batch_hit_clusters: novelty_n_hits = |batch_hit_clusters \ training_hit_clusters|, i.e. the number of newly identified clusters with hits.
  6. batch_size: This is recorded in the form of two metrics, with batch_size = exploitation_batch_size + exploration_batch_size, so we can keep track of the algorithm's exploit-explore budget allocation every iteration.
  7. computation_time: The computation time taken to select this iteration's batch. While this is machine-dependent, it will still be helpful to keep these in our records.
  8. batch_cost: The cost of the selected batch.
  9. screening_time: Estimate of the time taken to cherry-pick the selected batch, schedule it, physically screen it, and retrieve the data for the next computation iteration. This will be added to the general config files as an estimate for each screen; later on we can define this in more detail at the plate or molecule level. Also see issue #2. screening_time = cherry_picking_time_per_cpd * batch_size + screening_time_per_plate (see the sketch after this list).
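
A sketch of how items 5-9 could be assembled into a per-iteration record. The column names and the cost/time constants (`cost_per_cpd`, `cherry_picking_time_per_cpd`, `screening_time_per_plate`) are assumed config values, and the batch-cost formula is only a placeholder:

```python
# Sketch only: column names, config constants, and the cost model are assumptions.
def iteration_record(batch_df, training_df,
                     exploitation_batch_size, exploration_batch_size,
                     computation_time, cost_per_cpd,
                     cherry_picking_time_per_cpd, screening_time_per_plate):
    batch_hit_clusters = set(batch_df.loc[batch_df["is_hit"] == 1, "cluster_id"])
    training_hit_clusters = set(training_df.loc[training_df["is_hit"] == 1, "cluster_id"])

    batch_size = exploitation_batch_size + exploration_batch_size

    return {
        # Item 5: hit clusters in the batch that were not already hit clusters in training.
        "novelty_n_hits": len(batch_hit_clusters - training_hit_clusters),
        # Item 6: record the exploit/explore split alongside the total.
        "exploitation_batch_size": exploitation_batch_size,
        "exploration_batch_size": exploration_batch_size,
        "batch_size": batch_size,
        # Item 7.
        "computation_time": computation_time,
        # Item 8: placeholder per-compound cost model.
        "batch_cost": cost_per_cpd * batch_size,
        # Item 9: per-compound cherry-picking time plus per-plate screening time.
        "screening_time": cherry_picking_time_per_cpd * batch_size + screening_time_per_plate,
    }
```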

Please add any others you can think of.

agitter commented 5 years ago

These look comprehensive to me. Do we want to track how well the classifier performed in the last round or how well-calibrated it is? We could evaluate how it performed on the last batch of compounds by comparing its activity prediction and confidence with the new experimental screening data.

This is slightly related to Prof. Raschka's idea to make sure the classifier doesn't get worse over the iterations and start making mistakes on examples it correctly classified previously.
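
For the first point, a rough sketch of how the previous batch could be scored once its experimental labels arrive; the metric choice (ROC AUC plus Brier score as a simple calibration check) is just one option, and the variable names are placeholders:

```python
# Rough sketch: compare the model's predicted probabilities for the previous
# batch against the new experimental labels. Metric choice is an assumption.
from sklearn.metrics import brier_score_loss, roc_auc_score


def score_previous_batch(model, prev_batch_features, prev_batch_labels):
    probs = model.predict_proba(prev_batch_features)[:, 1]
    return {
        "roc_auc": roc_auc_score(prev_batch_labels, probs),
        # Lower is better; a crude check of how well-calibrated the confidences are.
        "brier_score": brier_score_loss(prev_batch_labels, probs),
    }
```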

Malnammi commented 5 years ago

@agitter Do you mean like this:

  1. At iteration i, given data for batch i-1, evaluate batch i-1 using AL metrics.
  2. Train model using batches 0, 1, ..., i-1.
  3. Evaluate trained model on batch i-2 using appropriate model metrics.
  4. Select batch i.
  5. Pass the selected batch to the screening facility (the ordering is sketched below).
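
In pseudocode form, the proposed ordering would look something like this; all function names here are placeholders, not existing pipeline code:

```python
# Placeholder functions throughout; this only pins down the proposed ordering.
def run_iteration(i, batches, model_cls):
    # 1. Evaluate the newly screened batch i-1 with the AL batch metrics.
    al_metrics = compute_batch_metrics(batches[i - 1])

    # 2. Train on all labeled batches 0, ..., i-1.
    model = model_cls().fit(*stack_batches(batches[:i]))

    # 3. Evaluate the trained model on batch i-2 with model-quality metrics.
    model_metrics = evaluate_model(model, batches[i - 2])

    # 4-5. Select batch i and pass it to the screening facility.
    batch_i = select_next_batch(model)
    submit_to_screening(batch_i)
    return al_metrics, model_metrics
```
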
Malnammi commented 5 years ago

Alternative model evaluation:

  1. train on batches 0, 1, ..., i-2.
  2. evaluate model quality on batch i-1 data.
  3. record suitable model quality metrics (still to be determined; see the sketch below).
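
A minimal sketch of that alternative, holding out the most recent labeled batch. The metric choice (ROC AUC and average precision) is a placeholder for whichever model quality metrics we settle on:

```python
# Sketch: train on batches 0..i-2, evaluate on batch i-1. Metric choice is open.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def alternative_model_eval(model_cls, X_batches, y_batches, i):
    X_train = np.vstack(X_batches[:i - 1])          # batches 0, ..., i-2
    y_train = np.concatenate(y_batches[:i - 1])
    model = model_cls().fit(X_train, y_train)

    probs = model.predict_proba(X_batches[i - 1])[:, 1]   # held-out batch i-1
    return {
        "roc_auc": roc_auc_score(y_batches[i - 1], probs),
        "average_precision": average_precision_score(y_batches[i - 1], probs),
    }
```
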
agitter commented 5 years ago

The above model evaluation is what I had in mind.

Prof. Raschka's idea is less clear to me. It may involve tracking your overall model quality as you train on more and more data, but how to do that is ambiguous. You could do something like cross-validation on batches 0, ..., i-1 at each iteration, but those metrics would be computed on datasets of different sizes and would not be easily comparable. We can probably drop this idea for now.