gitter-lab / active-learning-drug-discovery

End-to-end active learning pipeline for virtual screening and drug discovery
MIT License

Ordinal regression #7

Open agitter opened 5 years ago

agitter commented 5 years ago

Prof. Raschka suggested prioritizing clusters instead of individual compounds. His idea was to use ordinal regression to predict the number of actives in each cluster. This would require featurizing clusters with a consensus fingerprint or some other summary of the member compounds' features.
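To make the ordinal-regression idea concrete, here is a minimal sketch of how per-cluster active counts could be turned into ordinal targets. It assumes the common "cumulative binary" encoding (one yes/no target per threshold between levels). This is not something the repository implements; the bin edges, function names, and toy counts are hypothetical.

```python
import numpy as np


def counts_to_ordinal_levels(active_counts, bin_edges):
    """Map per-cluster active counts to ordinal levels 0..K-1.

    bin_edges are hypothetical lower bounds of levels 1..K-1.
    """
    return np.digitize(active_counts, bin_edges)


def encode_cumulative(levels, n_levels):
    """Encode each level as K-1 binary targets: is level >= k, for k = 1..K-1."""
    thresholds = np.arange(1, n_levels)
    return (levels[:, None] >= thresholds[None, :]).astype(int)


def decode_cumulative(binary_scores, cutoff=0.5):
    """Recover a level by counting how many thresholds are predicted as exceeded."""
    return (np.asarray(binary_scores) > cutoff).sum(axis=1)


# Toy example: number of actives observed in five clusters (made-up values).
active_counts = np.array([0, 1, 3, 7, 12])
bin_edges = np.array([1, 3, 10])  # assumed levels: 0, 1-2, 3-9, 10+

levels = counts_to_ordinal_levels(active_counts, bin_edges)
targets = encode_cumulative(levels, n_levels=len(bin_edges) + 1)
recovered = decode_cumulative(targets.astype(float))

print(levels)
print(targets)
print(recovered)
```

Each column of `targets` could then be fit by any binary classifier on the cluster features, which is one simple way to respect the ordering of the levels.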

We do not want to change our approach, but we should consider the pros and cons of this idea so that we know the strengths of our approach.

Malnammi commented 5 years ago

This is somewhat related to the consensus fingerprint for a cluster. The cluster-based-selector now supports an option for computing cluster dissimilarity from consensus fingerprints rather than by comparing every instance within each cluster. The consensus fingerprint of cluster ci is computed as:

    ci_instances = np.where(clusters == ci)[0]
    X_consensus = ((np.sum(X[ci_instances, :], axis=0) / ci_instances.shape[0]) >= 0.5).astype(float)

In words, we set the bit at position i if at least half of the instances in the cluster have that bit set (the >= 0.5 threshold counts ties as set). Applying this dissimilarity computation to 20 randomly chosen dense clusters gives results that are mostly similar to the instance-by-instance method (within ±0.04 in most cases; a few cases differed by up to ±0.1).
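A small self-contained sketch of this comparison, for illustration only. The thread does not say which dissimilarity measure is used, so Tanimoto (Jaccard) distance is an assumption here, as is averaging over all cross-cluster pairs for the instance-by-instance method. The function names and toy data are hypothetical.

```python
import numpy as np


def consensus_fingerprint(X, clusters, ci):
    """Set bit i if at least half of the instances in cluster ci have it set."""
    ci_instances = np.where(clusters == ci)[0]
    return (X[ci_instances, :].mean(axis=0) >= 0.5).astype(float)


def tanimoto_dissimilarity(a, b):
    """1 - |a AND b| / |a OR b| for binary vectors (assumed metric)."""
    intersection = np.sum(np.logical_and(a, b))
    union = np.sum(np.logical_or(a, b))
    return 0.0 if union == 0 else 1.0 - intersection / union


def mean_pairwise_dissimilarity(X, clusters, ci, cj):
    """Instance-by-instance variant: average over all cross-cluster pairs."""
    A = X[clusters == ci]
    B = X[clusters == cj]
    return float(np.mean([tanimoto_dissimilarity(a, b) for a in A for b in B]))


# Toy binary fingerprints: three instances in cluster 0, two in cluster 1.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)
clusters = np.array([0, 0, 0, 1, 1])

c0 = consensus_fingerprint(X, clusters, 0)
c1 = consensus_fingerprint(X, clusters, 1)
consensus_d = tanimoto_dissimilarity(c0, c1)
pairwise_d = mean_pairwise_dissimilarity(X, clusters, 0, 1)

print("consensus-based:", consensus_d)
print("instance-by-instance:", pairwise_d)
```

The consensus variant compares one vector per cluster, while the pairwise variant touches every cross-cluster pair of instances.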

This consensus method should reduce overall memory costs, since each cluster is represented by a single fingerprint rather than by all of its instances.