gitter-lab / active-learning-drug-discovery

End-to-end active learning pipeline for virtual screening and drug discovery
MIT License

Optimizing clustering #6

Open agitter opened 5 years ago

agitter commented 5 years ago

At the UW2020 meeting, there was discussion about how many clusters there should be. 60k for the LC library seemed like a lot. We should follow up with Scott and Spence to decide what types of clustering to use. It could impact the iterative pipeline as much as the next batch selector.

Malnammi commented 5 years ago

LC1234 has 94857 cpds with 133 actives. The following is the cluster information:

Malnammi commented 5 years ago

More discussion on this from emails:

I was thinking that when compounds are prioritized within a cluster, cpds that also belong to another previously explored cluster could be down-weighted. Structural ambiguity in the boundary cpds is unavoidable--some are likely to be truly chimeric and have scaffolds from 2 clusters.

Everyone agrees that this will add complexity to the prioritization, so for the first implementation, we plan on using fixed clustering methods.
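
For future reference, here is a rough sketch of what the down-weighting idea could look like if we revisit it after the fixed-clustering version. It is only an illustration, not code from this repo: the function name `downweight_boundary_compounds`, the multiplicative `penalty` parameter, and the assumption that each cpd can carry a set of cluster IDs (overlapping clustering) are all hypothetical.

```python
from typing import Dict, Set


def downweight_boundary_compounds(
    scores: Dict[str, float],
    memberships: Dict[str, Set[int]],
    current_cluster: int,
    explored_clusters: Set[int],
    penalty: float = 0.5,
) -> Dict[str, float]:
    """Rescale scores of compounds in `current_cluster`.

    Hypothetical scheme: a compound's score is multiplied by `penalty`
    once for every previously explored cluster it also belongs to.
    """
    adjusted = {}
    for cpd, score in scores.items():
        clusters = memberships.get(cpd, set())
        if current_cluster not in clusters:
            continue  # only prioritize within the current cluster
        # Other clusters this compound belongs to that were already explored.
        overlap = (clusters - {current_cluster}) & explored_clusters
        adjusted[cpd] = score * (penalty ** len(overlap))
    return adjusted


# Toy data (made up): cpd_a sits on the boundary of clusters 1 and 2.
scores = {"cpd_a": 0.9, "cpd_b": 0.8, "cpd_c": 0.7}
memberships = {"cpd_a": {1, 2}, "cpd_b": {2}, "cpd_c": {3}}

adjusted = downweight_boundary_compounds(
    scores, memberships, current_cluster=2, explored_clusters={1}
)
ranked = sorted(adjusted, key=adjusted.get, reverse=True)
print(ranked)
```

In this toy case the boundary compound `cpd_a` drops below `cpd_b` even though its raw score is higher, because cluster 1 was already explored. The extra inputs (overlapping memberships, a record of explored clusters, a penalty to tune) are the added complexity mentioned above.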