Open agitter opened 5 years ago
LC1234 has 94857 cpds with 133 actives. The following is the cluster information:
More discussion of these is in the emails: Murcko Scaffold ID, rdkit's BT clustering, Clustering_0.2, Clustering 0.3, Clustering_0.4. The Clustering_* ones are the custom implementation. Clustering_0.4 has over 15k true singletons out of ~22k unique clusters. The discussion led to a suggestion to use adaptive/fuzzy/weighted clustering of boundary cpds. Spencer had this to say, which I think is important here:

"I was thinking that when compounds are prioritized within a cluster, cpds that also belong to another previously explored cluster could be down-weighted. Structural ambiguity in the boundary cpds is unavoidable; some are likely to be truly chimeric and have scaffolds from 2 clusters."
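To make the down-weighting idea concrete, here is a minimal sketch. It assumes each cpd can carry more than one cluster membership (which the fixed clusterings do not give us yet) and that we already have a per-cpd score. The function name `prioritize_cluster`, the `penalty` factor of 0.5, and the toy data are all hypothetical, not something from the emails or the pipeline.

```python
def prioritize_cluster(cluster_id, scores, memberships, explored, penalty=0.5):
    """Rank the cpds of one cluster, down-weighting boundary cpds.

    scores:      dict mapping cpd ID -> prioritization score (higher is better)
    memberships: dict mapping cpd ID -> set of cluster IDs the cpd belongs to
    explored:    set of cluster IDs that were already explored
    penalty:     hypothetical multiplicative down-weight for boundary cpds
    """
    adjusted = {}
    for cpd, clusters in memberships.items():
        if cluster_id not in clusters:
            continue  # cpd is not part of the cluster being prioritized
        # Other clusters this cpd belongs to that were already explored.
        other_explored = (clusters - {cluster_id}) & explored
        weight = penalty if other_explored else 1.0
        adjusted[cpd] = scores[cpd] * weight
    # Highest adjusted score first.
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Toy example: cpd_a sits on the boundary between clusters 1 and 2.
scores = {"cpd_a": 0.9, "cpd_b": 0.8, "cpd_c": 0.7}
memberships = {"cpd_a": {1, 2}, "cpd_b": {2}, "cpd_c": {2, 3}}
ranking = prioritize_cluster(2, scores, memberships, explored={1})
print(ranking)
```

With cluster 1 already explored, cpd_a drops below cpd_b and cpd_c even though its raw score is the highest. A multiplicative penalty is only one option; the weight could also depend on how many explored clusters a cpd touches.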
Everyone agrees that this will add complexity to the prioritization, so for the first implementation, we plan on using fixed clustering methods.
At the UW2020 meeting, there was discussion about how many clusters there should be. 60k for the LC library seemed like a lot. We should follow up with Scott and Spence to decide what types of clustering to use. It could impact the iterative pipeline as much as the next batch selector.
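When comparing candidate clusterings, it may help to tabulate the cluster count and singleton count for each one in the same way. A rough sketch, assuming each clustering is available as a flat list of one cluster ID per cpd; `summarize_clustering` and the toy IDs are hypothetical:

```python
from collections import Counter


def summarize_clustering(cluster_ids):
    """Summarize one clustering given one cluster ID per cpd."""
    sizes = Counter(cluster_ids)  # cluster ID -> number of cpds in it
    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "n_compounds": len(cluster_ids),
        "n_clusters": n_clusters,
        "n_singletons": n_singletons,
        # Fraction of clusters that contain exactly one cpd.
        "singleton_fraction": n_singletons / n_clusters if n_clusters else 0.0,
    }


# Toy example: 7 cpds in 4 clusters, two of which are singletons.
summary = summarize_clustering([0, 0, 1, 2, 2, 2, 3])
print(summary)
```

Running this over each clustering column would give comparable numbers to bring to the follow-up discussion.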