theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0

[Selector module] Add the method in `MultipleComparisons` #53

Open FanwangM opened 2 years ago

FanwangM commented 2 years ago

Ramon's group has a clever way of doing diverse selection. We have an in-house implementation of it and should merge it into this repo.

PaulWAyers commented 2 years ago

https://github.com/ramirandaq/MultipleComparisons/tree/master/ECS_MeDiv

(@ramirandaq says this is the most up-to-date version).

PaulWAyers commented 2 years ago

I'm not exactly sure what algorithm is used for diverse selection here, but this quantifies the diversity of a subset well, and it could obviously be combined with a greedy method like those used in determinantal point processes.

ramirandaq commented 2 years ago

Hi, just a quick recap of the diversity pickers that we have implemented:

1- Max_nDis (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00504-4): starts from a random point and looks to minimize the extended similarity of the selected objects.

2- ECS-MeDiv (https://chemrxiv.org/engage/chemrxiv/article-details/62449b5b3b5f991e7cca0670): this is a bit more nuanced because we also implemented several possible starting points: random, medoid (the most representative element of the set), and outlier (we perform both the medoid and the outlier selection in O(N)). We again seek to minimize the extended similarity, but with an extra step: when several elements give the same extended similarity, we select the one that also minimizes the binary similarity.
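
To make the recap above concrete, here is a rough Python sketch of the general idea, not the code in the linked repository. Everything named here (`extended_similarity`, `complementary_similarity`, `tanimoto`, `greedy_select`) is hypothetical, and the index is a deliberately simplified stand-in (mean per-column agreement computed from column sums) rather than any of the published extended similarity indices. The sketch also assumes that the medoid is the object whose removal leaves the least similar set and the outlier the one whose removal leaves the most similar set.

```python
import numpy as np


def extended_similarity(col_sums, n):
    """Simplified stand-in for an n-ary similarity index (an assumption).

    For each bit column, |2*c - n| / n is 1 when all n objects agree and 0
    when they split evenly; the index is the mean over columns.
    """
    col_sums = np.asarray(col_sums, dtype=float)
    return float(np.mean(np.abs(2.0 * col_sums - n) / n))


def complementary_similarity(X):
    """Similarity of the set with each object left out, from one column sum.

    The cost is linear in the number of objects (times the number of bits).
    """
    X = np.asarray(X)
    n = X.shape[0]
    total = X.sum(axis=0)
    return np.array([extended_similarity(total - row, n - 1) for row in X])


def tanimoto(a, b):
    """Binary Jaccard-Tanimoto similarity of two bit vectors."""
    both = np.sum(np.logical_and(a, b))
    either = np.sum(np.logical_or(a, b))
    return both / either if either else 1.0


def greedy_select(X, size, start):
    """Greedily add the object that minimizes the stand-in index.

    Ties are broken by the lowest summed binary similarity to the
    objects already selected.
    """
    X = np.asarray(X)
    selected = [start]
    col_sums = X[start].astype(float)  # running column sums of the selection
    remaining = [i for i in range(len(X)) if i != start]
    while len(selected) < size and remaining:
        n_new = len(selected) + 1
        scores = [extended_similarity(col_sums + X[i], n_new) for i in remaining]
        best = min(scores)
        tied = [i for i, s in zip(remaining, scores) if np.isclose(s, best)]
        pick = min(tied, key=lambda i: sum(tanimoto(X[i], X[j]) for j in selected))
        selected.append(pick)
        remaining.remove(pick)
        col_sums += X[pick]
    return selected


# Toy data: three identical fingerprints and one very different one.
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]])
comp = complementary_similarity(X)
medoid = int(np.argmin(comp))   # assumed definition: lowest complementary similarity
outlier = int(np.argmax(comp))  # assumed definition: highest complementary similarity
picked = greedy_select(X, size=2, start=medoid)
```

Keeping running column sums means each candidate is scored in time proportional to the fingerprint length, without recomputing pairwise similarities inside the selected set.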

FanwangM commented 1 year ago

Attempt given in #116.

FanwangM commented 1 year ago

Hi @ramirandaq. Is there any chance you could help finish the implementation in #116? I am wondering whether we want to have it for our v1 release. Thanks.

PaulWAyers commented 1 year ago

@ramirandaq's methods are basically diversity measures, because they compare several molecules/strings/vectors at once. The diversity measures should perhaps be implemented as such, with the algorithms that use them put into "selector." Then any "diversity" measure can be used in n-ary selection algorithms.
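
One possible shape for that split, as a minimal sketch only: the selection routine takes any callable that maps a subset (a 2D array) to a single number, with higher meaning more diverse. None of these names (`one_minus_agreement`, `log_det_volume`, `select_n_ary`) exist in the package; they are made up for illustration, and both measures are simple stand-ins rather than the indices from the papers.

```python
import numpy as np


def one_minus_agreement(subset):
    """Hypothetical n-ary measure: 1 minus the mean per-column agreement."""
    n = subset.shape[0]
    col_sums = subset.sum(axis=0)
    return 1.0 - float(np.mean(np.abs(2.0 * col_sums - n) / n))


def log_det_volume(subset, eps=1e-6):
    """Hypothetical DPP-style measure: log-determinant of the Gram matrix."""
    gram = subset @ subset.T + eps * np.eye(subset.shape[0])
    return float(np.linalg.slogdet(gram)[1])


def select_n_ary(X, size, diversity, start=0):
    """Greedy selection that works with any subset-level diversity callable."""
    X = np.asarray(X, dtype=float)
    selected = [start]
    remaining = [i for i in range(len(X)) if i != start]
    while len(selected) < size and remaining:
        # Score each candidate by the diversity of the selection plus that candidate.
        pick = max(remaining, key=lambda i: diversity(X[selected + [i]]))
        selected.append(pick)
        remaining.remove(pick)
    return selected


# Toy data: three identical fingerprints and one very different one.
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]])
by_agreement = select_n_ary(X, 2, one_minus_agreement)
by_log_det = select_n_ary(X, 2, log_det_volume)
```

Because the selector only ever calls `diversity(subset)`, swapping in a different measure does not require touching the selection code.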