theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0

Identify/Choose Methods for Selecting Diverse Samples #2

Closed PaulWAyers closed 2 years ago

PaulWAyers commented 2 years ago

@Ali-Tehrani suggested using determinantal point processes (DPPs). I found existing code for this in Julia and Python:
https://github.com/theogf/DeterminantalPointProcesses.jl
https://github.com/guilgautier/DPPy
https://github.com/mbp28/determinantal-point-processes
https://github.com/sverdoot/regularized-dpp

The last of these, regularized-dpp, implements this paper: https://arxiv.org/pdf/1906.04133.pdf

Some other papers are: http://proceedings.mlr.press/v99/derezinski19a/derezinski19a.pdf https://openreview.net/pdf?id=BkzBwNrlLS

This paper, which @Ali-Tehrani found, suggests that diverse sampling improves not only the speed of kernel methods but also their accuracy and robustness. I have only read the abstract so far. https://arxiv.org/abs/2002.08616
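To make the DPP idea concrete, here is a minimal sketch of why a determinant rewards diversity. It is not a true DPP sampler and does not use the API of DPPy or any of the packages linked above. It greedily grows a subset whose kernel submatrix has the largest determinant, which is a common stand-in for finding a high-probability DPP subset. The function names `rbf_kernel` and `greedy_dpp_select`, the RBF similarity, and the `gamma` parameter are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Similarity matrix L[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def greedy_dpp_select(L, k):
    """Greedily pick up to k indices that approximately maximize det(L[S, S]).

    Hypothetical helper: a brute-force greedy search, fine for small data sets.
    """
    selected = []
    remaining = list(range(L.shape[0]))
    for _ in range(k):
        best_idx, best_logdet = None, -np.inf
        for i in remaining:
            trial = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(trial, trial)])
            # Near-duplicate points make the submatrix almost singular,
            # so their log-determinant is very negative and they lose.
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:
            break  # no candidate keeps the submatrix positive definite
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Two tight clusters: a diverse pair should take one point from each.
X = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [5.01, 5.0]])
subset = greedy_dpp_select(rbf_kernel(X), k=2)
```

With this toy input, the second pick should come from the cluster that the first pick did not use, because pairing two near-identical points drives the determinant toward zero.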

FanwangM commented 2 years ago

Thanks for sharing! Looks very promising! @PaulWAyers @Ali-Tehrani

The DPPy paper can be found in the tools subfolder. I am reading more about determinantal point processes and hope we can employ them. #4

FanwangM commented 2 years ago

A practical example of using DPPs for diverse subset sampling can be found in "Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling". @PaulWAyers @Ali-Tehrani

PaulWAyers commented 2 years ago

The strategy I was proposing, using kd-trees to select maximally diverse samples, is not new. https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/abs/10.1021/ci980100c

J. Chem. Inf. Comput. Sci. 1999, 39, 1, 51–58
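The thread does not spell out the kd-tree strategy, so the following is only one plausible reading, not the method of the cited paper: split the data into cells with kd-tree style median cuts, then keep one representative per cell. The function names `kd_partition` and `kd_select`, and the choice of the point nearest the cell mean as the representative, are assumptions made for illustration.

```python
import numpy as np


def kd_partition(X, n_cells):
    """Split the row indices of X into up to n_cells groups by median cuts."""
    cells = [np.arange(X.shape[0])]
    while len(cells) < n_cells:
        # Split the most populated cell next.
        cells.sort(key=len)
        idx = cells.pop()
        if len(idx) < 2:
            cells.append(idx)
            break  # every cell holds a single point; nothing left to split
        pts = X[idx]
        # Cut along the coordinate with the largest spread, at the median.
        axis = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = np.argsort(pts[:, axis])
        half = len(idx) // 2
        cells.append(idx[order[:half]])
        cells.append(idx[order[half:]])
    return cells


def kd_select(X, n_select):
    """Pick one representative per cell: the point nearest the cell mean."""
    chosen = []
    for idx in kd_partition(X, n_select):
        center = X[idx].mean(axis=0)
        nearest = np.argmin(np.linalg.norm(X[idx] - center, axis=1))
        chosen.append(int(idx[nearest]))
    return chosen


# Two well-separated groups along the first coordinate.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
picked = kd_select(X, n_select=2)
```

Because each cell covers a different region of feature space, the representatives are spread out by construction, although this simple version makes no claim to being maximally diverse.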

PaulWAyers commented 2 years ago

Subsumed by #7