theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0
22 stars 22 forks source link

Fix tolerance value #158

Closed xychem closed 1 year ago

xychem commented 1 year ago

The number of upper size means the max number we selested. upper_size = tol*size, if tol=5.0, the upper_size will be larger than we want.

codecov[bot] commented 1 year ago

Codecov Report

Merging #158 (32a8733) into main (d0b9c79) will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #158   +/-   ##
=======================================
  Coverage   40.64%   40.64%           
=======================================
  Files           8        8           
  Lines         556      556           
=======================================
  Hits          226      226           
  Misses        330      330           
Files Changed Coverage Δ
DiverseSelector/methods/dissimilarity.py 19.04% <100.00%> (ø)
xychem commented 1 year ago

This is reason for why we get 8 and 6 data points instead of the queried 12 for "Adapted Optimizable K-Dissimilarity Selection (OptiSim)" #147

FanwangM commented 1 year ago

Can you also add a testing for this change? @xychem Thanks.

FarnazH commented 1 year ago

@marco-2023 can you please review this change to make sure it is resolving the issue you were in reproducing the old notebook results? Thanks.

I checked the earlier versions of the code and noticed that the tolerance was 5.0. However the predict_radius implementation was changed, so it may not be used in the same way anymore.

marco-2023 commented 1 year ago

@FarnazH & @xychem: I went through the old and new code. In the old code the tol value of the object was interpreted by the predict_radius as the percent (of the number of samples to select) tolerance and in the new code is the fraction (of the number of samples to select) tolerance. The proposed tol=0.05 corresponds exactly to the old code's tol=5.0. The proposed change solves the issue with the method.

FarnazH commented 1 year ago

@marco-2023, you should have a This pull request is waiting on your review. comment (in a yellow box) at the top of this PR. Based on your comment, please approve this PR so that it can be merged.

FanwangM commented 1 year ago

Thanks for your comments and review. @marco-2023 I am going to merge this PR for now.