theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0
22 stars 21 forks source link

some point view for DirectedSphereExclusion selested number relate to #152 #159

Open xychem opened 1 year ago

xychem commented 1 year ago

In the DirectedSphereExclusion method, the setected number is related to r(radius). When r is larger, we will get fewer molecules; otherwise we will get more molecules. The function optimize_radius of utils.py is used to optimize r through iteration. ( When selected number is larger, we decrease r; otherwise we can increase r. )

But in the case which we choose 12 points in 3 clusters by using DirectedSphereExclusion, the setected number is sensitive to r which causes the oscillation of selected number. We can see when r > 1.919372827, the selected number = 3; when r $\leq$ 1.919372826, the selected number = 5. ( Which means existing two points which are "close" enough. )


iteration_number 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
selected_number 2 2 3 3 5 3 3 5 3 5 5 5 5 5 5 3 3 3 5 5 5 5 5 3 3 5 5 3 3 3 3 5 3 3 5 3 5 3 3 3 5 5 5 3 5 5 5 5 3 5 3 5 3 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
r 3.573701539 2.680276154 2.233563462 2.010207116 1.898528943 1.954368029 1.926448486 1.912488714 1.9194686 1.915978657 1.917723629 1.918596114 1.919032357 1.919250479 1.919359539 1.91941407 1.919386805 1.919373172 1.919366356 1.919369764 1.919371468 1.91937232 1.919372746 1.919372959 1.919372852 1.919372799 1.919372826 1.919372839 1.919372833 1.919372829 1.919372828 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827 1.919372827

The previous situation (11 points) image

The present situation (13 points) image

marco-2023 commented 1 year ago

A possibility could be decreasing r until the desired number of samples (or more) are selected. In the case of a transition ( with a slight change in r) from fewer selected samples than needed to more selected samples than needed, the last added samples will be "degenerate" (quoting @PaulWAyers). In this case, we can just drop the last selected samples (these will be degenerate) until only the required number of samples is selected.

Aditish51 commented 6 months ago

@xychem Hello, This is Aditi. Can I give it a try? Can you please provide me more detail on it?

PaulWAyers commented 6 months ago

@FanwangM can you help @Aditish51 with this?

xychem commented 6 months ago

Hi, Aditi. I hope these information below can help you!

structure of software

When you run DSE example in the notebook, the order of functions which are called is below.

  1. abstract class SelectionBase in DiverseSelector/methods/base.py
  2. function select_from_cluster of class DirectedSphereExclusion in DiverseSelector/methods/partition.py
  3. function optimize_radius in DiverseSelector/methods/utils.py
  4. function algorithm of class DirectedSphereExclusion in DiverseSelector/methods/partition.py

structure of DSE algorithm

The idea of the DSE is that (code is in algorithm in DiverseSelector/methods/partition.py)

  1. first set r (radius) and choose a reference point, calculate the distance between reference point and other points
    # calculate distance of all samples from reference sample; distance is a (n_samples,) array
        distances = scipy.spatial.minkowski_distance(X[self.ref_index], X, p=self.p)
  2. choose a nearest point (P) to reference point as a selected sample and calculate distance between other points and P, then ignore the distrance which is less than r
    # find index of all samples within radius of sample idx (this includes the sample
              # index itself)
              index_exclude = kdtree.query_ball_point(
                  X[idx], self.r, eps=self.eps, p=self.p, workers=-1
              )
              # exclude samples within radius r of sample idx (measure by Minkowski p-norm) from
              # future consideration by setting their bitarray value to 1
              for index in index_exclude:
                  bv[index] = 1
  3. add another point (Q) in which the distance Q and P is larger than r, then calculate distance bewteen other points and selected points (Q and P) and ignore the distance which is less than r
  4. repeat the progress until the number of selected_points is larger than max_size or all idx in index_sorted are used
    if len(selected) > max_size:
                    return selected

    problem

    When the r is larger, we will select fewer points (in which their distances between each other are larger than r). When the r is smaller, we can select more points (in which their distances between each other are larger than r). As above, one significant thing is to optimize r(radius) to get a proper r (the number of the points (in which their distances between each other are larger than r) is equal to what we want) which is coded in DiverseSelector/methods/utils.py. This issue is about there exists some situations that in special r, some points will be "degenerate" (quoting @PaulWAyers). We can see the list above when r > 1.919372827, the selected number = 3; when r $\leq$ 1.919372826, the selected number = 5.

one solution

What I thought is as same as @marco-2023 , I droped the last selected samples after the iteration. However I think the different selected samples maybe cause different consequence (more or less) when selecting small samples (like in the notebook, we just select 4 points in each cluster, the weight of one sample is larger, so the consequence may be different). The code below (in DiverseSelector/methods/utils.py) is one way to drop last selected samples (I think it's not good because the if condition in the while loop, which cause larger caclulation). I think you can directly drop the last selected samples by using array.pop() out of the while loop and in this way, you can also drop special selected samples, not just last one.

 while (len(selected) < lower_size or len(selected) > upper_size) and (n_iter < obj.n_iter+1):
        # change sphere radius based on the defined bound
        if bounds[1] == np.inf:
            # make sphere radius larger by a factor of 2
            obj.r = bounds[0] * 2
        else:
            # make sphere radius smaller by a factor of 1/2
            obj.r = (bounds[0] + bounds[1]) / 2

        # re-select samples with the new radius
        if n_iter < obj.n_iter:
            selected = obj.algorithm(X, upper_size)
        # the selected number is sensitive to r
        else:
            selected = obj.algorithm(X,size-1)   

        # adjust lower/upper bounds of radius range
        if len(selected) > size:
            bounds[0] = obj.r
        else:
            bounds[1] = obj.r

        n_iter += 1
PaulWAyers commented 6 months ago

Thanks @xychem !!