Open xychem opened 1 year ago
One possibility is to decrease r until the desired number of samples (or more) is selected. When a slight change in r moves the selection from fewer samples than needed to more samples than needed, the last added samples will be "degenerate" (quoting @PaulWAyers). In that case, we can just drop the last selected samples until only the required number remains.
@xychem Hello, this is Aditi. Can I give it a try? Could you please give me more detail on it?
@FanwangM can you help @Aditish51 with this?
Hi, Aditi. I hope the information below helps!
When you run the DSE example in the notebook, the functions are called in the order below.
The idea of DSE is as follows (the code is in the algorithm method in DiverseSelector/methods/partition.py):
# calculate distance of all samples from reference sample; distance is a (n_samples,) array
distances = scipy.spatial.minkowski_distance(X[self.ref_index], X, p=self.p)
# find index of all samples within radius of sample idx (this includes the sample
# index itself)
index_exclude = kdtree.query_ball_point(
    X[idx], self.r, eps=self.eps, p=self.p, workers=-1
)
# exclude samples within radius r of sample idx (measure by Minkowski p-norm) from
# future consideration by setting their bitarray value to 1
for index in index_exclude:
    bv[index] = 1
if len(selected) > max_size:
    return selected
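To make the excerpt above easier to follow, here is a minimal brute-force sketch of the same idea. It is not the DiverseSelector implementation: the function name `sphere_exclusion_sketch` and its parameters are made up for illustration, and it replaces the KDTree and bitarray with plain NumPy arrays.

```python
import numpy as np


def sphere_exclusion_sketch(X, r, ref_index=0, p=2.0, max_size=None):
    """Greedy sphere exclusion, brute force (hypothetical helper, illustration only)."""
    X = np.asarray(X, dtype=float)
    # Minkowski p-norm distance of every sample from the reference sample
    dist_to_ref = np.linalg.norm(X - X[ref_index], ord=p, axis=1)
    # visit samples from closest to farthest from the reference
    order = np.argsort(dist_to_ref, kind="stable")
    excluded = np.zeros(len(X), dtype=bool)
    selected = []
    for idx in order:
        if excluded[idx]:
            continue
        selected.append(int(idx))
        # stop early once more than max_size samples have been selected
        if max_size is not None and len(selected) > max_size:
            return selected
        # exclude every sample within radius r of the new sample (including itself)
        excluded |= np.linalg.norm(X - X[idx], ord=p, axis=1) <= r
    return selected


# five equally spaced points on a line, reference sample at index 0
X_line = np.arange(5, dtype=float).reshape(-1, 1)
small_r = sphere_exclusion_sketch(X_line, r=0.5)
medium_r = sphere_exclusion_sketch(X_line, r=1.5)
large_r = sphere_exclusion_sketch(X_line, r=10.0)
```

On this toy data a larger radius leaves fewer selected points, which is the behaviour the radius optimization relies on.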
When r is larger, fewer points are selected; when r is smaller, more points are selected (in both cases the selected points are more than r apart from each other). So the important step is to optimize r (the radius) until the number of selected points equals the number we want; this is coded in DiverseSelector/methods/utils.py. This issue is about the fact that, at certain values of r, some points are "degenerate" (quoting @PaulWAyers). We can see in the list above that when r > 1.919372827, the number of selected points is 3; when r $\leq$ 1.919372826, it is 5.
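A toy example may help show why a target count can be unreachable. This is only a sketch with made-up data and a hypothetical helper `count_selected` (a brute-force Euclidean pass, not the library code): two samples sit at exactly the same distance from the reference, so they are excluded or selected together.

```python
import numpy as np


def count_selected(X, r, ref_index=0):
    """Count samples picked by a brute-force sphere-exclusion pass (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # visit samples from closest to farthest from the reference sample
    order = np.argsort(np.linalg.norm(X - X[ref_index], axis=1), kind="stable")
    excluded = np.zeros(len(X), dtype=bool)
    n_selected = 0
    for idx in order:
        if not excluded[idx]:
            n_selected += 1
            # exclude everything within radius r of the newly selected sample
            excluded |= np.linalg.norm(X - X[idx], axis=1) <= r
    return n_selected


# reference at the origin, two samples at exactly distance 1 on opposite sides
X_toy = np.array([[0.0, 0.0], [-1.0, 0.0], [1.0, 0.0]])
just_above = count_selected(X_toy, 1.0 + 1e-9)
just_below = count_selected(X_toy, 1.0 - 1e-9)
```

Here the count goes straight from 1 to 3 as r crosses 1, so no radius selects exactly 2 samples. The jump from 3 to 5 reported above looks like the same kind of behaviour.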
My idea is the same as @marco-2023's: I dropped the last selected samples after the iteration. However, I think that which samples are dropped may change the outcome (more or less) when only a few samples are selected (in the notebook we select just 4 points in each cluster, so each sample carries more weight). The code below (in DiverseSelector/methods/utils.py) is one way to drop the last selected samples. I don't think it's a good way, because the if condition inside the while loop adds extra computation. I think you can drop the last selected samples directly with array.pop() outside the while loop; that way you can also drop specific selected samples, not just the last one.
while (len(selected) < lower_size or len(selected) > upper_size) and (n_iter < obj.n_iter + 1):
    # change sphere radius based on the defined bound
    if bounds[1] == np.inf:
        # make sphere radius larger by a factor of 2
        obj.r = bounds[0] * 2
    else:
        # make sphere radius smaller by a factor of 1/2
        obj.r = (bounds[0] + bounds[1]) / 2
    # re-select samples with the new radius
    if n_iter < obj.n_iter:
        selected = obj.algorithm(X, upper_size)
    # the selected number is sensitive to r
    else:
        selected = obj.algorithm(X, size - 1)
    # adjust lower/upper bounds of radius range
    if len(selected) > size:
        bounds[0] = obj.r
    else:
        bounds[1] = obj.r
    n_iter += 1
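For comparison, here is a rough sketch of the alternative mentioned above, where the extra samples are dropped once, outside the loop. Everything in it is an assumption for illustration: `optimize_radius_sketch` is a made-up name, `select(r)` stands for any callable that returns the selected indices in selection order, the target is an exact `size` rather than a tolerance window, and the toy `select` only mimics the 3-to-5 jump reported in this thread.

```python
import math


def optimize_radius_sketch(select, size, r0, n_iter=30):
    """Bisect the radius, then truncate after the loop if `size` is never hit.

    Assumes r0 > 0 and that select(r) lists indices in selection order.
    """
    r = r0
    selected = select(r)
    # bounds = [radius that selects too many, radius that selects too few]
    bounds = [r, math.inf] if len(selected) > size else [0.0, r]
    # remember the latest selection that had more samples than needed
    oversized = (selected, r) if len(selected) > size else None
    for _ in range(n_iter):
        if len(selected) == size:
            return selected, r
        if math.isinf(bounds[1]):
            r = bounds[0] * 2  # no upper bound yet: double the radius
        else:
            r = (bounds[0] + bounds[1]) / 2  # bisect between the bounds
        selected = select(r)
        if len(selected) > size:
            bounds[0] = r
            oversized = (selected, r)
        else:
            bounds[1] = r
    if len(selected) == size or oversized is None:
        # exact hit on the last step, or only undersized selections were seen
        return selected, r
    # drop the last selected samples once, outside the loop
    kept, r_kept = oversized
    return kept[:size], r_kept


def toy_select(r):
    # hypothetical stand-in for the selector: 5 samples below the threshold, 3 above
    return list(range(5)) if r <= 1.919372826 else list(range(3))


picked, r_used = optimize_radius_sketch(toy_select, size=4, r0=1.0)
```

With the toy selector no radius gives 4 samples, so the sketch falls back to the oversized selection closest to the transition and keeps its first 4 entries. Replacing `kept[:size]` with a different rule would let you drop specific samples instead of the last ones.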
Thanks @xychem !!
In the DirectedSphereExclusion method, the number of selected samples depends on r (the radius). When r is larger, we get fewer molecules; when r is smaller, we get more. The optimize_radius function in utils.py optimizes r iteratively. (When too many samples are selected, we increase r; when too few are selected, we decrease r.)
But when we choose 12 points in 3 clusters using DirectedSphereExclusion, the number of selected samples is so sensitive to r that it oscillates. We can see that when r > 1.919372827, the number selected is 3; when r $\leq$ 1.919372826, it is 5. (This means there are two points that are "close" enough.)
The previous situation (11 points)
The present situation (13 points)