tiskw / patchcore-ad

Unofficial implementation of PatchCore and several additional experiments.
MIT License

Make kCenterGreedy faster #8

Open DHAiRYA2048 opened 3 months ago

DHAiRYA2048 commented 3 months ago

The kCenterGreedy step takes ~180s on:

[image]

which is a lot of time for my use case. I am looking for ways to make this step faster (ideally, <75s). Since each iteration requires access to the previous iteration's result, multiprocessing isn't helping: it does bring the time down from 180s to 75s, but the result isn't desirable, since each worker only takes the distances to a single image into account.

Do you have any insights on how this step can be made faster?

```python
import numpy as np
import rich.progress
import sklearn.metrics


class KCenterGreedy:

    def __init__(self, X, y, seed, metric="euclidean"):
        self.X = X
        self.y = y
        self.name = "kcenter"
        self.metric = metric
        self.min_distances = None
        self.n_obs = self.X.shape[0]
        self.already_selected = []

    def update_distances(self, cluster_centers, only_new=True, reset_dist=False):
        # NOTE: self.features is set by select_batch() before this is called.

        if reset_dist:
            self.min_distances = None

        if only_new:
            cluster_centers = [d for d in cluster_centers if d not in self.already_selected]

        if cluster_centers:

            # Update min_distances for all examples given new cluster center.
            x = self.features[cluster_centers]
            dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric)

            if self.min_distances is None:
                self.min_distances = np.min(dist, axis=1).reshape(-1, 1)
            else:
                self.min_distances = np.minimum(self.min_distances, dist)

    def select_batch(self, model, already_selected, N, **kwargs):

        # Assumes that the transform function takes in original data and not flattened data.
        if model is not None: self.features = model.transform(self.X)
        else                : self.features = self.X.reshape((self.X.shape[0], -1))

        # Compute distances to the already selected centers.
        self.update_distances(already_selected, only_new=False, reset_dist=True)

        # Initialize sampling results.
        new_batch = []

        for _ in rich.progress.track(range(N), description="Sampling..."):

            # If no center has been selected yet, start from a random datapoint.
            # (Check min_distances rather than already_selected: the latter is
            # initialized to an empty list and is therefore never None.)
            if self.min_distances is None:
                ind = np.random.choice(np.arange(self.n_obs))

            # Otherwise, pick the point farthest from all current centers.
            else:
                ind = np.argmax(self.min_distances)

            # New examples should not be in already selected since those points
            # should have min_distance of zero to a cluster center.
            assert ind not in already_selected

            self.update_distances([ind], only_new=True, reset_dist=False)

            new_batch.append(ind)

        # Memorize the already selected indices.
        self.already_selected = already_selected

        return new_batch
```
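For context, a minimal usage sketch of the class above (the array shapes, the seed, the empty `already_selected` list, and the `None` model are illustrative, not from the actual pipeline):

```python
import numpy as np

# Illustrative inputs: 10,000 patch feature vectors of dimension 512.
features = np.random.rand(10_000, 512).astype(np.float32)

sampler = KCenterGreedy(X=features, y=None, seed=0)

# Greedily select a 1,000-point coreset, starting from no selected centers.
coreset_indices = sampler.select_batch(model=None, already_selected=[], N=1_000)

print(len(coreset_indices))  # -> 1000
```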

tiskw commented 3 months ago

@DHAiRYA2048 ,

Thank you for opening the issue! You make a very good point. As for KCenterGreedy, I just used Google's original implementation and did not apply any optimizations.

> Since each iteration requires access to the previous iteration's result, multiprocessing isn't helping.

Yes, you're right. But I think we can still parallelize the update of the distance matrix after each new cluster center is selected. To be specific:

Before (L.84 of knnsearch.py):

```python
dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric)
```

After:

```python
dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric, n_jobs=4)
```

In the above code I set the degree of parallelism to 4, but since you seem to have a lot of CPU cores, you can increase that number. If joblib's parallelism is already configured elsewhere, the modification above may have no effect, sorry...
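If it helps, here is a quick micro-benchmark sketch for checking how much `n_jobs` actually buys on your machine (the array shapes are made up for illustration):

```python
import time

import numpy as np
import sklearn.metrics

# Illustrative shapes: 100,000 feature vectors of dimension 512, one new center.
features = np.random.rand(100_000, 512).astype(np.float32)
x = features[:1]

for n_jobs in (1, 2, 4, 8):
    t0 = time.time()
    sklearn.metrics.pairwise_distances(features, x, metric="euclidean", n_jobs=n_jobs)
    print(f"n_jobs={n_jobs}: {time.time() - t0:.3f}s")
```

The gain typically saturates well before the core count, since for large feature arrays the computation tends to be memory-bandwidth-bound.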