scikit-learn-contrib / scikit-learn-extra

scikit-learn contrib estimators
https://scikit-learn-extra.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Notebook for Min-Max linkage #93 #125

Closed NimaSarajpoor closed 1 year ago

NimaSarajpoor commented 3 years ago

Create Tutorial Notebook for Min-Max linkage and provide its naive implementation, corresponding to #93

Hello,

This is my first time trying to contribute to scikit-learn-extra. It was hard for me to understand the source code, so as a first step I created a notebook and provided a naive implementation. I believe this can help us with the unit tests.
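
A naive implementation of this kind could look like the following sketch (illustrative only, not the notebook's actual code; it assumes the standard minimax-linkage definition d(G, H) = min_{x in G ∪ H} max_{y in G ∪ H} d(x, y), and the function names are mine):

import numpy as np
from sklearn.metrics import pairwise_distances


def minimax_linkage(D, members_a, members_b):
    # Smallest radius of a ball centered at a member of the union that
    # covers all points of both clusters.
    union = np.concatenate([members_a, members_b])
    return D[np.ix_(union, union)].max(axis=1).min()


def naive_minmax_clustering(X, n_clusters=2):
    # Naive agglomeration: start from singletons and repeatedly merge
    # the pair of clusters with the smallest minimax linkage.
    D = pairwise_distances(X)
    clusters = [np.array([i]) for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = minimax_linkage(D, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

Such a brute-force version is only practical for small inputs, but it can serve as a reference implementation for the unit tests.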

I was wondering if someone could guide me through reading the source code and help me implement this for the scikit-learn-extra package.

TimotheeMathieu commented 3 years ago

Hello,

Thank you, this looks interesting.

I think, given the subject, you can try to understand the outline of the KMedoids code and do the same; the idea is to stay in line with scikit-learn code. To help you, here is the skeleton I would use if I were writing the code for your notebook, in a file such as sklearn_extra/cluster/agglomerative_clustering_minmax.py:

from sklearn.base import BaseEstimator, ClusterMixin, TransformerMixin


class AgglomerativeClusteringMinMax(BaseEstimator, ClusterMixin, TransformerMixin):
    """
    # Add a short description here
    Parameters
    ----------
    n_clusters : int, default=8
        The number of clusters to find.

    metric : "euclidean" (default) or "precomputed"
        Metric used to compute the distance between any two samples.

    max_iter : int, default=300
        Maximum number of iterations.
    Attributes
    ----------
    # Add attributes here. For ClusterMixin you must have at least labels_
    # and n_iter_. I don't think there are any centers or inertia in your
    # definition of clustering, so at first labels_ and n_iter_ alone are alright.
    Examples
    --------
    # Add a short example of use
    References
    ----------
    # Add the article
    """
    def __init__(
        self,
        n_clusters=8,
        metric="euclidean",
        max_iter=300,
    ):
        self.n_clusters = n_clusters
        self.metric = metric  # instead of affinity I would use metric, to be homogeneous with KMedoids
        self.max_iter = max_iter  # bound on the number of iterations, so a bug cannot cause an infinite loop

    def fit(self, X, y=None):
        """Fit agglomerative clustering to the provided data.
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = (n_samples, n_features), \
                or (n_samples, n_samples) if metric == 'precomputed'
            Dataset to cluster.
        y : Ignored
        Returns
        -------
        self
        """
        # Put your code here to compute labels_.
        # labels_ is what you called "clusters", i.e. the output of your
        # algorithm, and n_iter_ is the number of iterations used.
        return self

    def transform(self, X):
        """Transform X to cluster-distance space.
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_query, n_features), \
                or (n_query, n_indexed) if metric == 'precomputed'
            Data to transform.
        Returns
        -------
        X_new : {array-like, sparse matrix}, shape=(n_query, n_clusters)
        """
        # I am not certain that it is alright to do it like this, but here is
        # what I would do: output the result of the r function for each point,
        # i.e. min_{x in C} d_max(x, C). Here r is a placeholder for the
        # radius function from your notebook.
        return r(X)

    def predict(self, X):
        """Predict the closest cluster for each sample in X.
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_query, n_features), \
                or (n_query, n_indexed) if metric == 'precomputed'
            New data to predict.
        Returns
        -------
        labels : array, shape = (n_query,)
            Index of the cluster each sample belongs to.
        """
        # Add the cluster computation here; it can be used on new points,
        # not only on the training data. `clusters` is a placeholder for
        # the computed labels.
        return clusters

It would also be good to add some helper functions to the class, for instance one that updates the distance matrix after each merge, as sketched below.
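
To make that concrete, here is one possible sketch of such a helper (a sketch under assumptions, not a required design: it expects a point-level distance matrix D, a dict of cluster member indices, and a dict of pairwise linkage values; since, as far as I know, minimax linkage has no Lance-Williams-style recurrence, it is recomputed from the points after each merge):

import numpy as np


def update_linkages(D, members, merged_id, linkage):
    # After merging two clusters into `merged_id`, recompute the minimax
    # linkage between the merged cluster and every other live cluster.
    # D: (n_samples, n_samples) point-level distance matrix
    # members: dict mapping cluster id -> array of point indices
    # linkage: dict mapping frozenset({id_a, id_b}) -> minimax linkage value
    merged = members[merged_id]
    for other_id, other in members.items():
        if other_id == merged_id:
            continue
        union = np.concatenate([merged, other])
        # Smallest covering radius over candidate centers in the union.
        linkage[frozenset((merged_id, other_id))] = (
            D[np.ix_(union, union)].max(axis=1).min()
        )
    return linkage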

Once the code is done, you would also have to add unit tests and documentation.
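
For instance, a first unit test could check the estimator on an easy two-blob dataset (a hypothetical sketch: the class name comes from the skeleton above and does not exist yet):

import numpy as np

from sklearn_extra.cluster import AgglomerativeClusteringMinMax  # hypothetical, from the skeleton above


def test_two_well_separated_blobs():
    # Points within the same blob should end up in the same cluster.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10])

    est = AgglomerativeClusteringMinMax(n_clusters=2).fit(X)

    assert len(set(est.labels_[:20])) == 1
    assert len(set(est.labels_[20:])) == 1
    assert est.labels_[0] != est.labels_[-1]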

Don't hesitate to ask if you have any questions. I wrote part of the KMedoids code, so I understand it pretty well.

EDIT: most of the code is a suggestion, not an official guideline; feel free to improve on the basic ideas I gave.

NimaSarajpoor commented 3 years ago

@TimotheeMathieu Thanks a lot for your input. You are right. I might take a look at AgglomerativeClustering in scikit-learn as well and see how the clusters are stored at each cut.

I'll start with your guideline. Thanks!!