mimic sklearn api

knaaptime commented 5 years ago

it would be good to provide a consistent api for all the clustering algos in Region so that the signatures are the same and mimic scikit-learn

currently, the .fit method aliases .fit_from_scipy_sparse_matrix, but it might make more sense to alias fit_from_w instead, since its signature is closer to scikit-learn's AgglomerativeClustering with connectivity constraints
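
For reference, the scikit-learn pattern mentioned here passes the connectivity constraint to the constructor as a sparse matrix and the attribute data to fit. A minimal sketch, assuming a hand-built rook adjacency for a 2x2 lattice as a stand-in for a W object's sparse matrix (the variable names are illustrative only):

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import AgglomerativeClustering

# Rook adjacency for a 2x2 lattice, with node ids laid out as
#   0 1
#   2 3
# Built by hand here as a stand-in for a W object's sparse matrix.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
rows = [i for i, j in edges] + [j for i, j in edges]
cols = [j for i, j in edges] + [i for i, j in edges]
connectivity = sparse.csr_matrix(
    (np.ones(len(rows)), (rows, cols)), shape=(4, 4)
)

# One feature per node, shape (n_samples, n_features).
X = np.array([[0.0], [1.0], [0.0], [0.0]])

# The constraint goes to the constructor, the data goes to fit().
model = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=connectivity
)
labels = model.fit(X).labels_
print(labels)
```

Nodes 0, 2 and 3 share a value and are connected, so they should land in one cluster and node 1 in the other; which cluster gets label 0 is up to scikit-learn.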

there's also some inconsistency in the parameter names (e.g. n_clusters in SKATER, n_regions in maxp)
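
One possible way to converge on a single name without breaking existing callers is to accept both spellings for a while. This is only a sketch: `resolve_n_clusters` is a hypothetical helper, not something that exists in region, and the choice of `n_clusters` as the canonical name is an assumption based on scikit-learn's convention.

```python
import warnings


def resolve_n_clusters(n_clusters=None, n_regions=None):
    """Hypothetical helper: accept either spelling and return one value."""
    if n_clusters is not None and n_regions is not None:
        if n_clusters != n_regions:
            raise ValueError("n_clusters and n_regions disagree")
    if n_regions is not None:
        # Treat n_regions as an alias so older call sites keep working.
        warnings.warn(
            "n_regions is treated as an alias for n_clusters",
            FutureWarning,
        )
        return n_regions
    if n_clusters is None:
        raise ValueError("n_clusters must be given")
    return n_clusters
```

An estimator's constructor or fit method could call this once and use the returned value everywhere else.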

bburns commented 5 years ago

Yeah it would be nice if they all had similar interfaces - I wanted to try out a few clustering algorithms - Agglomerative, AZP, and Skater - so I made some test wrapper classes.

One problem is that Skater needs the full W adjacency matrix object, while sklearn expects the sparse array - so this interface is not quite consistent with sklearn.


import numpy as np
from sklearn.cluster import AgglomerativeClustering
import pysal # Python Spatial Analysis Library
import region # was recently split out from pysal

class Agglomerative:
    """
    Provide uniform api for sklearn's cluster.AgglomerativeClustering class.
    Primarily needed to pass W instead of W.sparse.
    """

    def __init__(self, W=None, n_clusters=2):
        # Compare against None explicitly rather than relying on W's truthiness.
        connectivity = W.sparse if W is not None else None
        self.__model = AgglomerativeClustering(linkage='ward', connectivity=connectivity, n_clusters=n_clusters)

    def fit(self, X, y=None):
        self.__model.fit(X, y)

    @property
    def labels_(self):
        return self.__model.labels_

class Azp:
    """
    Provide uniform api for region's p_regions.azp.AZP class.
    """
    # code: https://github.com/pysal/region/blob/master/region/p_regions/azp.py

    def __init__(self, W, n_clusters=2):
        self.__model = region.p_regions.azp.AZP()
        self.W = W
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.__model.fit_from_scipy_sparse_matrix(adj=self.W.sparse, attr=X, n_regions=self.n_clusters)

    @property
    def labels_(self):
        return self.__model.labels_

class Skater:
    """
    Provide uniform api for region's skater.skater.Spanning_Forest class.
    """
    # code: https://github.com/pysal/region/blob/master/region/skater/skater.py

    def __init__(self, W, n_clusters=2):
        self.__model = region.skater.skater.Spanning_Forest()
        self.W = W
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.__model.fit(n_clusters=self.n_clusters, W=self.W, data=X)

    @property
    def labels_(self):
        return self.__model.current_labels_

algorithms = [Agglomerative, Azp, Skater]

if __name__ == '__main__':

    # node values (just scalars for now)
    values = [0,1,0,0]

    # clustering parameters
    nClusters = 2 # need to specify this in advance

    # get data array
    # note: all the sklearn methods accept standard data matrices of shape [n_samples, n_features].
    # so convert our scalar node values into arrays of 1 feature each.
    areasList = [[a] for a in values] # [[0], [1], [0], [0]]
    areas = np.array(areasList) # region lib needs np arrays

    # get adjacency matrix for a 2x2 grid/lattice
    w = pysal.weights.util.lat2W(2, 2)

    # Comparisons
    for algorithm in algorithms:
        print(algorithm.__name__)
        model = algorithm(n_clusters=nClusters, W=w)
        model.fit(areas)
        print(model.labels_)
        print()

Output (removing the azp print output):

Agglomerative
[0 1 0 0]

Azp
[1. 0. 1. 1.]

Skater
c:/Users/bburns/Desktop/moveto/pipeline/targets/areas/clustering.py:57: OptimizeWarning: By default, the graph is disconnected! Increasing `n_clusters` from 2 to 4 in order to account for islands.
  self.__model.fit(n_clusters=self.n_clusters, W=self.W, data=X)
[0 1 2 3]

Not sure what's up with skater's results - haven't looked into it very much yet.

ljwolf commented 5 years ago

I think skater uses the W object from pysal, and only forces it to be binary, and then only uses the sparse matrix. That'd be real simple to adjust directly in the api, using a connectivity= keyword argument.
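
A minimal sketch of what such a connectivity= argument could accept, assuming the only things needed downstream are the sparse matrix and its binary form. `as_binary_connectivity` and `_StandInW` are hypothetical names used for illustration and are not part of region or pysal; the sketch has not been checked against what skater does internally.

```python
import numpy as np
from scipy import sparse


def as_binary_connectivity(connectivity):
    """Hypothetical helper: return a binary CSR adjacency matrix.

    Accepts either a SciPy sparse matrix or any object exposing a
    `.sparse` attribute (as the W objects in the wrappers above do).
    """
    matrix = getattr(connectivity, "sparse", connectivity)
    # Copy so the caller's weights are left untouched.
    matrix = sparse.csr_matrix(matrix, dtype=float, copy=True)
    matrix.eliminate_zeros()
    # Force every stored weight to 1.
    matrix.data[:] = 1.0
    return matrix


class _StandInW:
    """Stand-in for a W object: only carries a `.sparse` attribute."""

    def __init__(self, matrix):
        self.sparse = matrix


weights = sparse.csr_matrix(np.array([[0.0, 0.5], [2.0, 0.0]]))
from_w = as_binary_connectivity(_StandInW(weights))
from_sparse = as_binary_connectivity(weights)
```

With a helper along these lines, callers could pass either a W object or a plain sparse matrix through the same keyword.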

In my own research code, I've simply moved to using sparse matrices for input, rather than W objects, but idk how @sjsrey or other new maintainers feel about that.

pysal / region

mimic sklearn api #15