Closed: knaaptime closed this issue 4 years ago
Yeah, it would be nice if they all had similar interfaces. I wanted to try out a few clustering algorithms (Agglomerative, AZP, and SKATER), so I made some test wrapper classes.
One problem is that SKATER needs the full W adjacency object, while sklearn expects the sparse matrix, so this interface is not quite consistent with sklearn.
```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import pysal   # Python Spatial Analysis Library
import region  # was recently split out from pysal


class Agglomerative:
    """
    Provide a uniform API for sklearn's cluster.AgglomerativeClustering class.
    Primarily needed to pass W instead of W.sparse.
    """
    def __init__(self, W=None, n_clusters=2):
        connectivity = W.sparse if W is not None else None
        self.__model = AgglomerativeClustering(
            linkage='ward', connectivity=connectivity, n_clusters=n_clusters)

    def fit(self, X, y=None):
        self.__model.fit(X, y)

    @property
    def labels_(self):
        return self.__model.labels_


class Azp:
    """
    Provide a uniform API for region's p_regions.azp.AZP class.
    """
    # code: https://github.com/pysal/region/blob/master/region/p_regions/azp.py
    def __init__(self, W, n_clusters=2):
        self.__model = region.p_regions.azp.AZP()
        self.W = W
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.__model.fit_from_scipy_sparse_matrix(
            adj=self.W.sparse, attr=X, n_regions=self.n_clusters)

    @property
    def labels_(self):
        return self.__model.labels_


class Skater:
    """
    Provide a uniform API for region's skater.skater.Spanning_Forest class.
    """
    # code: https://github.com/pysal/region/blob/master/region/skater/skater.py
    def __init__(self, W, n_clusters=2):
        self.__model = region.skater.skater.Spanning_Forest()
        self.W = W
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.__model.fit(n_clusters=self.n_clusters, W=self.W, data=X)

    @property
    def labels_(self):
        return self.__model.current_labels_


algorithms = [Agglomerative, Azp, Skater]

if __name__ == '__main__':
    # node values (just scalars for now)
    values = [0, 1, 0, 0]

    # clustering parameters
    nClusters = 2  # need to specify this in advance

    # get data array
    # note: the sklearn methods accept data matrices of shape [n_samples, n_features],
    # so convert our scalar node values into arrays of one feature each.
    areasList = [[a] for a in values]  # [[0], [1], [0], [0]]
    areas = np.array(areasList)        # the region lib needs np arrays

    # get adjacency matrix for a 2x2 grid/lattice
    w = pysal.weights.util.lat2W(2, 2)

    # comparisons
    for algorithm in algorithms:
        print(algorithm.__name__)
        model = algorithm(n_clusters=nClusters, W=w)
        model.fit(areas)
        print(model.labels_)
        print()
```
Output (removing the AZP print output):

```
Agglomerative
[0 1 0 0]
Azp
[1. 0. 1. 1.]
Skater
c:/Users/bburns/Desktop/moveto/pipeline/targets/areas/clustering.py:57: OptimizeWarning: By default, the graph is disconnected! Increasing `n_clusters` from 2 to 4 in order to account for islands.
  self.__model.fit(n_clusters=self.n_clusters, W=self.W, data=X)
[0 1 2 3]
```
Not sure what's up with SKATER's results; I haven't looked into it very much yet.
I think SKATER takes the W object from pysal, only forces it to be binary, and then only uses the sparse matrix. That would be really simple to adjust directly in the API, using a `connectivity=` keyword argument.

In my own research code, I've simply moved to using sparse matrices for input rather than W objects, but I don't know how @sjsrey or other new maintainers feel about that.
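To illustrate that idea, here is a minimal sketch of how a wrapper could normalize its input, accepting either a pysal-style W object (which exposes a `.sparse` attribute) or a scipy sparse matrix directly. The function name `as_connectivity` is illustrative, not part of any library:

```python
import numpy as np
from scipy import sparse

def as_connectivity(w_or_sparse):
    """Normalize input to a binary scipy.sparse CSR adjacency matrix.

    Accepts either an object exposing a `.sparse` attribute (like
    pysal's W) or a scipy sparse matrix directly.
    """
    adj = getattr(w_or_sparse, "sparse", w_or_sparse)
    adj = sparse.csr_matrix(adj)
    # force binary weights, mirroring what SKATER reportedly does internally
    adj.data = np.ones_like(adj.data)
    return adj

# rook adjacency for a 2x2 lattice, built by hand for the example
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [1, 2, 0, 3, 0, 3, 1, 2]
adj = sparse.csr_matrix((np.ones(8), (rows, cols)), shape=(4, 4))

conn = as_connectivity(adj)
print(conn.nnz)         # 8 stored edges (directed pairs)
print(conn.data.max())  # 1.0, i.e. binary
```

With this in place, each estimator's `__init__` could take a single `connectivity=` argument and stop caring whether the caller holds a W or a sparse matrix.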
It would be good to provide a consistent API for all the clustering algorithms in region, so that the signatures are the same and mimic scikit-learn's.

Currently, the `.fit` method aliases `.fit_from_scipy_sparse_matrix`, but it might make more sense to instead alias `fit_from_w`, which has a signature similar to scikit-learn's agglomerative clustering with connectivity constraints. There's also some inconsistency in the parameter names (e.g. `n_clusters` in SKATER, `n_regions` in max-p).
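As a rough sketch of what that unified signature could look like (the class and its placeholder logic are illustrative only, not region's actual API): hyperparameters go in `__init__`, data goes in `fit`, and results land in trailing-underscore attributes, following scikit-learn's convention.

```python
import numpy as np

class SklearnStyleClusterer:
    """Illustrative signature only -- not region's actual API.

    Mirrors the scikit-learn convention: hyperparameters in __init__,
    data in fit(X), results in trailing-underscore attributes.
    """
    def __init__(self, n_clusters=2, connectivity=None):
        self.n_clusters = n_clusters
        self.connectivity = connectivity  # scipy sparse adjacency, or None

    def fit(self, X, y=None):
        X = np.asarray(X)
        # placeholder assignment (round-robin labels), just to show the shape
        # of the result; a real estimator would cluster X here
        self.labels_ = np.arange(len(X)) % self.n_clusters
        return self  # sklearn convention: fit returns self

model = SklearnStyleClusterer(n_clusters=2).fit([[0], [1], [0], [0]])
print(model.labels_)  # [0 1 0 1]
```

If every region estimator followed this shape, the wrapper classes at the top of this thread would become unnecessary.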