oturns / geosnap

The Geospatial Neighborhood Analysis Package
https://oturns.github.io/geosnap-guide
BSD 3-Clause "New" or "Revised" License
247 stars 32 forks

consider storing cluster metadata on a `Community` #144

knaaptime closed 4 years ago

knaaptime commented 5 years ago

Currently, when a user calls `Community.cluster`, the method returns a new `Community` with the cluster labels appended as a new column on the `gdf`. But it's common for users to try out several clustering methods, algorithms, `k` parameters, etc., and if each of those explorations is sourced from the same `Community`, it might be nice to keep the metadata attached alongside each new set of cluster labels.

Currently, this is handled by letting the user return both the underlying cluster instance and the new community by passing `return_cluster=True`. That's a flexible way to keep both the community and the metadata behind the cluster labels, but the user has to handle that data management herself. It might be nice to include a dict or something similar as an attribute of the community that handles some of this.

weikang9009 commented 5 years ago

A dict as an attribute of the Community class is a great strategy to store the metadata.

knaaptime commented 4 years ago

What should we call this attribute, and what do we want to store?

What about something like `models`, a dict that stores each model instance keyed on the model name? `models` is maybe too generic if we're only storing clusters. We could use `clusters` (though we already have the `cluster` and `cluster_spatial` methods).

so like,

columbus = columbus.cluster(columns=['median_household_income', 'p_poverty_rate', 'p_edu_college_greater', 'p_unemployment_rate'], method='ward')

would store a new entry in `clusters`, so if you did

columbus.clusters['ward']

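you'd get back the fitted scikit-learn Ward model (in current scikit-learn that is an `AgglomerativeClustering` instance with `linkage='ward'`, since the old `sklearn.cluster.Ward` class has been removed). The name of the label column on the gdf always matches the key in `Community.clusters`.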
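To make the proposal above concrete, here is a minimal, standard-library-only sketch of the "dict keyed on model name" idea. `ToyCommunity` and `MeanSplit` are hypothetical stand-ins invented for illustration; they are not geosnap or scikit-learn classes, and the toy model does not perform Ward clustering (the key `'ward'` is only a name here). The sketch assumes `cluster` returns a new object and carries forward any earlier entries.

```python
from dataclasses import dataclass, field
from statistics import mean


class MeanSplit:
    """Hypothetical stand-in for a fitted clusterer (not a real library class)."""

    def fit(self, rows):
        # Label a row 1 if its row mean is above the overall mean, else 0.
        scores = [mean(r) for r in rows]
        cutoff = mean(scores)
        self.labels_ = [int(s > cutoff) for s in scores]
        return self


@dataclass
class ToyCommunity:
    """Hypothetical, minimal stand-in for Community: a table plus a model registry."""

    gdf: dict                                     # column name -> list of values
    clusters: dict = field(default_factory=dict)  # model name -> fitted model

    def cluster(self, columns, method):
        rows = list(zip(*(self.gdf[c] for c in columns)))
        model = MeanSplit().fit(rows)
        # Build a new object so the original is untouched; the label column
        # and the registry entry share the same name.
        new_gdf = {**self.gdf, method: model.labels_}
        new_clusters = {**self.clusters, method: model}
        return ToyCommunity(gdf=new_gdf, clusters=new_clusters)


columbus = ToyCommunity(gdf={
    "median_household_income": [30.0, 32.0, 80.0, 85.0],
    "p_poverty_rate": [25.0, 22.0, 5.0, 4.0],
})
columbus = columbus.cluster(
    columns=["median_household_income", "p_poverty_rate"], method="ward"
)
fitted = columbus.clusters["ward"]  # the stored model behind the "ward" column
```

Because the label column and the registry key share a name, a user can go from any column of labels on the table straight to the model that produced it.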

knaaptime commented 4 years ago

Took a shot at this in #158, which adds a `models` attribute to the `Community`. Right now it only stores clusters, but I think it would make sense to do the same thing with sequence and transition models. The sequence models in particular could probably adopt the same convention.

knaaptime commented 4 years ago

Currently, this stores a namedtuple with `X`, the labels, the column names, and `W` (if there is one). In short, everything you need for a silhouette or geosilhouette score, plus the column names so you can keep track of which model is which.

knaaptime commented 4 years ago

oh and the model instance itself
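As a rough illustration of the record described above, the sketch below stores `X`, the labels, the column names, `W`, and a slot for the model instance in a namedtuple, then computes a silhouette score from the stored fields alone. The name `ModelResults`, its field names, and the sample data are assumptions made for this example; the actual structure in #158 may differ. The silhouette is written out in plain Python only to keep the example self-contained, and it assumes at least two clusters.

```python
from collections import namedtuple
from math import dist
from statistics import mean

# Field names are illustrative; the thread only lists what gets stored.
ModelResults = namedtuple("ModelResults", ["X", "columns", "labels", "instance", "W"])


def silhouette(X, labels):
    """Mean silhouette coefficient over all samples (needs two or more clusters)."""
    scores = []
    for i, (xi, li) in enumerate(zip(X, labels)):
        # Distances to the other members of the sample's own cluster.
        same = [dist(xi, xj) for j, (xj, lj) in enumerate(zip(X, labels))
                if lj == li and j != i]
        if not same:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = mean(same)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(xi, xj) for xj, lj in zip(X, labels) if lj == other)
            for other in set(labels) if other != li
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


record = ModelResults(
    X=[(30.0, 25.0), (32.0, 22.0), (80.0, 5.0), (85.0, 4.0)],
    columns=["median_household_income", "p_poverty_rate"],
    labels=[0, 0, 1, 1],
    instance=None,  # placeholder for the fitted estimator
    W=None,         # no spatial weights in this non-spatial example
)
models = {"ward": record}

# Everything needed for the score comes from the stored record.
score = silhouette(models["ward"].X, models["ward"].labels)
```

In practice one would more likely pass the stored `X` and `labels` to `sklearn.metrics.silhouette_score`; the point is only that the record carries everything the score needs.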

knaaptime commented 4 years ago

resolved by #158