oturns / geosnap

The Geospatial Neighborhood Analysis Package
https://oturns.github.io/geosnap-guide
BSD 3-Clause "New" or "Revised" License

add best_cluster and best_cluster_spatial #277

Closed: AnGWar26 closed this 2 years ago

AnGWar26 commented 3 years ago

These functions determine the best clustering algorithm for a given set of parameters.

I wrote some quick functions that did this for something I was doing at work, so I decided to clean them up a little and see what you thought of them as a PR. A possible pitfall here is that running these takes a pretty long time on bigger datasets, particularly with clusterers like affinity_propagation.
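
For illustration, here is a minimal sketch of the general idea, assuming scikit-learn clusterers and silhouette scoring (the names and defaults are illustrative, not the code in this PR):

# Illustrative sketch only -- not the PR's implementation.
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def best_cluster(X, n_clusters=6, random_state=0):
    """Fit several clusterers on the same standardized data and rank them by silhouette."""
    X = StandardScaler().fit_transform(X)
    candidates = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state),
        "ward": AgglomerativeClustering(n_clusters=n_clusters),
        # affinity propagation picks its own number of clusters and is slow on large inputs
        "affinity_propagation": AffinityPropagation(random_state=random_state),
    }
    scores = {}
    for name, model in candidates.items():
        labels = model.fit_predict(X)
        if len(np.unique(labels)) > 1:  # silhouette is undefined for a single cluster
            scores[name] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores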

Let me know what you think!

codecov[bot] commented 3 years ago

Codecov Report

Merging #277 into master will decrease coverage by 12.01%. The diff coverage is 6.25%.


@@             Coverage Diff             @@
##           master     #277       +/-   ##
===========================================
- Coverage   81.58%   69.57%   -12.02%     
===========================================
  Files          12       12               
  Lines        1184     1216       +32     
===========================================
- Hits          966      846      -120     
- Misses        218      370      +152     
Impacted Files                Coverage Δ
geosnap/_community.py         67.06% <6.25%> (-7.56%)  ↓
geosnap/io/storage.py         24.66% <0.00%> (-70.67%) ↓
geosnap/analyze/dynamics.py   61.29% <0.00%> (-38.71%) ↓
geosnap/_data.py              76.09% <0.00%> (+2.92%)  ↑


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 158315c...3642d78.

knaaptime commented 3 years ago

This is cool, thanks for opening it! It's a great idea, so let's think about how best to generalize it and consider a few other issues:

  1. there are other parameters we might want to optimize (k, W, variable subset, algorithm-specific params like damping, etc)
  2. there are other metrics besides silhouette score that could be optimization targets (e.g. gap statistic or BIC for a mixture model, or others; a short sketch follows this list)
  3. sometimes there are reasonable theoretical justifications for choosing a particular clustering algorithm over another (or k, W, etc), so it would be good to provide an interface where users can choose which params are fixed, and which others should be optimized according to some statistic.
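
As a concrete illustration of point 2, one target that is easy to show with scikit-learn alone is BIC for a Gaussian mixture; the sketch below is hypothetical, not existing geosnap or spopt functionality:

# Hedged sketch: choosing the number of mixture components by BIC (lower is better).
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X, ks=range(2, 11), random_state=0):
    """Fit a mixture for each candidate k and return the k with the lowest BIC."""
    bics = {
        k: GaussianMixture(n_components=k, random_state=random_state).fit(X).bic(X)
        for k in ks
    }
    return min(bics, key=bics.get), bics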

I think in geosnap we want to make these options pretty simple and then let the software come back with the answer. On the backend, though, I think spopt is a better place for most of the heavy lifting to actually be implemented. This would follow the same model we use with tobler, where the real interpolation functionality is provided there and we wrap some convenience functions in geosnap.

This is all in line with some of the enhancements we wanted to target over the next several months (to both packages), so let's loop in @jgaboardi and @xf37.

AnGWar26 commented 3 years ago

Just spitballing here, as my exposure to spopt's structure is minimal and I don't have a clear sense of the vision for what should live in spopt versus what should live in geosnap.

It seems to me that there is a question of whether we want one function to do all sorts of optimizations, or many functions to do separate types of optimizations. Do we want to create best_cluster_weights, best_cluster_vars and so on to determine the best of each type? Or do we want best_cluster to be able to determine the best variable subset, algorithm, and algorithm parameters?

My concern with the single-function approach is that it becomes a behemoth that takes 6 hours to run before it can return its results, and my concern with splitting things into many functions is that we lose functionality (separating the pieces means we can't test parameters that interact with each other).

> there are other parameters we might want to optimize (k, W, variable subset, algorithm-specific params like damping, etc)

We could create an interface like affinity_propagation_kws, kmeans_kws and so on, so that the **kwargs for each algorithm can be reached from the function. Inside each of these, we could allow a k and a W object to be passed, as well as control of each algorithm's specific parameters. The values set in these interfaces could be treated as fixed, while the remaining algorithm parameters are tested dynamically. We could also let the user specify, through these same interfaces, the range over which each algorithm-specific parameter should be tested.
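
A rough sketch of how that could work (the signature and default grids below are hypothetical): anything passed as a scalar inside a *_kws dict is treated as fixed, and anything passed as a list is swept.

# Hypothetical sketch of the *_kws idea: scalar values are fixed, list values are swept.
from itertools import product
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.metrics import silhouette_score

def best_cluster(X, kmeans_kws=None, affinity_propagation_kws=None):
    """Search each algorithm's grid and return the best (name, params, score)."""
    specs = {
        KMeans: {"n_clusters": [4, 6, 8], "n_init": 10, **(kmeans_kws or {})},
        AffinityPropagation: {"damping": [0.5, 0.7, 0.9], **(affinity_propagation_kws or {})},
    }
    best = (None, None, -1.0)
    for cls, params in specs.items():
        swept = [k for k, v in params.items() if isinstance(v, (list, tuple, range))]
        fixed = {k: v for k, v in params.items() if k not in swept}
        for combo in product(*(params[k] for k in swept)):
            kwargs = {**fixed, **dict(zip(swept, combo))}
            labels = cls(**kwargs).fit_predict(X)
            if len(set(labels)) < 2:  # silhouette needs at least two clusters
                continue
            score = silhouette_score(X, labels)
            if score > best[2]:
                best = (cls.__name__, kwargs, score)
    return best

# e.g. pin damping on theoretical grounds and let everything else be swept:
# best_cluster(X, affinity_propagation_kws={"damping": 0.9})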

As far as determining the best variable subset goes, we could also create logic that iterates over a list of columns that the user wants to test and returns the results of each iteration.
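
A hedged sketch of that iteration (the function and argument names are made up for illustration, and a pandas/GeoPandas DataFrame input is assumed):

# Score each user-supplied subset of columns with a fixed clusterer.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_variable_subsets(df, candidate_subsets, k=6):
    """Cluster on each candidate column subset and record its silhouette score."""
    results = {}
    for subset in candidate_subsets:
        X = df[list(subset)].values
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        results[tuple(subset)] = silhouette_score(X, labels)
    return results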

> there are other metrics besides silhouette score that could be optimization targets (e.g. gap statistic or BIC for a mixture model, or others)

We could create a parameter optimization_target that takes arguments like silhouettes, path_silhouettes, gap_statistic, etc. We could also have this be set on a case-by-case basis from the interfaces mentioned above.
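
One way to wire that up is a simple dispatch table; the sketch below only registers metrics that ship with scikit-learn, but path silhouettes, the gap statistic, or BIC would plug in the same way (the names here are illustrative):

# Hedged sketch of an optimization_target switch.
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

OPTIMIZATION_TARGETS = {
    "silhouettes": silhouette_score,
    "calinski_harabasz": calinski_harabasz_score,
    "davies_bouldin": davies_bouldin_score,  # note: lower is better for this one
}

def score_partition(X, labels, optimization_target="silhouettes"):
    """Score a labeling with the metric named by optimization_target."""
    return OPTIMIZATION_TARGETS[optimization_target](X, labels)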

Thanks for looking into this.

knaaptime commented 2 years ago

This will need to be reworked a bit to fit into the new structure, so I'm going to close it for now, but we will definitely return to it. Thanks for your work on this, Andrew.