Hey, @mthh.
Loved the experience using JenksPy, especially the model-like interface by @yasirroni. My use case involved trying to figure out the ideal class size (similar to this), and I had some code on hand, so I thought it might be a good idea to integrate it into the package. I'm not sure if there is methodological precedent for using elbow charts based on goodness of variance fit to do so, but the added convenience should be nice for people who need it.
At a high level, the elbow_chart function takes in the data array, a lower bound (defaulting to 2), and an upper bound. It then computes the GVF (goodness of variance fit) for each class size between the bounds (inclusive of both) and returns an elbow chart along with the results.
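For context, here is a rough usage sketch of what I have in mind; the return structure (e.g. the dictionary keys) is illustrative and still up for discussion:

```python
# Rough usage sketch of the proposed elbow_chart; the dictionary keys shown
# in the comments are illustrative, not final.
import numpy as np

from jenkspy import elbow_chart  # proposed in this PR

data = np.random.default_rng(0).lognormal(size=1_000)

fig, results = elbow_chart(data, lower_bound=2, upper_bound=10)
print(results)            # e.g. {"n_classes": [2, ..., 10], "gvf": [...]}
fig.savefig("elbow.png")  # or show it interactively
```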
How the Function Operates
Checks that both bounds are integers and that the upper bound is greater than the lower bound. No further validation is done here to avoid redundancy, since the function later triggers validate_input for each n_class value.
Pre-allocates lists to store results instead of appending inside the for loop. Not strictly necessary for performance, but it could help for a large range of n_class values.
Loops over each n_class value and obtains the results.
Collects the results into a dictionary to return later.
Generates the elbow plot from the results.
Returns a tuple containing the plot and the results dictionary (a condensed sketch of these steps follows below).
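For reference, here is a condensed sketch of that flow. It is simplified rather than the exact code in the diff: the signature, variable names, and plot styling are illustrative, and the GVF helper assumes the standard GVF = 1 - SDCM/SDAM definition.

```python
# Condensed sketch of the elbow_chart flow described above; names, the GVF
# helper, and the plot styling are illustrative, not the exact diff.
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np

from jenkspy import jenks_breaks


def _gvf(data: np.ndarray, breaks) -> float:
    """Goodness of variance fit: 1 - SDCM / SDAM."""
    sdam = np.sum((data - data.mean()) ** 2)
    sdcm = 0.0
    for lower, upper in zip(breaks[:-1], breaks[1:]):
        cls = data[(data >= lower) & (data <= upper)]
        if cls.size:
            sdcm += np.sum((cls - cls.mean()) ** 2)
    return 1.0 - sdcm / sdam


def elbow_chart(data, upper_bound: int, lower_bound: int = 2) -> Tuple[plt.Figure, dict]:
    # 1. Validate the bounds; per-n_class validation happens downstream,
    #    as described above.
    if not isinstance(lower_bound, int) or not isinstance(upper_bound, int):
        raise TypeError("lower_bound and upper_bound must be integers")
    if upper_bound <= lower_bound:
        raise ValueError("upper_bound must be greater than lower_bound")

    data = np.asarray(data, dtype=float)
    n_values = range(lower_bound, upper_bound + 1)

    # 2. Pre-allocate the results list.
    gvf_values = [0.0] * len(n_values)

    # 3. Loop over each n_class value and compute its GVF.
    for i, n in enumerate(n_values):
        breaks = jenks_breaks(data, n)
        gvf_values[i] = _gvf(data, breaks)

    # 4. Collect the results into a dictionary.
    results = {"n_classes": list(n_values), "gvf": gvf_values}

    # 5. Generate the elbow plot.
    fig, ax = plt.subplots()
    ax.plot(results["n_classes"], results["gvf"], marker="o")
    ax.set_xlabel("Number of classes")
    ax.set_ylabel("Goodness of variance fit")

    # 6. Return both the plot and the results.
    return fig, results
```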
Changes Made
Import Tuple from typing for the function's return type hint in core.py.
Import matplotlib for plot generation in core.py.
Add function elbow_chart at end of core.py.
Import elbow_chart in __init__.py.
Add elbow_chart to __all__ in __init__.py.
Things to Discuss
I'm not thrilled about the added matplotlib dependency. We have a few options:
We could throw an error instructing the user to install matplotlib if they want to use the function (as seen here); a sketch of what this could look like is below.
Add optional install extras, e.g. [core] or [all].
I briefly considered using something like tkinter (which seems to be built-in) and even ASCII art, but clearly neither is that feasible.
Honestly, I reckon most people who want to use this package will be the type that has matplotlib installed already, but I am curious as to what you think. Any other alternatives we can employ here?
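For the first option, the pattern could be as simple as a lazy import inside the function. This is just a sketch, and the [plot] extra named below is a placeholder, not something that exists today:

```python
# Possible shape for the lazy-import option (illustrative only); the
# `jenkspy[plot]` extra named in the error message is hypothetical.
def elbow_chart(data, upper_bound, lower_bound=2):
    try:
        import matplotlib.pyplot as plt  # only needed for this function
    except ImportError as exc:
        raise ImportError(
            "matplotlib is required for elbow_chart(); install it with "
            "`pip install matplotlib` (or a `jenkspy[plot]` extra, if we add one)"
        ) from exc
    # ... rest of the function as described above ...
```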
Do we need a default value for upper_bound? I don't have a strong opinion on this one, could go either way.
Apologies for the long message, but I wanted to ensure clarity. I also have ideas about integrating parallel processing into the for loop for additional performance gains, as well as a sampling strategy (as mentioned here), but I think that can wait for another PR.
It has been a long time since I made a contribution here, but I will give two opinions based on my experience.
You should not optimize the number of groups, unless you are optimizing on your sample data and later using the number found for the larger original data. As you already know, this is due to the upper (or lower) bound, especially for normalization.
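If it helps, here is a small illustration of that workflow; the elbow_chart call is the function proposed in this PR, and the sizes and chosen class count are example values only:

```python
# Illustration of picking the class count on a sample, then reusing it on
# the full data; sizes and the chosen count are example values.
import numpy as np

from jenkspy import jenks_breaks

rng = np.random.default_rng(42)
full_data = rng.lognormal(size=5_000)

# Optimize the number of classes on a sample only...
sample = rng.choice(full_data, size=1_000, replace=False)
# fig, results = elbow_chart(sample, lower_bound=2, upper_bound=12)  # proposed function
chosen_n = 5  # e.g. the value read off the elbow in the chart

# ...then apply that count to the full dataset.
breaks = jenks_breaks(full_data, chosen_n)
```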