Closed imamilion closed 2 years ago
There is such a large variety of potential distance functions to use for numerical clustering that I prefer to leave it to the users to provide them. kmodes
specializes more in categorical distance functions.
But of course, feel free to submit a PR to add the function. :)
As for the pickling error, I'm not able to reproduce it:
>>> import numpy as np
>>> from kmodes.kprototypes import KPrototypes
>>> def L1(a, b):
...     return np.sum(np.abs(a-b), axis=1)
...
>>> model = KPrototypes(n_clusters=20, gamma=1, num_dissim=L1, init='Cao')
>>> model
KPrototypes(gamma=1, n_clusters=20, num_dissim=<function L1 at 0x7fa526505090>)
>>> import pickle
>>> pickle.dumps(model)
b'\x80\x04\x95\xe8\x00\x00\x00\x00\x00\x00\x00\x8c\x12kmodes.kprototypes\x94\x8c\x0bKPrototypes\x94\x93\x94)\x81\x94}\x94(\x8c\nn_clusters\x94K\x14\x8c\x08max_iter\x94Kd\x8c\ncat_dissim\x94\x8c\x12kmodes.util.dissim\x94\x8c\x0fmatching_dissim\x94\x93\x94\x8c\x04init\x94\x8c\x03Cao\x94\x8c\x06n_init\x94K\n\x8c\x07verbose\x94K\x00\x8c\x0crandom_state\x94N\x8c\x06n_jobs\x94K\x01\x8c\nnum_dissim\x94\x8c\x08__main__\x94\x8c\x02L1\x94\x93\x94\x8c\x05gamma\x94K\x01ub.'
I suspect the problem lies with where you define your function. Have a look at this: https://www.pythonanywhere.com/forums/topic/27818/
A step-by-step reproducible example would help here.
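To illustrate why the place of definition matters: pickle stores a plain function by reference (its module and qualified name), not by its code, so the function has to be reachable under that name when pickling and unpickling. Below is a minimal sketch of that behaviour using only pickle and NumPy. The names `l1_dissim`, `make_nested_l1` and `can_pickle` are made up for this illustration and are not part of kmodes.

```python
import pickle

import numpy as np


def l1_dissim(a, b):
    """Manhattan (L1) distance between each row of `a` and the vector `b`."""
    return np.sum(np.abs(a - b), axis=1)


def make_nested_l1():
    # A function defined inside another function has no importable
    # module-level name, so pickle cannot store a reference to it.
    def nested_l1(a, b):
        return np.sum(np.abs(a - b), axis=1)
    return nested_l1


def can_pickle(obj):
    """Return True if `obj` survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError):
        # Depending on the Python version, an unreachable function raises
        # either PicklingError or AttributeError.
        return False


if __name__ == "__main__":
    print("module-level function:", can_pickle(l1_dissim))
    print("nested function:", can_pickle(make_nested_l1()))
    print("lambda:", can_pickle(lambda a, b: np.sum(np.abs(a - b), axis=1)))
```

Run as a script, only the module-level function should round-trip. If the original error came from a lambda, a nested function, or a function defined in a notebook cell that is not importable where the model is loaded, moving the function into a regular module would be the first thing to try.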
A fitted k-prototypes model can't be saved with pickle when a user-defined dissimilarity metric is used (see my post on Stack Overflow).
It seems to me that the issue would be solved if that user-defined dissimilarity metric were implemented in the module itself, next to jaccard_dissim, euclidean_dissim, etc. So it would be great to have some more commonly used distance functions in the package. In my case, I'd like to be able to use L1 (Manhattan distance).
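For reference, a rough sketch of what such a function could look like. This is not existing kmodes code: the name `manhattan_dissim` is hypothetical, and the calling convention (a 2-D array of points, a single prototype vector, one distance returned per row, extra keyword arguments ignored) is an assumption about what the numerical dissimilarity hook expects.

```python
import numpy as np


def manhattan_dissim(a, b, **_):
    """Hypothetical Manhattan (L1) dissimilarity.

    a: array of shape (n_points, n_features)
    b: array of shape (n_features,), a single prototype
    Extra keyword arguments are accepted and ignored (an assumption about
    how the library calls its dissimilarity functions).
    Returns an array of shape (n_points,) with the L1 distance per row.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum of absolute coordinate differences along the feature axis.
    return np.sum(np.abs(a - b), axis=1)


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [1.0, -2.0], [3.0, 4.0]])
    prototype = np.array([1.0, 1.0])
    print(manhattan_dissim(points, prototype))
```

Kept in its own importable module, a function like this could be passed as `num_dissim=` in the same way as `L1` in the example above.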