nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Add L1 as a dissimilarity function option for continuous variables #169

Closed imamilion closed 2 years ago

imamilion commented 2 years ago

A fitted k-prototypes model can't be saved with pickle when a user-defined dissimilarity function is used (see my post on Stack Overflow).

It seems to me that the issue would be solved if that user-defined dissimilarity function were implemented in the module itself, next to jaccard_dissim, euclidean_dissim, etc. It would therefore be great to have some more commonly used distance functions in the package. In my case, I'd like to be able to use L1 (Manhattan distance).

nicodv commented 2 years ago

There is such a large variety of potential distance functions to use for numerical clustering that I prefer to leave it to the users to provide them. kmodes specializes more in categorical distance functions.

But of course, feel free to submit a PR to add the function. :)

As for the pickling error, I'm not able to reproduce it:

>>> import numpy as np
>>> from kmodes.kprototypes import KPrototypes

>>> def L1(a, b):
...    return np.sum(np.abs(a-b), axis=1)

>>> model = KPrototypes(n_clusters=20, gamma=1, num_dissim=L1, init='Cao')
>>> model
KPrototypes(gamma=1, n_clusters=20, num_dissim=<function L1 at 0x7fa526505090>)

>>> import pickle
>>> pickle.dumps(model)
b'\x80\x04\x95\xe8\x00\x00\x00\x00\x00\x00\x00\x8c\x12kmodes.kprototypes\x94\x8c\x0bKPrototypes\x94\x93\x94)\x81\x94}\x94(\x8c\nn_clusters\x94K\x14\x8c\x08max_iter\x94Kd\x8c\ncat_dissim\x94\x8c\x12kmodes.util.dissim\x94\x8c\x0fmatching_dissim\x94\x93\x94\x8c\x04init\x94\x8c\x03Cao\x94\x8c\x06n_init\x94K\n\x8c\x07verbose\x94K\x00\x8c\x0crandom_state\x94N\x8c\x06n_jobs\x94K\x01\x8c\nnum_dissim\x94\x8c\x08__main__\x94\x8c\x02L1\x94\x93\x94\x8c\x05gamma\x94K\x01ub.'

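I suspect the problem lies in where you define your function. pickle stores a function by reference (its module and name, `__main__` and `L1` in the output above), so the function has to be defined at the top level of a module that can be found again when pickling and unpickling. Have a look at this: https://www.pythonanywhere.com/forums/topic/27818/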

A step-by-step reproducible example would help here.
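To illustrate the suspicion about where the function is defined, here is a minimal sketch using only the standard library. It does not use kmodes itself: `ModelStandIn`, `l1_module_level`, `make_nested_l1` and `can_pickle` are hypothetical names invented for this illustration, and the stand-in class only mimics an estimator that keeps a reference to the dissimilarity function. The point is that a function defined at module level pickles fine, while a nested function or a lambda does not, which may or may not be what happened in the original report.

```python
import pickle


def l1_module_level(a, b):
    """Manhattan (L1) distance between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def make_nested_l1():
    """Return an L1 function defined inside another function."""
    def l1_nested(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return l1_nested


class ModelStandIn:
    """Hypothetical stand-in for an estimator that stores a dissimilarity function."""

    def __init__(self, num_dissim):
        self.num_dissim = num_dissim


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # pickle cannot resolve local functions or lambdas by name
        return False


# Function reachable as <module>.l1_module_level: pickled by reference.
module_ok = can_pickle(ModelStandIn(l1_module_level))
# Function only reachable through a closure: no importable name.
nested_ok = can_pickle(ModelStandIn(make_nested_l1()))
# Lambdas have no importable name either.
lambda_ok = can_pickle(ModelStandIn(lambda a, b: 0))

print(module_ok, nested_ok, lambda_ok)
```

If the custom function has to live somewhere other than the script's top level, putting it in its own importable module and importing it from there is one way to keep it picklable, provided that module is also importable wherever the model is loaded.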