Problem with max size of clusters

joshlk / k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained

https://github.com/joshlk/k-means-constrained

BSD 3-Clause "New" or "Revised" License

Problem with max size of clusters #8

Closed Arsalan-Vosough closed 3 years ago

Arsalan-Vosough commented 3 years ago

Thank you for sharing your code! I used it to cluster my data into 10 clusters with size_min = 3 and size_max = 5. Unfortunately, it sometimes returns clusters with more elements than the maximum size; for example, it sometimes gives me a cluster with 7 elements.

joshlk commented 3 years ago

Hi, great to hear that you're using it 😀.

Can you please provide a minimal working example? Thanks, Josh

Arsalan-Vosough commented 3 years ago

    Longitude  Latitude
0    0.143799  0.549696
1    0.748523  0.666809
2    0.893091  0.485969
3    0.633522  0.273117
4    0.691772  0.763385
5    0.671481  0.112269
6    0.250957  0.781550
7    0.199018  0.798926
8    0.680017  0.201779
9    0.270592  0.461235
10   0.648789  0.140139
11   0.417517  0.114667
12   0.733276  0.254028
13   0.283617  0.515177
14   0.256486  0.788757
15   0.369168  0.380070
16   0.265186  0.596243
17   0.356121  0.442192
18   0.651694  0.876345
19   0.166674  0.829551
20   0.623306  0.034364
21   0.250798  0.911847
22   0.448605  0.517670
23   0.529576  0.000000
24   0.622372  0.215839
25   0.492679  0.621276
26   0.349826  0.242467
27   0.561980  0.855117
28   0.543573  1.000000
29   0.000000  0.572787
30   0.285501  0.358724
31   0.398475  0.106590
32   1.000000  0.452500
33   0.367203  0.419650
34   0.672594  0.257735
35   0.590781  0.022893
36   0.459228  0.146675
37   0.480092  0.666456
38   0.451271  0.225341
39   0.767639  0.395854
40   0.702797  0.589130

This is my data. I normalized it with a min-max scaler and used this function to cluster it:

def k_means_cons(k,minVal,maxVal,data):

    clf = KMeansConstrained(
        n_clusters=k,
        size_min=minVal,
        size_max=maxVal,
        random_state=0,max_iter = 300)
    clf.fit(data)

    clf.cluster_centers_

    Label = clf.predict(data)
    return Label

Label = k_means_cons(10,3,5,normalized)

and it returns:

array([0, 9, 7, 1, 9, 8, 5, 5, 1, 4, 8, 3, 1, 4, 5, 4, 0, 4, 6, 5, 8, 5, 2, 8, 1, 2, 3, 6, 6, 0, 4, 3, 7, 4, 1, 8, 3, 2, 3, 7, 9])

As you can see, there are 6 elements in the cluster labelled 4.
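A quick way to confirm a violation like this is to count how often each label occurs. The sketch below is not part of the original report: `size_violations` is a hypothetical helper name, and it uses only the standard library on the label array posted above.

```python
from collections import Counter

# Labels copied from the output posted above (41 points, 10 clusters).
labels = [0, 9, 7, 1, 9, 8, 5, 5, 1, 4, 8, 3, 1, 4, 5, 4, 0, 4, 6, 5, 8,
          5, 2, 8, 1, 2, 3, 6, 6, 0, 4, 3, 7, 4, 1, 8, 3, 2, 3, 7, 9]


def size_violations(labels, size_min, size_max):
    """Return {cluster_label: size} for clusters outside [size_min, size_max]."""
    sizes = Counter(labels)
    return {c: n for c, n in sorted(sizes.items())
            if not size_min <= n <= size_max}


violations = size_violations(labels, size_min=3, size_max=5)
print(violations)  # {4: 6}
```

Note that a cluster that received no points at all never appears in the count, so this check only covers labels that actually occur.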

joshlk commented 3 years ago

Thanks. What exact normalisation did you use?

Also, which sklearn and ortools versions are you using?

Arsalan-Vosough commented 3 years ago

minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))
scaled_feature = minmax_scale.fit_transform(data)

sklearn version is 0.23.2 and ortools version is 8.1.8487

Arsalan-Vosough commented 3 years ago

I think I made it too complicated. In short, if you run the code below, it sometimes gives you a cluster with more than size_max elements:

import random

import pandas as pd

def generatedb(numberPatient):
    # Rejection sampling: keep random points that fall inside the region
    # bounded by the four lines below.
    patient=[]
    while len(patient)<=numberPatient:
        x = random.uniform(51.078418,51.701563)
        y = random.uniform(35.514715,35.901148)
        if y < -0.722386*x+72.866:
            if y > -0.7184706*x+72.576:
                if y > 0.692935*x+0.0551:
                    if y<0.549044*x+7.5495:
                        patient.append((x,y))
    dataWithcolName = pd.DataFrame(patient,columns=['Longitude', 'Latitude'])
    return dataWithcolName

def k_means_cons(k,minVal,maxVal,data):

    clf = KMeansConstrained(
        n_clusters=k,
        size_min=minVal,
        size_max=maxVal,
        random_state=0,max_iter = 300)
    clf.fit(data)

    clf.cluster_centers_

    Label = clf.predict(data)
    return Label
data0 = generatedb(38)
Label = k_means_cons(10,3,5,normalized)
Label

array([7, 9, 6, 0, 1, 0, 1, 3, 0, 5, 2, 5, 7, 2, 4, 0, 3, 5, 4, 8, 7, 0, 6, 6, 7, 4, 9, 1, 9, 4, 1, 8, 3, 9, 5, 7, 2, 1, 2, 4, 2, 8, 5, 7])

joshlk commented 3 years ago

Hi,

I determined what the issue is and it's my fault as the example on the front page of this project is wrong. So thank you for raising the issue.

So you need to use the method fit_predict instead of fit followed by predict. This is because predict assigns each point to the nearest centre without obeying the min and max constraints, whereas fit_predict does obey them. You can also access the assigned labels through the labels_ attribute after a fit. As I said, the front page of this project uses fit and then predict, so I didn't communicate this properly.
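To illustrate the difference described above, here is a toy sketch in plain Python. It is not the library's code: `nearest_centre_labels`, the sample points, and the centres are all made up for illustration. It assigns each point to its nearest centre, which is the unconstrained behaviour attributed to predict, and shows that a size limit can then be exceeded.

```python
from collections import Counter
from math import dist

# Hypothetical toy data: four points near the first centre, one near the second.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 1.0)]
centres = [(0.05, 0.05), (1.0, 1.0)]
size_max = 3  # illustrative upper bound on cluster size


def nearest_centre_labels(points, centres):
    """Assign each point to the index of its closest centre, ignoring size limits."""
    return [min(range(len(centres)), key=lambda j: dist(p, centres[j]))
            for p in points]


labels = nearest_centre_labels(points, centres)
sizes = Counter(labels)
print(labels)               # [0, 0, 0, 0, 1]
print(sizes[0] > size_max)  # True: cluster 0 holds 4 points, above the limit of 3
```

With the library itself, the fix from this comment is to take the labels from clf.fit_predict(data), or from clf.labels_ after clf.fit(data), rather than from clf.predict(data).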

As it stands, I would say the predict function does not meet expectations, so I have changed it in the latest version so that it does obey the min and max constraints. If you update to the latest version (v0.5.0), which is on PyPI, it should now work.

Thanks again for reporting this, Josh

Arsalan-Vosough commented 3 years ago

Hi,

I used fit_predict and it worked.

Thanks for your quick response.