canopyKmeans_improved

This is an implementation of the paper "Improved K-means algorithm based on density Canopy". The repo comes with:

- requirements.txt for installing the dependencies (pip install -r requirements.txt)
- kmeans.cfg, a configuration file in which you define the specs of the algorithm you want to run

Once you have defined what you want to run in the configuration file, run the MainLauncher.py file.

NOTE: You may get some warnings at runtime; they can be ignored.

The data part of the configuration file has the following settings:

datasdir: the directory which contains the datasets
dataset: the name of the dataset with the .arff or .csv extension, which is in the same directory as this file
algorithm: the algorithm or algorithms you want to run, separated by a dash (-) with no spaces, e.g. 1-2-3. Each algorithm corresponds to a number:
- 1: Density Canopy-Kmeans (MyCanopyKMeans) [from the aforementioned paper]
- 2: StandardCanopyKMeans
- 3: MyKMeans
- 4: SklearnKMeans
- 5: MyKMeans++
- 6: SklearnKMeans++
- 7: MyKMedoids
- 8: MyFuzzyCMeans
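
As an illustration of the algorithm setting, here is a minimal sketch of how a value such as 1-2-3 could be turned into algorithm names. The ALGORITHMS dictionary and the parse_algorithms function are hypothetical names used only for this example; the way MainLauncher.py actually resolves the numbers may differ.

```python
from typing import List

# Hypothetical lookup table built from the numbered list in this README.
ALGORITHMS = {
    1: "MyCanopyKMeans",
    2: "StandardCanopyKMeans",
    3: "MyKMeans",
    4: "SklearnKMeans",
    5: "MyKMeans++",
    6: "SklearnKMeans++",
    7: "MyKMedoids",
    8: "MyFuzzyCMeans",
}


def parse_algorithms(value: str) -> List[str]:
    """Split a dash-separated string like '1-2-3' into algorithm names."""
    names = []
    for token in value.strip().split("-"):
        number = int(token)  # raises ValueError on non-numeric tokens
        if number not in ALGORITHMS:
            raise ValueError(f"unknown algorithm number: {number}")
        names.append(ALGORITHMS[number])
    return names


print(parse_algorithms("1-2-3"))
```

Rejecting unknown numbers early is a choice made for this sketch, so that a typo in the configuration file fails before any clustering starts.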

The clustering part of the configuration file has the following settings:

normalized: boolean, to define if preprocessing should normalize the data or not (true/false)
canopyT1: a float, indicating the T1 parameter of the standard canopy algorithm
canopyT2: a float, indicating the T2 parameter of the standard canopy algorithm
remove_outliers: boolean, to define if density canopy should remove the outliers or not (true/false)
k: the number of clusters
tol: the tolerance for the convergence
max_rep: the maximum number of repetitions
kmeans_init_type: the type of initializing the centroids. The possible values are:
- random: for getting random numbers following the uniform distribution
- kmeans++: for applying KMeans++ algorithm for the initial centroids
- canopy: for specifying that there are some centroids defined
run: the way you want to run the algorithms. The possible values are:
- algorithms: for getting the index values for a specific k
- silhouette: for calculating the silhouette coefficient for 15 different values of k and then plotting the best-k graph
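
To make the settings above concrete, the following sketch reads a sample configuration with Python's standard configparser module. It rests on several assumptions: that kmeans.cfg uses INI syntax, that the two parts are sections named [data] and [clustering], and that the values shown are reasonable. All values are placeholders, and load_settings is a hypothetical helper, not a function from this repo.

```python
import configparser

# Placeholder values; the section names [data] and [clustering] are assumed.
SAMPLE_CFG = """
[data]
datasdir = datasets
dataset = example.csv
algorithm = 1-3-5

[clustering]
normalized = true
canopyT1 = 1.5
canopyT2 = 0.5
remove_outliers = false
k = 3
tol = 0.0001
max_rep = 100
kmeans_init_type = kmeans++
run = algorithms
"""

# Allowed values as listed in this README.
INIT_TYPES = {"random", "kmeans++", "canopy"}
RUN_MODES = {"algorithms", "silhouette"}


def load_settings(text):
    """Parse the configuration text into a plain dict with typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    data = parser["data"]
    clustering = parser["clustering"]
    settings = {
        "datasdir": data.get("datasdir"),
        "dataset": data.get("dataset"),
        "algorithm": data.get("algorithm"),
        "normalized": clustering.getboolean("normalized"),
        "canopyT1": clustering.getfloat("canopyT1"),
        "canopyT2": clustering.getfloat("canopyT2"),
        "remove_outliers": clustering.getboolean("remove_outliers"),
        "k": clustering.getint("k"),
        "tol": clustering.getfloat("tol"),
        "max_rep": clustering.getint("max_rep"),
        "kmeans_init_type": clustering.get("kmeans_init_type"),
        "run": clustering.get("run"),
    }
    if settings["kmeans_init_type"] not in INIT_TYPES:
        raise ValueError("kmeans_init_type must be random, kmeans++ or canopy")
    if settings["run"] not in RUN_MODES:
        raise ValueError("run must be algorithms or silhouette")
    return settings


settings = load_settings(SAMPLE_CFG)
print(settings)
```

To read a real file instead of the sample string, parser.read("kmeans.cfg") could replace read_string, assuming the file matches the layout sketched here.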
