This software is derived from Professor Wei-keng Liao's parallel k-means clustering code obtained on November 21, 2010 from http://users.eecs.northwestern.edu/~wkliao/Kmeans/index.html (http://users.eecs.northwestern.edu/~wkliao/Kmeans/simple_kmeans.tar.gz).
With his permission, I am publishing my CUDA implementation based on his code under the open-source MIT license. See the LICENSE file for more details.
For starters, run the benchmark.sh script to see how fast this code runs.
It's pretty fast! Depending on your hardware, data set, and k, you should see dramatic improvements in performance over CPU implementations.
On an 8-core 2.4 GHz Intel Xeon E5620 machine with an NVIDIA Tesla C1060 card, the CUDA implementation runs almost 50 times faster than the sequential version on the color17695.bin data set (for k = 128)!
Here's some sample output for the same data set on an 8-core 2.67 GHz Intel Core i7 920 machine with an NVIDIA GeForce GTX 480 card. For k = 128, the speedup is over 75x!
The CUDA implementation may need some tweaking to work with larger data sets, but the basic functionality is there. I've optimized the code under the general assumption that k (the number of clusters) will be relatively small compared to the number of data points.
In fact, you may run into problems if k is too big or if your data has a large number of dimensions, because there won't be enough space in block shared memory to hold all the clusters (the C1060 and GTX 480 have 16 and 48 KiB of block shared memory, respectively). If you hit this limitation, you should be able to get around it easily. Do the following:
1) Run 'make clean'.
2) Edit the Makefile. Find the line at the top of the file that looks like this:

   CFLAGS = $(OPTFLAGS) $(DFLAGS) $(INCFLAGS) -DBLOCK_SHARED_MEM_OPTIMIZATION=1

3) Set -DBLOCK_SHARED_MEM_OPTIMIZATION=0.
4) Run 'make cuda' and try again.
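Step 3 can also be done from the command line. The following is a sketch (assuming GNU sed), demonstrated on a throwaway file so it can be run anywhere; in the real source tree you would point sed at the top-level Makefile instead:

```shell
# Make a throwaway copy of the CFLAGS line for demonstration purposes.
printf 'CFLAGS = $(OPTFLAGS) $(DFLAGS) $(INCFLAGS) -DBLOCK_SHARED_MEM_OPTIMIZATION=1\n' > Makefile.demo

# Flip the optimization flag from 1 to 0 (GNU sed in-place edit):
sed -i 's/BLOCK_SHARED_MEM_OPTIMIZATION=1/BLOCK_SHARED_MEM_OPTIMIZATION=0/' Makefile.demo

# Show the result:
grep BLOCK_SHARED_MEM_OPTIMIZATION Makefile.demo
```

After editing the real Makefile this way, run 'make clean' and 'make cuda' as in steps 1 and 4.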
Please don't hesitate to contact me with any questions you may have. I'd love to help you out if you run into a problem. Of course, the more information you give me about your CUDA hardware and your data set (number of data points, dimensionality, number of clusters), the more helpful I can be.
The original README, with some additions, is reproduced below.
Cheers!
Serban Giuroiu http://serban.org
Parallel K-Means Data Clustering
The software package of parallel K-means data clustering contains the following:
To compile: Although I used the Intel C compiler (icc, version 7.1) during development, no particular features are required except for OpenMP. Thus, the implementation should be fairly portable. Please modify the Makefile to change the compiler if needed.
You will need the NVIDIA CUDA toolkit, which contains nvcc, to build the CUDA version. It works fine in concert with gcc.
To run:
The Makefile will produce the following executables:
    o "omp_main"  for the OpenMP version
    o "mpi_main"  for the MPI version
    o "cuda_main" for the CUDA version
    o "seq_main"  for the sequential version
The list of available command-line arguments can be obtained by running an executable with the -h option.
    o For example, running the command "omp_main -h" will produce:

      Usage: main [switches] -i filename -n num_clusters
             -i filename    : file containing data to be clustered
             -b             : input file is in binary format (default no)
             -n num_clusters: number of clusters (K must > 1)
             -t threshold   : threshold value (default 0.0010)
             -p nproc       : number of threads (default system allocated)
             -a             : perform atomic OpenMP pragma (default no)
             -o             : output timing results (default no)
             -d             : enable debug mode
Input file format: The executables read an input file that stores the data points to be clustered. A few example files are provided in the sub-directory ./Image_data. The input files can be in two formats: ASCII text and raw binary.
Output files: There are two output files:
Limitations:
Wei-keng Liao (wkliao@ece.northwestern.edu) EECS Department Northwestern University
Sep. 17, 2005