jdermody / brightwire

Bright Wire is an open source machine learning library for .NET with GPU support (via CUDA)
https://github.com/jdermody/brightwire/wiki
MIT License

k-means clustering on an array of doubles #21

Closed pavlexander closed 6 years ago

pavlexander commented 6 years ago

Hi,

I could not find any documentation on how an array of doubles (or a similar data set) could be clustered using k-means. For example, if I were to find 2 clusters in this data:

        double[][] test = new[]
        {
            new double[] { 1, 3 },
            new double[] { 1.1, 2.9 },
            new double[] { 0.9, 2.85 },
            new double[] { 0.9, 4 }
        };

What would the code then look like?

The expected result is that the arrays at indexes 0, 1, 2 are in cluster 1, whereas the index 3 data is in cluster 2. This library seems to be the only one that supports CUDA, so I am looking forward to running more tests!

Eventually, I will have a data set consisting of 600,000 records and ~10k clusters, and it takes ages to execute the k-means clustering algorithm on the CPU.

Any help is appreciated.

BR

pavlexander commented 6 years ago

Alright!

Seems like I was able to do it. Here is the code, for anyone who has the same problem:

float[][] rawData = new[]
{
    new float[] { 1, 3 },
    new float[] { 1.1f, 2.9f },
    new float[] { 0.9f, 2.85f },
    new float[] { 0.9f, 4 }
};

using (var lap = BrightWireProvider.CreateLinearAlgebra())
{
    List<IVector> vectorList = new List<IVector>()
    {
        lap.CreateVector(rawData[0]),
        lap.CreateVector(rawData[1]),
        lap.CreateVector(rawData[2]),
        lap.CreateVector(rawData[3])
    };

    var clustering = vectorList.KMeans(2);
}
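
For completeness, this is roughly how the result could be inspected inside the same using block. I am assuming here (I have not verified the exact return type) that KMeans returns one group of IVector per cluster:

var clusterIndex = 0;
foreach (var cluster in clustering) {
    Console.WriteLine($"Cluster {clusterIndex++}:");
    foreach (var vector in cluster)
        Console.WriteLine("  " + vector); // output depends on the IVector ToString() implementation
}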

A few more questions:

  1. Is it the case that, in order to run on the GPU, I would need to use BrightWireGpuProvider?

  2. If I were to recompile the .cu file in a CUDA 9.2 project, instructions 60/60, would I then get a significant advantage over the default ptx/cubin? And would the code work at all? My GPU supports the latest instructions just fine.

  3. Continuing from the previous question: I think that for this to work I would need the latest version of ManagedCuda. I can see that the latest version on NuGet is _80, whereas there have been some updates in the master repository adding CUDA 9.x support. Can you confirm this to be the case? I don't mind rebuilding the BrightWire library with the latest ManagedCuda library if it leads to better results, but I don't know if there is any point in it at all.

  4. Do I need to consider normalizing the data, which is already in numerical format? If so, does BrightWire provide any out-of-the-box methods for this? See my float array for reference; that is the kind of data I am talking about. I am not interested in normalizing categories or text, so to speak.

pavlexander commented 6 years ago

I could not wait for replies and decided to go straight to testing. I measured the time needed to cluster 2k vectors into 100 clusters:

using (var lap = BrightWireProvider.CreateLinearAlgebra()) // cpu
//using (var lap = BrightWireGpuProvider.CreateLinearAlgebra(cudaKernelPath: @"C:\Users\xxx\\kernel.ptx")) // CUDA 9.2 compiled kernel, compute_62,sm_62;
//using (var lap = BrightWireGpuProvider.CreateLinearAlgebra()) // default kernel
{
    List<IVector> vectorList = Load2kVectors();
    var clustering = vectorList.KMeans(100);
}

The results are:

  1. CPU time: 10419 ms
  2. GPU (Cuda 9.2, compute_62,sm_62): 45293 ms
  3. GPU (default PTX): 48500 ms

My GPU is by no means the slowest part of the PC, but, as you can see, the CPU beats the GPU by a 5x margin. The difference increases as I increase the vector count. Why is that?

P.S. I didn't have CUDA 8.0 installed, so I had to recompile the BrightWire libraries with the latest ManagedCuda + CudaBlas + CudaSolve.

P.P.S. My PC specs are: Intel 8700K, GeForce 1080 Ti 11 GB.

EDIT:

I just tried yet another library; the processing time is even lower on the CPU (there is no GPU version):

// KMeans here is Accord.MachineLearning.KMeans
double[][] mdimArray = Load2kRecordsDoublesArray();
KMeans kmeans = new KMeans(k: 100);
var clusters = kmeans.Learn(mdimArray);
int[] labels = clusters.Decide(mdimArray);

The result:

  1. CPU (Accord.MachineLearning): 4853 ms

I cannot tell if the clusters are the same; I haven't compared them. And I don't want to start an X vs. Y discussion here. I admire the work that the author has done! So far, for my particular clustering problem, it seems like the BrightWire library is not going to work, not while it is this slow. I would like to ask whether there is something I can do to make it faster.

jdermody commented 6 years ago

Hi

Thank you for your detailed and informative questions.

I profiled the GPU version to find out why it was taking so long. I found that it could be made faster (if you look you will find!), but not as fast as the CPU version for small vector sizes. With the code below, the GPU version on my machine outperformed the CPU version when the vector size was more than a few thousand elements, but for smaller vectors the CPU version ran faster.

using (var lap = BrightWireGpuProvider.CreateLinearAlgebra()) {
    var rand = new Random();
    var list = new List<IVector>();
    Console.Write("Loading...");
    const int VECTOR_COUNT = 1024, VECTOR_SIZE = 2048;
    for (var i = 0; i < VECTOR_COUNT; i++) {
        var vector = lap.CreateVector(VECTOR_SIZE, j => (float)rand.NextDouble());
        list.Add(vector);
    }
    Console.WriteLine("done");

    var stopwatch = new Stopwatch();
    stopwatch.Start();
    var clusters = list.KMeans(50);
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

This seems to be a standard result - the performance boost you get from the GPU only applies when processing large amounts of data (when the speed improvements of parallel processing outweigh the cost of copying data to and from the device).

As for your other question - currently BW supports normalisation within data tables, but it might be good to extend this to a list of vectors. However, in the case of k-means with Euclidean distance I would not expect normalisation to change the results.
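
For illustration, here is a minimal sketch of what per-dimension standard (z-score) normalisation over raw float arrays could look like in plain C#. This is illustrative only - it is not the current BW API, and the class and method names are hypothetical:

using System;

static class VectorNormalisationSketch
{
    // per-dimension standard (z-score) normalisation of raw float arrays:
    // subtract each dimension's mean and divide by its standard deviation
    public static float[][] Normalise(float[][] data)
    {
        var dimensions = data[0].Length;
        var mean = new float[dimensions];
        var stdDev = new float[dimensions];

        foreach (var row in data)
            for (var j = 0; j < dimensions; j++)
                mean[j] += row[j];
        for (var j = 0; j < dimensions; j++)
            mean[j] /= data.Length;

        foreach (var row in data)
            for (var j = 0; j < dimensions; j++) {
                var diff = row[j] - mean[j];
                stdDev[j] += diff * diff;
            }
        for (var j = 0; j < dimensions; j++)
            stdDev[j] = (float)Math.Sqrt(stdDev[j] / data.Length);

        var result = new float[data.Length][];
        for (var i = 0; i < data.Length; i++) {
            result[i] = new float[dimensions];
            for (var j = 0; j < dimensions; j++)
                result[i][j] = stdDev[j] > 0 ? (data[i][j] - mean[j]) / stdDev[j] : 0;
        }
        return result;
    }
}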

I'm curious though why the Accord version ran so much faster and will look into this further...

jdermody commented 6 years ago

It turned out that the k-means++ implementation wasn't caching results but was recalculating the same distances again and again. Fixing this improves performance substantially when K is large. The changes have been merged into master and will be available in the forthcoming 2.1 release.
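
To illustrate the caching idea with a generic sketch (this is not the actual Bright Wire code): k-means++ seeding only needs each point's squared distance to its nearest centroid chosen so far, so that value can be cached and updated incrementally in O(n) per new centroid instead of recomputing every point-to-centroid distance each round:

using System;
using System.Collections.Generic;

static class KMeansPlusPlusSketch
{
    static float SquaredDistance(float[] a, float[] b)
    {
        float sum = 0;
        for (var i = 0; i < a.Length; i++) {
            var d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static List<float[]> ChooseCentroids(float[][] data, int k, Random rand)
    {
        // the first centroid is chosen uniformly at random
        var centroids = new List<float[]> { data[rand.Next(data.Length)] };

        // cache: squared distance from each point to its nearest centroid so far
        var bestDistance = new float[data.Length];
        for (var i = 0; i < data.Length; i++)
            bestDistance[i] = SquaredDistance(data[i], centroids[0]);

        while (centroids.Count < k) {
            // sample the next centroid with probability proportional
            // to the cached distances (the k-means++ rule)
            float total = 0;
            foreach (var d in bestDistance)
                total += d;
            var target = (float)(rand.NextDouble() * total);
            var index = 0;
            var cumulative = bestDistance[0];
            while (cumulative < target && index < data.Length - 1)
                cumulative += bestDistance[++index];

            var next = data[index];
            centroids.Add(next);

            // incremental cache update: compare against the new centroid only,
            // instead of recomputing distances to every centroid
            for (var i = 0; i < data.Length; i++)
                bestDistance[i] = Math.Min(bestDistance[i], SquaredDistance(data[i], next));
        }
        return centroids;
    }
}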