src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

K-means in C# #47

Closed - pavlexander closed this issue 6 years ago

pavlexander commented 6 years ago

Seeing that there is an API made for R - would it then be possible to use this library in C# as well?

In C# we have the managedCuda library, which can offload tasks to the GPU, so I was wondering whether this is already possible or planned - to adapt kmcuda to C# as well.

Currently, it takes me days to cluster data on the CPU..

pavlexander commented 6 years ago

If it's going to be of any value - I can provide some examples (CUDA kernel + C# code) of how the managedCuda library can be used: variables transferred from/to the device, etc.

I have tried to understand the code that you have in the sources, but it makes little to no sense to me, so perhaps I am lacking lots of knowledge to do it myself.. :( Though, if someone from the kmcuda dev team would take on this task - I could try to assist you guys by all means possible.

vmarkovtsev commented 6 years ago

Hi!

The last time I did C# was .NET 3.5, ~8 years ago, hehe. Also, my team is very far from any MS things except VSCode, so we are not going to work on .NET bindings in the future. PRs are welcome as always, though!

Back when I worked with .NET, I used P/Invoke to call any native function from C#. KMCUDA is 3 functions to cluster and 3 functions to do kNN, so it should be extremely easy to wrap them. The declarations are in kmcuda.h. No CUDA .NET bindings are required.
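
For reference, a minimal P/Invoke sketch of what such a wrapper declaration could look like. The library name, return type and exact parameter list here are assumptions that must be verified against kmcuda.h; the parameter order follows the kmeans_cuda call discussed later in this thread:

using System;
using System.Runtime.InteropServices;

static class KmcudaNative
{
    // Hypothetical declaration - verify names and types against kmcuda.h.
    // The native library file differs per platform (e.g. libKMCUDA.so / KMCUDA.dll).
    [DllImport("libKMCUDA", CallingConvention = CallingConvention.Cdecl)]
    public static extern int kmeans_cuda(
        int init, IntPtr init_params, float tolerance, float yinyang_t,
        int metric, uint samples_size, ushort features_size, uint clusters_size,
        uint seed, uint device, int device_ptrs, int fp16x2, int verbosity,
        float[,] samples, float[,] centroids, uint[] assignments,
        out float average_distance);
}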

pavlexander commented 6 years ago

Good day,

You have suggested an approach which I have not considered before.

I was talking about rewriting the C++ code base into a C# analogue, which would then execute the CUDA kernel. But you are suggesting invoking the C++ library itself, which will do all the logic, so the C# code doesn't have to care about the implementation at all.

I haven't done it before, but some quick googling showed me that this should be doable. I will have to once again review the sources of your solution and see if I can write a C++ wrapper on top of it. If you don't mind - please keep this topic open for a while. I will post an update as soon as I have concluded something.

One thing that keeps worrying me at this point: are there limitations to how much data I can transfer to an unmanaged library? Say, if I have 500000 samples to process, weighing 5 gigabytes - will I have any problems passing them all as an argument to the kmeans_cuda function? If there are no such limitations, then that's a green light.

vmarkovtsev commented 6 years ago

There should be no limitations because the memory space is shared.
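
For what it's worth, on the C# side a large managed array can be pinned so the native library reads it in place, without a multi-gigabyte copy. A sketch, assuming a raw P/Invoke entry point that takes a pointer (requires System.Runtime.InteropServices):

// Pin the samples so the GC cannot move them while native code reads them.
var handle = GCHandle.Alloc(samples, GCHandleType.Pinned);
try
{
    IntPtr samplesPtr = handle.AddrOfPinnedObject();
    // ... pass samplesPtr to the native function ...
}
finally
{
    handle.Free(); // always unpin, even if the native call throws
}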

pavlexander commented 6 years ago

@vmarkovtsev a question.

Once you have processed the data (i.e. run the kmeans_cuda method) - how do you test a new set of data against the previous results? As I understand it, the results are represented by a combination of centroids, assignments and average_distance.

The reason for asking is this:

I have just found out that you cannot use CLI wrappers in a .Net Core application. I need to be able to test data against the result set on a Linux machine, so .Net Framework is not an option.

Therefore, as a workaround, I can try running all the necessary calculations on a Windows machine, but then the result has to be preserved. The problem is that I have no idea how the new data can be tested against this result - whether I yet again have to use a CLI for that, or whether a test function can easily be written in C# so I can lose the CLI/kmcuda dependency..

Any clarification is welcomed.

vmarkovtsev commented 6 years ago

Not sure what you mean by testing a new set of data against the previous results. Test as in software unit testing or as in Machine Learning model testing?

What do you call a CLI (command line interface)?

My current understanding of your post is that you want to compare the centroids and assignments on several platforms to ensure that your C# wrapper works correctly. In this case, dump both to binary files and use cmp - also pin the random seed. Or use them as test fixtures.

pavlexander commented 6 years ago

"or as in Machine Learning model testing" - this. Let's say there are 100 samples of data for which we determined 5 clusters. Now I want to determine to which of the 5 clusters a new sample belongs. Is there a method or a way to accomplish that with the kmcuda library?

At least, that's how the libraries I have worked with operate. As an example, here is how clustering is performed in one of them:

double[][] observations =
{
    new double[] { -5, -2, -1 },
    new double[] { -5, -5, -6 },
    new double[] {  2,  1,  1 },
    new double[] {  1,  1,  2 },
    new double[] {  1,  2,  2 },
    new double[] {  3,  1,  2 },
    new double[] { 11,  5,  4 },
    new double[] { 15,  5,  6 },
    new double[] { 10,  5,  6 },
};

// Create a new K-Means algorithm with 3 clusters 
KMeans kmeans = new KMeans(3);

// Compute the algorithm
KMeansClusterCollection clusters = kmeans.Learn(observations);

var testData = new double[] { -1, -2, -1 };

// test
int clusterNr = clusters.Decide(testData);

This code serves no other purpose than to show what I understand by training and testing. From the code I see that kmcuda currently has only the training process. What I would like to see is a testing process against new samples of data, i.e. how do I know which cluster the new data belongs to?

Regarding "What do you call a CLI (command line interface)?":

CLI stands for Common Language Infrastructure. It is explained thoroughly all over the web, but in short: It is an open specification that describes the executable code and runtime environment that allows multiple high-level languages to be used on different computer platforms without being rewritten for specific architectures.

I should have used the term C++/CLI instead. C++/CLI allows developers to build wrappers on top of C++ libraries. So, instead of using P/Invoke, I can build a wrapper assembly in C++/CLI. In C++/CLI you can call into unmanaged code as if you were writing native code, and you can call into C++/CLI code from C# as if it were written in C#.

Regarding "My current understanding of your post is that you want to compare the centroids and assignments on several platforms to ensure that your C# wrapper works correctly": please check the first part of this post.

vmarkovtsev commented 6 years ago

Aha! So you need to calculate the assignments on a new portion of data. You call the same kmeans_cuda but with init=kmcudaInitMethodImport, tolerance=1.0, yinyang_t=0 and centroids set to the ones you've obtained after training. It effectively uses the existing centroids instead of initializing on new samples, does precisely one iteration of the k-means algorithm (without calculating the new centroids) and returns. The result will be in assignments. Not obvious, I agree. I need to include this in the README for sure.
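
In C# terms the same "predict" call could look roughly like this, reusing the hypothetical KmcudaNative declaration sketched earlier. The enum value of kmcudaInitMethodImport and all variable names here are assumptions; only the init/tolerance/yinyang_t values come from the explanation above:

// init = import: reuse the trained centroids; tolerance = 1.0 stops after a
// single iteration; yinyang_t = 0 disables the Yinyang refinement.
const int kmcudaInitMethodImport = 3; // assumed value - verify against kmcuda.h

uint[] assignments = new uint[newSamplesSize];
float avgDistance;
int status = KmcudaNative.kmeans_cuda(
    init: kmcudaInitMethodImport,
    init_params: IntPtr.Zero,
    tolerance: 1.0f,
    yinyang_t: 0f,
    metric: 0, // L2
    samples_size: newSamplesSize,
    features_size: featuresSize,
    clusters_size: clustersSize,
    seed: 777,
    device: 0,
    device_ptrs: -1,
    fp16x2: 0,
    verbosity: 0,
    samples: newSamples,          // the new portion of data
    centroids: trainedCentroids,  // centroids obtained during training
    assignments: assignments,     // the answer lands here
    average_distance: out avgDistance);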

C++/CLI is still alive, wow.

pavlexander commented 6 years ago

I see.. Glad to hear that it is possible! The downside of this situation is that the Linux I was talking about earlier is installed on a NUC mini PC, which can never dream of a fully-fledged GPU (and CUDA, for that matter). Therefore, taking into consideration what you said earlier - it is not going to be possible to calculate assignments in that environment, due to the lack of CUDA support (unless there is a way to force kmcuda to run on the CPU). To tell you the truth, I am not even sure if Linux supports CUDA at all, even if the PC had a good video card :) So the only option there is to run the calculations on the CPU.

I am sure that with a bit of tinkering I can get kmcuda to work from within C# code on a Windows machine, but it will then be impossible to run the same application on a Linux server, rendering my efforts useless in the long term.

Quite an unpleasant situation for me :) Luckily, I can come up with other ideas..

If there are still any ideas or suggestions in regards to my situation - I will gladly hear them out! But there are no new questions from my side.

Thank you for help so far!

vmarkovtsev commented 6 years ago

"it is then not going to be possible to calculate assignments on that environment"
"impossible to run the same application on a Linux server"

Not sure about C#, but it takes 2 lines of Python to do it on a CPU:

  1. Multiply two matrices - samples SxD and centroids DxC.
  2. Negate the resulting SxC; then, depending on the distance measure, either leave it as is (cosine) or multiply by two and add the squared norms (Euclidean).
  3. Find the argmin along the C axis, getting S assignments.

dists = -samples.dot(centroids.T)  # in case of cos measure
dists = samples.dot(samples.T) - 2 * samples.dot(centroids.T) + centroids.T.dot(centroids)  # euclidean
assignments = numpy.argmin(dists, axis=1)

So use any linear algebra lib to do the matrix multiplication efficiently and that's it.

pavlexander commented 6 years ago

That's great! So there are no obstacles then.

I'd like to first try to determine the assignments in Python as a proof of concept, and then, based on the same data, migrate the same code to C#, so that when I get to the last stage there will be no issues. Here is how far I got:

import numpy

samples = numpy.array([
    [1, 2, 3], #sample 1
    [9, 10, 11],  #sample 2
    [5, 4, 6] #sample 3
    ])

print "Samples:\n"
print samples
print "Samples transposed:\n"
print samples.T

centroids = numpy.array([
    [0.89, 1.23, 1.23],  # cluster 1
    [0.11, 2.28, 5.33]  # cluster 2
    ])
print "Centroids:\n"
print centroids
print "Centroids transposed:\n"
print centroids.T

print "Distances calculation\n"

r1 = samples.dot(samples.T)
print "r1: " , r1
r2 = samples.dot(centroids.T)
print "r2: " , r2
r3 = centroids.T.dot(centroids)
print "r3: " , r3

dists = r1 - 2 * r2 + r3 # euclidean
print "dists: " , dists , '\n' 

print 'Assignments:'
assignments = numpy.argmin(dists, axis=1)
print assignments

but there is a ValueError: operands could not be broadcast together with shapes (2,2) (3,3) error at the dists calculation step. Seems like I have messed up the dimensions somehow. Three samples were provided, along with 2 predefined centroids (let's assume that we have previously run k-means).

Per my understanding, each of the 3 data samples should be assigned either 0 or 1, representing the cluster/centroid number, i.e. the final assignments could be something like [0, 1, 1].

Could you take a look at the provided sample and tell me at what point my logic stops making sense?

P.S. I'm new to Python :)

pavlexander commented 6 years ago

@vmarkovtsev any updates on the last question?

pavlexander commented 6 years ago

I have just come up with my own implementation, though it does not require matrices.

The algorithm is as follows, for each sample of data:

  1. Calculate the Euclidean distance between the sample and each of the centroids.
  2. Assign the sample to the centroid with the smallest distance.

Repeat for all samples of data.

Here is the function in C# for getting the distance:

        public static double GetEuclidianDistance(this float[] x, float[] y)
        {
            double sum = 0;
            for (int i = 0; i < x.Length; i++)
            {
                double diff = x[i] - y[i];
                sum += diff * diff;
            }

            // take the square root once, after the full sum is accumulated
            return Math.Sqrt(sum);
        }

Accord.Net provides a helper method for this as well; the results from my method and from Accord are the same.

var distance1 = Distance.Euclidean(single_sample, single_centroid);

I am not sure that at this point I even need matrices; though, for the sake of completeness, I can try them as well and compare results. That is, of course, after the error I described above is resolved.

vmarkovtsev commented 6 years ago

<this is to show that I am here but have not had time to read yet>

offtopic: how about creating a pull request with your C++ fixes? I will be able to review it then.

pavlexander commented 6 years ago

Once I am done with all the C++/CLI wrapping and code fixes - we'll see :)

pavlexander commented 6 years ago

Success! Just seconds ago I ran kmcuda from C#! No exceptions were thrown. :) Finally some results.

uint clustersSize = 3;
float[,] samples = new float[,]
{
    { 0.9f, 2.75f }, // g1 -> .
    { 0.9f, 2.85f }, // g1 -> g2
    { 1f, 3.1f }, // g2 -> .
    { 1.1f, 2.9f }, // g1 -> g2
    { 0.9f, 4f } // g3 -> .
};

uint samplesSize = (uint)samples.GetLength(0);
ushort featuresSize = (ushort)samples.GetLength(1);

float[,] centroids = new float[clustersSize,featuresSize];
uint[] assignments = new uint[samplesSize];
float avgDistance = 0;

/*You call the same kmeans_cuda but with
 init=kmcudaInitMethodImport, tolerance=1.0, yinyang_t=0 and centroids set to the ones you've obtained after training. 
 It effectively uses the existing centroids instead of initializing on new samples, 
 does precisely one iteration of the k-means algorithm (without calculating the new centroids) and returns.
 The result will be in assignments*/
Core3Wrapper.Clusterize(
    init: KMCUDAInitMethodM.kmcudaInitMethodPlusPlus,

    // tolerance float, if the relative number of reassignments drops below this value, algorithm stops.
    // less than 1% of the samples are reassigned in the end (0.01)
    tolerance: 0.05f, //0, //0.01f,

    // Yinyang can be turned off to save GPU memory but the slower Lloyd will be used then.
    // yinyang_t float, the relative number of cluster groups, usually 0.1. 0 disables Yinyang refinement.
    yinyang_t: 0.1f, //0.1f,

    // metric str, the name of the distance metric to use. The default is Euclidean (L2), it can be changed to "cos" to change the algorithm to Spherical K-means with the angular distance.
    // Please note that samples must be normalized in the latter case.
    metric: KMCUDADistanceMetricM.kmcudaDistanceMetricL2,
    samples_size: samplesSize,
    features_size: featuresSize,
    clusters_size: clustersSize,
    seed: 777, // seed integer, random generator seed for reproducible results.
    device: 1, // 0 = use all available CUDA devices
    device_ptrs: -1, // samples are supplied from host
    fp16x2: false, // float16x2 mode
    verbosity: 2, // verbosity integer, 0 means complete silence, 1 means mere progress logging, 2 means lots of output.
    samples: samples,
    centroids: ref centroids,
    assignments: ref assignments,
    average_distance: ref avgDistance
    );

int[] assignments2 = samples.AssignClusters(centroids);

What's more, I have compared the assignments generated by kmcuda with the assignments I got from my own function, and they are the same, meaning the Euclidean distance calculation step was a success as well.
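
For completeness, since the AssignClusters extension method itself was not posted, here is a hypothetical sketch of what it could look like, reusing the GetEuclidianDistance helper from the earlier comment:

        // Hypothetical sketch; lives in the same static class as GetEuclidianDistance.
        public static int[] AssignClusters(this float[,] samples, float[,] centroids)
        {
            int samplesCount = samples.GetLength(0);
            int features = samples.GetLength(1);
            int clusters = centroids.GetLength(0);
            var result = new int[samplesCount];

            var sample = new float[features];
            var centroid = new float[features];
            for (int i = 0; i < samplesCount; i++)
            {
                for (int f = 0; f < features; f++) sample[f] = samples[i, f];

                double best = double.MaxValue;
                for (int c = 0; c < clusters; c++)
                {
                    for (int f = 0; f < features; f++) centroid[f] = centroids[c, f];

                    // the nearest centroid by Euclidean distance wins
                    double d = sample.GetEuclidianDistance(centroid);
                    if (d < best) { best = d; result[i] = c; }
                }
            }
            return result;
        }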

Looking forward to running more tests with your library. Thank you for the guidance.

Log:

arguments: 1 0000000000000000 0.050 0.10 0 5 2 3 777 1 0 2 00000291A8CC39F0 00000291A8CC3A88 00000291A8CC3AB8 000000D4261FEB5C
reassignments threshold: 0
yinyang groups: 0
GPU #0 memory: used 2048052429 bytes (17.3%), free 9763107635 bytes, total 11811160064 bytes
GPU #0 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1, 1), (32, 8)>>> 5, 2
performing kmeans++...
done
too few clusters for this yinyang_t => Lloyd
plans: [(0, 5)] planc: [(0, 3)]
iteration 1: 5 reassignments
iteration 2: 1 reassignments
iteration 3: 0 reassignments
calculating the average distance...
return kmcudaSuccess

I needed something human-verifiable, hence there are only 3 clusters and 5 samples :)

[image: plot of the clustered samples]

pavlexander commented 6 years ago

I don't know if it's some kind of magic or an actual result, but using this (kmcuda) library it took me only 17 seconds to cluster the data with k-means:

Hardware is:

Output:

arguments: 1 0000000000000000 0.050 0.10 0 35210993 8 450 777 1 0 2 0000019A341E1040 00000199E14E3008 00000199E41E1030 000000138B5FE6E4
reassignments threshold: 1760549
yinyang groups: 45
reusing passed_yy for centroids_yy
GPU #0 memory: used 10078440653 bytes (85.3%), free 1732719411 bytes, total 11811160064 bytes
GPU #0 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1100344, 1), (8, 32)>>> 35210993, 8, xyswap
performing kmeans++...
done
running Lloyd until reassignments drop below 3873209
plans: [(0, 35210993)] planc: [(0, 450)]
iteration 1: 35210993 reassignments
iteration 2: 7770534 reassignments
iteration 3: 4485951 reassignments
iteration 4: 3241920 reassignments
transposing the samples...
transpose <<<(15, 1), (8, 32)>>> 450, 8, xyswap
performing kmeans++...
done
plans: [(0, 450)] planc: [(0, 45)]
iteration 1: 450 reassignments
iteration 2: 79 reassignments
iteration 3: 48 reassignments
iteration 4: 35 reassignments
iteration 5: 25 reassignments
iteration 6: 23 reassignments
iteration 7: 19 reassignments
iteration 8: 16 reassignments
iteration 9: 9 reassignments
transposing the samples...
transpose <<<(15, 1), (32, 8)>>> 8, 450
plans: [(0, 35210993)] planc: [(0, 450)] plang: [(0, 45)]
refreshing Yinyang bounds...
iteration 5: 2590890 reassignments
passed number: 26449969
iteration 6: 2173603 reassignments
passed number: 33043580
iteration 7: 1883710 reassignments
passed number: 34497867
iteration 8: 1671426 reassignments
calculating the average distance...
return kmcudaSuccess

In the past I used the k-means clustering mechanism of Accord.Net (a popular machine learning library in the .Net world). It took many hours to cluster data that had a lower sample count and fewer features (7)! I have just now started the same process again to give Accord.Net another try. For the sake of the experiment I will wait as long as needed, but no longer than until the evening, when I get back from work.

In the meanwhile, I would appreciate some comments on my results. In the log you will see:

running Lloyd until reassignments drop below 3873209

and also:

iteration 9: 9 reassignments
transposing the samples...
transpose <<<(15, 1), (32, 8)>>> 8, 450
plans: [(0, 35210993)] planc: [(0, 450)] plang: [(0, 45)]
refreshing Yinyang bounds...
iteration 5: 2590890 reassignments
passed number: 26449969
iteration 6: 2173603 reassignments
passed number: 33043580
iteration 7: 1883710 reassignments
passed number: 34497867
iteration 8: 1671426 reassignments

Given that my input parameters were:

                init: KMCUDAInitMethodM.kmcudaInitMethodPlusPlus,
                tolerance: 0.05f,
                yinyang_t: 0.1f,
                metric: KMCUDADistanceMetricM.kmcudaDistanceMetricL2,
                seed: 777,
                device: 1,
                device_ptrs: -1,
                fp16x2: false,
  1. What exactly are reassignments, and how do I interpret the result you see in the log? I.e., how do I know whether having 1671426 reassignments at the final iteration is good or bad?

  2. How do I use the average distance to determine the best cluster size? Should it be minimal or maximal for best results?

  3. How do I make proper use of the parameters tolerance: 0.05f and yinyang_t: 0.1f? This question may be connected to Q1.

vmarkovtsev commented 6 years ago

https://github.com/src-d/kmcuda/issues/47#issuecomment-409480936

import numpy

samples = numpy.array([
    [1, 2,  3],    #sample 1
    [9, 10, 11],  #sample 2
    [5, 4,  6]     #sample 3
])

centroids = numpy.array([
    [0.89, 1.23, 1.23],  # cluster 1
    [0.11, 2.28, 5.33]   # cluster 2
])

r1 = numpy.multiply(samples, samples).sum(axis=1)[:, numpy.newaxis]
print("r1: " , r1)
r2 = samples.dot(centroids.T)
print("r2: " , r2)
r3 = numpy.multiply(centroids, centroids).sum(axis=1)
print("r3: " , r3)

# add r1 columnwise
# add r3 rowwise
dists = r1 - 2 * r2 + r3 # euclidean
print("dists: " , dists , '\n') 

print('Assignments:')
assignments = numpy.argmin(dists, axis=1)
print(assignments)

Output

r1:  [[ 14]
 [302]
 [ 77]]
r2:  [[ 7.04 20.66]
 [33.84 82.42]
 [16.75 41.65]]
r3:  [ 3.8179 33.6194]
dists:  [[  3.7379   6.2994]
 [238.1379 170.7794]
 [ 47.3179  27.3194]] 

Assignments:
[0 1 1]
vmarkovtsev commented 6 years ago

https://github.com/src-d/kmcuda/issues/47#issuecomment-409921493

Of course you can calculate everything without matrices. Matrices are needed to reach peak performance, because the "r2" step involves a matrix multiplication, and that is hard to optimize manually (e.g. SIMD, threading, etc.). So my advice is to grab a .NET matrix lib and use it.
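
As an illustration, a sketch of that vectorized assignment step in C# with Math.NET Numerics - an assumed choice of matrix library; any other will do:

using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics.LinearAlgebra.Single;

static int[] AssignClustersVectorized(float[,] samples, float[,] centroids)
{
    var s = DenseMatrix.OfArray(samples);    // S x D
    var c = DenseMatrix.OfArray(centroids);  // C x D

    var r1 = s.PointwiseMultiply(s).RowSums();  // squared norms of samples, length S
    var r3 = c.PointwiseMultiply(c).RowSums();  // squared norms of centroids, length C
    var r2 = s * c.Transpose();                 // S x C dot products - the expensive part

    var assignments = new int[s.RowCount];
    for (int i = 0; i < s.RowCount; i++)
    {
        double best = double.MaxValue;
        for (int j = 0; j < c.RowCount; j++)
        {
            // squared Euclidean distance: ||s_i||^2 - 2 s_i.c_j + ||c_j||^2
            double d = r1[i] - 2.0 * r2[i, j] + r3[j];
            if (d < best) { best = d; assignments[i] = j; }
        }
    }
    return assignments;
}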

vmarkovtsev commented 6 years ago

https://github.com/src-d/kmcuda/issues/47#issuecomment-410140307

  1. Reassignments are how many samples changed their cluster. The lower the number, the better the quality. "Quality" is measured on the downstream problem - k-means has no idea what is best for you.

  2. Google "kmeans elbow method" - in short, sweep the cluster count and watch the average distance (see the sketch after this list).

  3. tolerance sets the reassignments threshold. yinyang_t is an optimization and normally should not be touched unless you hit memory problems (in which case it is set to 0).
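
A rough sketch of such an elbow sweep, reusing the Core3Wrapper call from the earlier comment (the parameter values mirror the training run above; only clusters_size varies, and the chosen k values are arbitrary):

// Sweep the cluster count, record the average distance for each k, then
// plot the curve and pick the "elbow" where it stops dropping steeply.
foreach (uint k in new uint[] { 10, 30, 100, 300, 450 })
{
    float[,] centroids = new float[k, featuresSize];
    uint[] assignments = new uint[samplesSize];
    float avgDistance = 0;

    Core3Wrapper.Clusterize(
        init: KMCUDAInitMethodM.kmcudaInitMethodPlusPlus,
        tolerance: 0.05f,
        yinyang_t: 0.1f,
        metric: KMCUDADistanceMetricM.kmcudaDistanceMetricL2,
        samples_size: samplesSize,
        features_size: featuresSize,
        clusters_size: k,
        seed: 777,
        device: 1,
        device_ptrs: -1,
        fp16x2: false,
        verbosity: 0,
        samples: samples,
        centroids: ref centroids,
        assignments: ref assignments,
        average_distance: ref avgDistance);

    Console.WriteLine($"k={k}: average distance = {avgDistance}");
}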

pavlexander commented 6 years ago

That's interesting.. According to the "elbow method", it is enough to have just 30 clusters for my dataset of 35 million records...

[image: elbow plot of average distance vs. cluster count]

I am not sure whether to believe it or not.. I will run some tests and see..

pavlexander commented 6 years ago

Though, if I make the graph narrower, the elbow is at around 100 clusters.

[image: the same elbow plot with a narrower cluster-count range]