src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

passing samples data to kmeans_cuda #51

Closed pavlexander closed 5 years ago

pavlexander commented 5 years ago

Hello,

A question about the samples data type. The k-means function signature is as follows:

KMCUDAResult kmeans_cuda(
    KMCUDAInitMethod init, const void *init_params, float tolerance, float yinyang_t,
    KMCUDADistanceMetric metric, uint32_t samples_size, uint16_t features_size,
    uint32_t clusters_size, uint32_t seed, uint32_t device, int32_t device_ptrs,
    int32_t fp16x2, int32_t verbosity, const float *samples, float *centroids,
    uint32_t *assignments, float *average_distance)

where samples data is represented by 3 parameters:

as I understand:

Then what exactly is const float *samples? I can see that it is a pointer (or array) type. Does that mean it is a one-dimensional array? Where did the second dimension go?

For example, suppose that on the host machine I have the following array of samples (samples_size = 2, features_size = 3):

[
   [1, 2, 3],
   [9, 10, 11],
]

Does that mean that if I want to pass this array to kmeans_cuda, I need to flatten it into a single dimension? I.e.:

[1, 2, 3, 9, 10, 11]

Or can you actually pass a 2D array to a parameter of type const float *?
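For illustration, here is a minimal sketch of the flattening I have in mind. It assumes the samples are expected in row-major order (all features of sample 0, then all features of sample 1, and so on). flatten_samples is a hypothetical helper of my own, not part of kmcuda.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of kmcuda): copies a nested
// [samples_size][features_size] container into one contiguous
// buffer, one sample after another (row-major order).
std::vector<float> flatten_samples(const std::vector<std::vector<float>>& rows) {
    std::vector<float> flat;
    if (rows.empty()) {
        return flat;
    }
    const std::size_t features_size = rows.front().size();
    flat.reserve(rows.size() * features_size);
    for (const auto& row : rows) {
        // Every sample must have the same number of features,
        // otherwise the flat buffer cannot be read back as a table.
        if (row.size() != features_size) {
            throw std::invalid_argument("all samples must have the same number of features");
        }
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

With the two samples above, the result holds the six values 1, 2, 3, 9, 10, 11 in that order, and flat.data() yields a pointer that converts to const float *.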

pavlexander commented 5 years ago

Also, what is the dimensionality of the centroids array? I can see in R.cc that by default it is allocated as:

  auto centroids = std::unique_ptr<float[]>(new float[clusters_size * features_size]);

Is it a 1D array then?

pavlexander commented 5 years ago

I think I understand what is happening. A raw pointer carries no notion of dimensionality. For example, auto centroids = std::unique_ptr<float[]>(new float[clusters_size * features_size]); only allocates a contiguous block of memory large enough for clusters_size * features_size values. The library walks over these memory cells through the pointer and writes data into them. It is then up to the host process to interpret these values as a 2D array, a 1D array, or something else entirely.
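To make that concrete, here is a minimal sketch of how the host side could read such a flat buffer as a 2D table. It assumes a row-major layout with features_size values per row. element_at is a hypothetical accessor, not something kmcuda provides.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical accessor (not part of kmcuda): returns the value at
// (row, feature) of a flat buffer, assuming row-major layout with
// features_size consecutive values per row.
inline float element_at(const float* data, std::uint16_t features_size,
                        std::size_t row, std::size_t feature) {
    return data[row * features_size + feature];
}
```

For the buffer [1, 2, 3, 9, 10, 11] with features_size = 3, element_at(buf, 3, 1, 0) reads index 1 * 3 + 0 = 3, which is the value 9.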

C# has multi-dimensional arrays as well, and they can be iterated in a single loop:

float[,] testSet_matrix = new float[3, 4]
{
    {1,  2,  3,  4},
    {9,  10, 11, 12},
    {17, 18, 19, 20},
};
foreach (var testVal in testSet_matrix)
{
    Console.WriteLine(testVal);
}

This just prints all values sequentially, just as walking a pointer would. However, we can also access each element individually, e.g. [0,0], [0,1], etc. That's where I got confused.
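As far as I can tell, the same applies to a built-in 2D array in C++: its rows sit back to back in memory, so copying its bytes into a flat array gives the values in the same order the C# loop prints them. A minimal sketch, where flatten_3x4 is a hypothetical helper fixed to the 3x4 shape used above:

```cpp
#include <array>
#include <cstring>

// Hypothetical helper: copies a built-in 3x4 array into a flat
// 12-element array. The bytes are copied unchanged, so the result
// shows the in-memory (row-major) order of the 2D array.
std::array<float, 12> flatten_3x4(const float (&matrix)[3][4]) {
    std::array<float, 12> flat{};
    std::memcpy(flat.data(), matrix, sizeof(float) * 3 * 4);
    return flat;
}
```

Here flat[4] holds 9, the same value as matrix[1][0], so the 2D indices are only a different way of addressing the same memory.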

I will close this thread tomorrow, in case any further comments follow before then.