numere-org / NumeRe

Framework for numerical computations, data analysis and visualisation
https://www.numere.org
GNU General Public License v3.0

Table method for clustering data #202

Closed: numeredev closed this issue 1 month ago

numeredev commented 4 months ago

DESCRIPTION

What does your feature request improve on? Please describe.
Clustering data is a very common task and should be supported out of the box.

Describe the solution you'd like
Create a table method for clustering data, with options such as:

TAB().clustersof({nCols},nClusters,sMethod="k-means") -> {VAL}

The new method shall treat the columns as data dimensions (e.g. 3 columns make the data three-dimensional), so that each row is one tuple. The user supplies the number of target clusters via nClusters and can select the algorithm via sMethod (as in sMethod="k-means"). The method should support at least k-means; further algorithms may be added in the future. It returns, for each tuple, the id of the cluster that tuple was assigned to.
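To make the requested behaviour concrete, here is a minimal C++ sketch of k-means (Lloyd's algorithm) that maps each row to a cluster id. It is not NumeRe code: the names Point, squaredDistance and kmeansAssign are hypothetical, the iteration cap nMaxIterations, the seeding with the first nClusters rows and the 0-based cluster ids are assumptions made for the example, and a real implementation would likely choose a better seeding.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// One table row restricted to the selected columns:
// point[d] is the value in the d-th selected column.
using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
static double squaredDistance(const Point& a, const Point& b)
{
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d)
        sum += (a[d] - b[d]) * (a[d] - b[d]);
    return sum;
}

// Hypothetical helper (not part of NumeRe): Lloyd's k-means.
// Returns one 0-based cluster id per input row.
// Assumes all rows have the same number of dimensions.
std::vector<std::size_t> kmeansAssign(const std::vector<Point>& rows,
                                      std::size_t nClusters,
                                      std::size_t nMaxIterations)
{
    if (nClusters == 0 || rows.size() < nClusters)
        throw std::invalid_argument("need 1 <= nClusters <= number of rows");

    // Assumption: deterministic seeding with the first nClusters rows.
    std::vector<Point> centroids(rows.begin(),
                                 rows.begin() + static_cast<std::ptrdiff_t>(nClusters));
    std::vector<std::size_t> assignment(rows.size(), 0);

    for (std::size_t iter = 0; iter < nMaxIterations; ++iter)
    {
        bool changed = false;

        // Assignment step: attach every row to its nearest centroid.
        for (std::size_t i = 0; i < rows.size(); ++i)
        {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < nClusters; ++c)
            {
                const double dist = squaredDistance(rows[i], centroids[c]);
                if (dist < bestDist)
                {
                    bestDist = dist;
                    best = c;
                }
            }
            if (assignment[i] != best)
            {
                assignment[i] = best;
                changed = true;
            }
        }

        // Converged: no row switched clusters (the first pass always updates).
        if (!changed && iter > 0)
            break;

        // Update step: move each centroid to the mean of its rows.
        std::vector<Point> sums(nClusters, Point(rows[0].size(), 0.0));
        std::vector<std::size_t> counts(nClusters, 0);
        for (std::size_t i = 0; i < rows.size(); ++i)
        {
            const std::size_t c = assignment[i];
            ++counts[c];
            for (std::size_t d = 0; d < rows[i].size(); ++d)
                sums[c][d] += rows[i][d];
        }
        for (std::size_t c = 0; c < nClusters; ++c)
        {
            if (counts[c] == 0)
                continue; // an empty cluster keeps its previous centroid
            for (std::size_t d = 0; d < sums[c].size(); ++d)
                centroids[c][d] = sums[c][d] / static_cast<double>(counts[c]);
        }
    }
    return assignment;
}
```

Called with six two-dimensional rows that form two well-separated groups and nClusters = 2, the sketch assigns the rows near the first seed to cluster 0 and the rows near the second seed to cluster 1. Whether the table method should return 0-based or 1-based ids is left open here.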

Additional context
Add any other context or screenshots about the feature request here.

(Do not write below this line)


DEVS' SECTION

ANALYSIS

Instead of a single method covering all possible clustering algorithms, we'll provide one method per algorithm, starting with TAB().kmeansof({nCols},nClusters,nMaxIterations). The implementation shall live in memory.cpp, with an interface in dataaccess.cpp that follows the pattern of static std::string tableMethod_anova(const std::string& sTableName, std::string sMethodArguments, const std::string& sResultVectorName). The method static std::string tableMethod_binsof(const std::string& sTableName, std::string sMethodArguments, const std::string& sResultVectorName) shows how to return purely numerical results.

K-means can only run on numerical data, so the implementation must check the data types of the selected columns before clustering.
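The type check could look roughly like the following C++ sketch. The ColumnType enumeration and the function allColumnsNumerical are hypothetical stand-ins invented for illustration; NumeRe's actual column type representation is not shown in this issue and will differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a table's column types (not NumeRe's real enum).
enum class ColumnType { Numerical, String, Categorical, DateTime, Logical };

// Returns true only if at least one column is selected and every selected
// column index exists and refers to a numerical column.
bool allColumnsNumerical(const std::vector<ColumnType>& tableColumns,
                         const std::vector<std::size_t>& selectedCols)
{
    if (selectedCols.empty())
        return false;

    for (std::size_t col : selectedCols)
    {
        if (col >= tableColumns.size()
            || tableColumns[col] != ColumnType::Numerical)
            return false;
    }
    return true;
}
```

A guard like this would run before any distance computation, so that a string or categorical column in {nCols} is rejected up front instead of being silently fed into the clustering step.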

IMPLEMENTATION STEPS

(see also our Wiki for implementation guidelines)

DOCUMENTATION STEPS

(see also our Wiki for further information)

PULL REQUEST