Closed ceecer1 closed 1 year ago
Thanks for the validation link. I am very curious about parallel training: a divide-and-conquer approach for big datasets, where models are trained on separate partitions in parallel and then merged back into a single model.
Smile is fully multithreaded and uses an advanced algorithm for k-means. If you can load your data into memory, Smile should be able to handle it. If the data is too big to fit into memory, however, simply splitting the data and merging the results is not mathematically sound. There is a newer algorithm designed for large data: https://proceedings.neurips.cc/paper/2011/file/52c670999cdef4b09eb656850da777c4-Paper.pdf But Smile doesn't support it (yet).
Alternatively, we suggest you look into the algorithms in the smile.vq package. Algorithms such as BIRCH, SOM, etc. are online learning algorithms, which process data one sample at a time and thus don't require the full dataset in memory. They produce results similar to k-means.
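To make the "one sample at a time" idea concrete, here is a minimal sketch of sequential (online) k-means in plain Java. It does not use Smile's API, and it is neither BIRCH nor SOM. The class name OnlineKMeans and its methods are hypothetical, chosen only for illustration. Each sample moves its nearest centroid toward itself by a 1/count step, so only the k centroids and their counts are held in memory.

```java
import java.util.Arrays;

/** Hypothetical illustration of online k-means; NOT part of Smile. */
class OnlineKMeans {
    private final double[][] centroids;
    private final int[] counts;   // number of samples absorbed by each centroid
    private int initialized = 0;  // how many centroids have been seeded so far

    OnlineKMeans(int k, int dim) {
        centroids = new double[k][dim];
        counts = new int[k];
    }

    /** Consumes one sample and returns the index of the centroid it updated. */
    int update(double[] x) {
        // Assumption for this sketch: the first k samples seed the centroids.
        if (initialized < centroids.length) {
            centroids[initialized] = x.clone();
            counts[initialized] = 1;
            return initialized++;
        }
        int nearest = nearest(x);
        counts[nearest]++;
        // Running-mean update: centroid += (x - centroid) / count.
        double rate = 1.0 / counts[nearest];
        for (int j = 0; j < x.length; j++) {
            centroids[nearest][j] += rate * (x[j] - centroids[nearest][j]);
        }
        return nearest;
    }

    /** Index of the closest seeded centroid by squared Euclidean distance. */
    int nearest(double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < initialized; i++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - centroids[i][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    double[][] centroids() {
        return centroids;
    }

    public static void main(String[] args) {
        OnlineKMeans model = new OnlineKMeans(2, 1);
        // In practice the samples would be read from disk or a stream, never all at once.
        double[][] stream = {{0.0}, {10.0}, {1.0}, {11.0}, {2.0}, {12.0}};
        for (double[] x : stream) {
            model.update(x);
        }
        System.out.println(Arrays.deepToString(model.centroids()));
    }
}
```

Because the update only ever touches one centroid per sample, the same loop works whether the samples come from an in-memory array or from a file read record by record. The result depends on the order of the samples, which is one reason online methods give results similar to, but not identical to, batch k-means.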
With version 3.0.0, it is easy to get completely lost on the fundamentals of model merging, validation, and serialisation. Please clarify these small steps in a simple README doc; a sample k-means clustering example covering model training, merging, validation, and serialisation should be fine.