haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Need ReadMe guide on model training, model merging, model validation and model serialisation #737

Closed: ceecer1 closed this issue 1 year ago

ceecer1 commented 1 year ago

With version 3.0.0, it is easy to get completely lost on the fundamentals of model merging, validation and serialisation. Please clarify these small steps in a simple README doc. A sample k-means clustering example covering model training, merging, validation and serialisation would be fine.

haifengl commented 1 year ago

http://haifengl.github.io/validation.html

ceecer1 commented 1 year ago

Thanks for the validation link. I am also very curious about the parallel training capability: a divide-and-conquer approach for big datasets, where models are trained on partitions in parallel and then merged back into a single model.

haifengl commented 1 year ago

Smile is fully multithreaded and uses an advanced algorithm for k-means. If you can load your data into memory, Smile should be able to handle it. If the data is too big to fit into memory, however, simply splitting the data and merging the results does not make mathematical sense. There is an algorithm designed for large data: https://proceedings.neurips.cc/paper/2011/file/52c670999cdef4b09eb656850da777c4-Paper.pdf. Smile doesn't support it (yet).
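
To make the objection to splitting and merging concrete, here is a minimal, self-contained sketch. It does not use Smile: `NaiveMergeSketch`, `kmeans2` and `averageCentroids` are hypothetical names, the k-means is a toy 1-D Lloyd's loop with k = 2, and the split is deliberately unbalanced (each partition sees only one of the two groups). Averaging the per-partition centroids is only one possible naive merge, chosen here for illustration.

```java
import java.util.Arrays;

// Illustrative only: a toy 1-D Lloyd's k-means with k = 2. Not Smile's implementation.
class NaiveMergeSketch {
    /** Runs Lloyd's algorithm with k = 2, seeding the centroids at the min and max value. */
    static double[] kmeans2(double[] x) {
        double lo = Arrays.stream(x).min().getAsDouble();
        double hi = Arrays.stream(x).max().getAsDouble();
        double[] c = {lo, hi};
        for (int iter = 0; iter < 100; iter++) {
            double[] sum = new double[2];
            int[] n = new int[2];
            // Assignment step: each value goes to its nearest centroid.
            for (double v : x) {
                int j = Math.abs(v - c[0]) <= Math.abs(v - c[1]) ? 0 : 1;
                sum[j] += v;
                n[j]++;
            }
            // Update step: each centroid becomes the mean of its members
            // (an empty cluster keeps its previous centroid).
            double[] next = {
                n[0] > 0 ? sum[0] / n[0] : c[0],
                n[1] > 0 ? sum[1] / n[1] : c[1]
            };
            if (Arrays.equals(next, c)) break; // converged
            c = next;
        }
        return c;
    }

    /** Naive merge (an assumption for illustration): average centroids index by index. */
    static double[] averageCentroids(double[] a, double[] b) {
        return new double[] {(a[0] + b[0]) / 2, (a[1] + b[1]) / 2};
    }

    public static void main(String[] args) {
        double[] partA = {0, 0, 1, 1};     // partition that only sees the low group
        double[] partB = {10, 10, 11, 11}; // partition that only sees the high group
        double[] full = {0, 0, 1, 1, 10, 10, 11, 11};

        // Clustering the full data: prints [0.5, 10.5]
        System.out.println(Arrays.toString(kmeans2(full)));
        // Clustering each partition, then averaging: prints [5.0, 6.0]
        System.out.println(Arrays.toString(averageCentroids(kmeans2(partA), kmeans2(partB))));
    }
}
```

In this contrived case the merged centroids land between the two groups and describe neither of them, because each partition's k = 2 solution splits a single group and the centroid indices have no shared meaning across partitions.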

haifengl commented 1 year ago

Alternatively, we suggest you look into the algorithms in the smile.vq package. Algorithms such as BIRCH and SOM are online learning algorithms: they process data one sample at a time and thus don't require the full dataset in memory. They produce results similar to k-means.
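
As a rough illustration of what processing data one sample at a time means, here is a minimal sketch of sequential (MacQueen-style) k-means. This is a swapped-in technique for illustration, not BIRCH or SOM, and it does not use the smile.vq API: the class `OnlineQuantizerSketch` and its methods are hypothetical names, and the seed centroids are assumed to be supplied by the caller.

```java
// Illustrative only: sequential (MacQueen-style) k-means, not the smile.vq API.
class OnlineQuantizerSketch {
    private final double[][] centroids;
    private final int[] counts; // number of samples absorbed by each centroid

    /** Seeds act as starting positions only; they carry no weight in the running mean. */
    OnlineQuantizerSketch(double[][] seeds) {
        centroids = new double[seeds.length][];
        for (int i = 0; i < seeds.length; i++) {
            centroids[i] = seeds[i].clone();
        }
        counts = new int[seeds.length];
    }

    /** Returns the index of the centroid closest to x (squared Euclidean distance). */
    int nearest(double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centroids.length; j++) {
            double dist = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centroids[j][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }

    /** Absorbs one sample: moves the nearest centroid to the running mean of its samples. */
    void update(double[] x) {
        int j = nearest(x);
        counts[j]++;
        double rate = 1.0 / counts[j];
        for (int d = 0; d < x.length; d++) {
            centroids[j][d] += rate * (x[d] - centroids[j][d]);
        }
    }

    double[] centroid(int j) {
        return centroids[j].clone();
    }

    public static void main(String[] args) {
        OnlineQuantizerSketch q = new OnlineQuantizerSketch(new double[][] {{0}, {10}});
        // In practice the samples would be read from disk or a stream, one at a time.
        double[][] stream = {{1}, {9}, {3}, {11}};
        for (double[] x : stream) {
            q.update(x);
        }
        System.out.println(q.centroid(0)[0]); // prints 2.0
        System.out.println(q.centroid(1)[0]); // prints 10.0
    }
}
```

Only the centroids and their counts are kept in memory, so the memory footprint is independent of the number of samples seen.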