haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

How to use SparseDataSet with RandomForest and GradientTreeBoost? #694

Closed: jeyendranbalakrishnan closed this issue 2 years ago

jeyendranbalakrishnan commented 2 years ago

I have a fairly large dataset with close to a million features and about a hundred thousand samples, and I need to train a RandomForest or GradientTreeBoost model on it for a classification problem. Stored as a dense double[][], the dataset needs more than 64 GB of memory, which is beyond my budget. The features are sparse, however, and I verified that with SparseDataSet the data fits in under 32 GB, which is within my budget, so I would like to use that representation. The problem is that all the fit methods in smile.classification.RandomForest and smile.classification.GradientTreeBoost accept only a DataFrame as input. How can I convert a SparseDataSet into a DataFrame to pass to these fit methods? Thanks a lot!

haifengl commented 2 years ago

DataFrame is dense. There is no way to store your data in this way.
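
For a rough sense of scale, here is a back-of-the-envelope sketch in plain Java (no Smile calls). It assumes the round figures from the question (100,000 samples, 1,000,000 features). The non-zero count per sample and the 12-bytes-per-entry sparse layout (one int index plus one double value) are hypothetical, chosen only for illustration, and the class and method names are made up for this example. Object and array header overhead is ignored.

```java
public class DenseVsSparseMemory {

    /** Bytes needed for the raw values of a dense samples x features double matrix. */
    public static long denseBytes(long samples, long features) {
        return samples * features * Double.BYTES;
    }

    /**
     * Rough bytes for a sparse layout that stores one int column index and
     * one double value per non-zero entry. This layout is an assumption for
     * illustration, not a description of any library's internal format.
     */
    public static long sparseBytes(long nonZeros) {
        return nonZeros * (Integer.BYTES + Double.BYTES);
    }

    public static void main(String[] args) {
        long samples = 100_000L;        // "about a hundred thousand samples"
        long features = 1_000_000L;     // "close to a million features"
        long nonZerosPerSample = 1_000; // hypothetical: 0.1% of features are non-zero

        long dense = denseBytes(samples, features);
        long sparse = sparseBytes(samples * nonZerosPerSample);

        System.out.println("dense bytes:  " + dense);
        System.out.println("sparse bytes: " + sparse);
    }
}
```

Under these assumptions the dense matrix needs 8 bytes for each of 10^11 cells, roughly 800 GB, while the sparse layout needs about 1.2 GB. The real sparse footprint depends on the actual density of the data and on per-object overhead.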

jeyendranbalakrishnan commented 2 years ago

I see. Thanks. So there's no way to pass sparse data to fit Random Forest or GBM?

haifengl commented 2 years ago

No.