I have a fairly large dataset with close to a million features and about a hundred thousand samples.
I need to use a RandomForest or GradientTreeBoost model to solve a classification problem using this dataset.
Using a dense double[][] implementation, my dataset uses more than 64GB which is beyond my budget. However, the features are sparse. Using SparseDataSet, I verified that my dataset fits in memory < 32 GB, which fits within my budget. So I would like to use this approach.
However, all the fit methods in smile.classification.RandomForest and smile.classification.GradientTreeBoost only accept a DataFrame as an input.
My question is: How does one convert a SparseDataSet into a DataFrame to pass to these fit methods?
Thanks a lot!
I have a fairly large dataset with close to a million features and about a hundred thousand samples. I need to use a RandomForest or GradientTreeBoost model to solve a classification problem using this dataset. Using a dense double[][] implementation, my dataset uses more than 64GB which is beyond my budget. However, the features are sparse. Using SparseDataSet, I verified that my dataset fits in memory < 32 GB, which fits within my budget. So I would like to use this approach. However, all the fit methods in smile.classification.RandomForest and smile.classification.GradientTreeBoost only accept a DataFrame as an input. My question is: How does one convert a SparseDataSet into a DataFrame to pass to these fit methods? Thanks a lot!