Status: Open. pseudotensor opened this issue 5 years ago.
Direct handling of special feature types would also be good, e.g. an internal representation for one-hot-encoded features. The column type would be something like one-hot-encoded-str, one-hot-encoded-cat, etc., produced as an apply operation. Such a column would never be materialized as a dense view, even when converted to a scipy sparse representation; only if the user requested to_numpy() or to_pandas() would it need to become dense again.
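A minimal sketch of the idea, using scipy.sparse rather than any actual or proposed datatable API: the one-hot encoding of a categorical column is held as a sparse matrix internally and only densified when the caller explicitly asks for it (the to_numpy()-style step mentioned above).

```python
# Illustrative only -- not datatable's API. A categorical column is
# one-hot encoded into a CSR matrix with one nonzero per row, and the
# dense array is built only on explicit request.
import numpy as np
from scipy import sparse

categories = np.array(["red", "green", "red", "blue", "red"])
levels, codes = np.unique(categories, return_inverse=True)

onehot = sparse.csr_matrix(
    (np.ones(len(codes)), (np.arange(len(codes)), codes)),
    shape=(len(codes), len(levels)),
)

print(onehot.nnz)         # 5 stored values instead of 5 * 3 dense cells
dense = onehot.toarray()  # densified only here, analogous to to_numpy()
```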
Many datasets can be extremely sparse, and currently datatable has no form of compression. Even simple compression or handling of sparsity would make dt much more efficient. Some datasets can't fit on disk or in memory at all because they are so large in a dense representation.
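To make the size gap concrete, here is a rough, illustrative comparison (assumed density of 0.1% chosen arbitrarily) between what a dense float64 materialization would cost and what the CSR arrays of the same data actually occupy:

```python
# Rough memory comparison: dense float64 vs. CSR for a very sparse matrix.
from scipy import sparse

X = sparse.random(100_000, 1_000, density=0.001, format="csr", random_state=0)

dense_bytes = X.shape[0] * X.shape[1] * 8                        # if materialized dense
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.0f} MB")   # ~800 MB
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")  # a few MB
```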
Then DAI could treat the data as sparse, do special sparse-aware feature engineering when mostly using dt for operations, and then pass a scipy sparse representation along to xgboost, lightgbm, and scikit-learn routines.
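Those downstream libraries already accept scipy sparse input directly, so datatable would only need to hand over a CSR matrix. A short sketch (synthetic data, assumed hyperparameters):

```python
# A scipy CSR matrix can be fed straight into the common training backends.
import numpy as np
from scipy import sparse

X = sparse.random(1_000, 50, density=0.05, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=X.shape[0])

import xgboost as xgb
dtrain = xgb.DMatrix(X, label=y)        # DMatrix accepts scipy sparse

import lightgbm as lgb
train_set = lgb.Dataset(X, label=y)     # so does lightgbm.Dataset

from sklearn.linear_model import LogisticRegression
LogisticRegression(max_iter=200).fit(X, y)  # and many scikit-learn estimators
```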