h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

Handle basic sparsity #1644

Open pseudotensor opened 5 years ago

pseudotensor commented 5 years ago

Many data sets are extremely sparse, and datatable currently has no form of compression. Even simple compression or basic handling of sparsity would make dt much more efficient. Some datasets cannot be held on disk or in memory at all because they are so large in dense representation.

DAI could then treat such data as sparse, do special sparse-aware engineering when using mostly dt for operations, and pass a scipy sparse representation along to xgboost, lightgbm, and scikit-learn routines.
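To make the hand-off concrete, here is a minimal sketch in plain Python (not datatable's API) of how column-oriented data could be turned into the three arrays of compressed sparse row (CSR) format without building a dense matrix. The helper name `columns_to_csr` is hypothetical, and zero is assumed to be the implicit, unstored value. The resulting arrays follow the layout that `scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)` accepts.

```python
def columns_to_csr(columns):
    """Convert a list of equal-length columns into CSR arrays.

    Hypothetical helper for illustration only; zeros are treated as
    the implicit (unstored) value.
    """
    n_cols = len(columns)
    n_rows = len(columns[0]) if columns else 0

    # Gather the non-zero entries of each row. Visiting columns in order
    # keeps the column indices within every row sorted.
    rows = [[] for _ in range(n_rows)]
    for j, col in enumerate(columns):
        if len(col) != n_rows:
            raise ValueError("all columns must have the same length")
        for i, value in enumerate(col):
            if value != 0:
                rows[i].append((j, value))

    # Flatten into the three CSR arrays; indptr[i]:indptr[i + 1] is the
    # slice of data/indices that belongs to row i.
    data, indices, indptr = [], [], [0]
    for entries in rows:
        for j, value in entries:
            indices.append(j)
            data.append(value)
        indptr.append(len(data))
    return data, indices, indptr, (n_rows, n_cols)


# Three columns of four rows, mostly zeros.
columns = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
data, indices, indptr, shape = columns_to_csr(columns)
```

Only the three non-zero values are stored here, against twelve cells in the dense layout; the saving grows with the sparsity of the data.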

pseudotensor commented 5 years ago

Direct handling of special features would also be good, e.g., an internal representation of one-hot encoded features. The column type would then be one-hot-encoded-str, one-hot-encoded-cat, etc., as an apply operation. Such a column would never be materialized as a dense view, even when converting to a scipy sparse representation. It would only need to become dense again if the user requested to_numpy() or to_pandas().
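One way to picture such a column type is the sketch below, in plain Python. The class `OneHotColumn` and its methods are hypothetical and not part of datatable; the sketch assumes no missing values. The column keeps one integer code per row, emits sparse coordinate (COO) triplets directly from those codes, and builds the dense 0/1 matrix only on explicit request. The triplets match the form taken by `scipy.sparse.coo_matrix((data, (row, col)), shape=shape)`.

```python
class OneHotColumn:
    """Hypothetical one-hot-encoded column stored as category codes."""

    def __init__(self, values):
        # Sorted distinct values define the one-hot output columns.
        self.categories = sorted(set(values))
        index = {category: k for k, category in enumerate(self.categories)}
        # One small integer per row instead of one cell per category.
        self.codes = [index[value] for value in values]

    def to_coo(self):
        """Return (row, col, data, shape) with one stored entry per row."""
        n_rows = len(self.codes)
        row = list(range(n_rows))
        col = list(self.codes)
        data = [1] * n_rows
        return row, col, data, (n_rows, len(self.categories))

    def to_dense(self):
        """Materialize the dense 0/1 matrix, only when explicitly asked."""
        width = len(self.categories)
        return [[1 if k == code else 0 for k in range(width)]
                for code in self.codes]


colors = OneHotColumn(["red", "green", "red", "blue"])
row, col, data, shape = colors.to_coo()
```

The sparse export never touches `to_dense()`, which plays the role that to_numpy() or to_pandas() would play in the proposal above.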