pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add support for sparse Series #8777

Closed david-256 closed 1 year ago

david-256 commented 1 year ago

What

I want a Series whose values are mostly zero to require less main memory.


Why

Most columns of the dataset for my machine learning project are categorical (e.g. the customer ID of an order).

I turn each of these categorical columns into many columns using one-hot encoding (via the to_dummies method).

This causes the DataFrame to be much bigger than it should be.
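To make the size problem concrete, here is a rough back-of-the-envelope sketch in plain Python. The data and the 4-byte index width are assumptions chosen for illustration; the 1 byte per cell corresponds to the UInt8 dtype that to_dummies returns. It is not a measurement of Polars itself.

```python
# Hypothetical data: 10,000 orders spread over 1,000 distinct customer ids.
customer_ids = [f"c{i % 1000}" for i in range(10_000)]

n_rows = len(customer_ids)
n_categories = len(set(customer_ids))

# Dense one-hot encoding: one 1-byte cell per row and per distinct value.
dense_bytes = n_rows * n_categories * 1

# Sparse alternative (assumed layout): each row has exactly one nonzero cell,
# so storing its column position as a 4-byte integer is enough.
sparse_bytes = n_rows * 4

print(f"dense:  {dense_bytes:>12,} bytes")
print(f"sparse: {sparse_bytes:>12,} bytes")
print(f"ratio:  {dense_bytes // sparse_bytes}x")
```

The dense size grows with rows times distinct values, while the sparse size grows with rows only, which is why high-cardinality columns are the painful case.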


Example API

This

data.to_dummies(columns=categorical_columns, sparse=True)

should create a DataFrame with a sparse Series for each distinct value in categorical_columns.


Similar implementations

Pandas supports sparse data structures.

The above example is inspired by the pandas get_dummies function.

lucazanna commented 1 year ago

I took a quick look at this and just realised that .to_dummies returns a datatype of pl.UInt8.

As the result of .to_dummies can be either 0 or 1, would it make sense to use a pl.Boolean data type instead?

That could shrink the columns roughly 8x, since Boolean values are bit-packed (1 bit per value instead of the 8 bits of a UInt8). Unless there are other reasons not to use a Boolean data type?

mcrumiller commented 1 year ago

@lucazanna see #8555. To quote @ritchie46:

As to_dummies will almost always be used in a machine learning algorithm which cannot deal with bitpacked data, this would lead to a redundant copy. Whereas uint8 matches the binary representation of all numerical libraries.
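The trade-off in that quote can be illustrated with a small plain-Python sketch. The helper names pack_bits and unpack_bits are hypothetical and only mimic a bit-packed layout (least significant bit first); they are not Polars or Arrow functions.

```python
def pack_bits(flags):
    """Pack 0/1 values into bytes, 8 flags per byte, least significant bit first."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, n):
    """Expand packed bits into one byte per value; this allocates a new buffer."""
    return bytes((packed[i // 8] >> (i % 8)) & 1 for i in range(n))


flags = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1] * 8  # 80 hypothetical dummy values

as_uint8 = bytes(flags)      # one byte per value, usable as-is by numeric code
as_bits = pack_bits(flags)   # one bit per value, 8x smaller
restored = unpack_bits(as_bits, len(flags))  # the extra copy a consumer would need

print(len(as_uint8), len(as_bits), restored == as_uint8)
```

The packed form saves memory at rest, but any consumer that expects one byte per value has to materialise the unpacked buffer first, which is the redundant copy mentioned above.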

lucazanna commented 1 year ago

Got it, I had missed that one. Thanks for sharing, @mcrumiller.

I looked up Arrow support for sparse tensors: only the C++ implementation seems to have it, and it is experimental for now. See https://arrow.apache.org/docs/cpp/api/tensor.html

david-256 commented 1 year ago

Once the Rust implementation of Arrow supports sparse tensors, is there anything that would speak against supporting them in Polars?

I would like to be able to create a TensorFlow SparseTensor from a sparse Polars tensor.
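For reference, a sparse tensor in coordinate (COO) form is described by three pieces: the positions of the nonzero cells, their values, and the dense shape. The sketch below derives these from a categorical column in plain Python, without any sparse support in Polars. The sample data and the to_dense helper are hypothetical and exist only to check the result.

```python
# Hypothetical categorical column; in practice this would come from a DataFrame.
customer_ids = ["a", "c", "a", "b"]

# Map each distinct value to a column position in the one-hot matrix.
vocab = {value: col for col, value in enumerate(sorted(set(customer_ids)))}

# COO representation: one [row, column] pair per nonzero cell, in row order.
indices = [[row, vocab[value]] for row, value in enumerate(customer_ids)]
values = [1] * len(customer_ids)
dense_shape = [len(customer_ids), len(vocab)]


def to_dense(indices, values, dense_shape):
    """Expand the COO triplet into a dense list of rows (for checking only)."""
    n_rows, n_cols = dense_shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


print(indices)
print(to_dense(indices, values, dense_shape))
```

These three lists have the same roles as the indices, values, and dense_shape arguments of tf.sparse.SparseTensor, so the one-hot matrix never needs to be materialised densely on the way to TensorFlow.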

ritchie46 commented 1 year ago

Yes, a lot of added complexity and bloat. All operations would need to be added for this specific array type and for all numeric data types.

Polars is not a tensor library, nor does it aim to be.

michelkluger commented 9 months ago

Is there any way to work with sparse data as in pandas? https://pandas.pydata.org/docs/user_guide/sparse.html

For us there is value in having NumPy-like data while keeping the metadata in indexes and columns. Can we adopt Polars for this and, if so, how?