Closed david-256 closed 1 year ago
I took a quick look at this and just realised that `.to_dummies` returns a datatype of `pl.UInt8`.
As the result of `.to_dummies` can be either 0 or 1, would it make sense to use a `pl.Boolean` data type instead?
That could reduce the size of the columns by around 8 times. Unless there are other reasons not to use a Boolean data type?
@lucazanna see #8555. To quote @ritchie46:
As `to_dummies` will almost always be used in a machine learning algorithm which cannot deal with bitpacked data, this would lead to a redundant copy. Whereas `uint8` matches the binary representation of all numerical libraries.
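To make the trade-off in that quote concrete, here is a minimal sketch in plain Python (no Polars involved). It assumes a bit-packed layout of 8 flags per byte, least significant bit first; `pack_bits` and `unpack_bits` are hypothetical helper names made up for the illustration, not library functions.

```python
def pack_bits(flags):
    """Pack 0/1 flags into bytes, 8 flags per byte (least significant bit first)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, length):
    """Expand a bit-packed buffer to one byte per flag; this allocates a new buffer."""
    return bytes((packed[i // 8] >> (i % 8)) & 1 for i in range(length))


flags = [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
as_uint8 = bytes(flags)     # one byte per value, like a uint8 column
as_bits = pack_bits(flags)  # eight values per byte, like a bit-packed Boolean column

print(len(as_uint8), len(as_bits))  # prints: 16 2

# A consumer that needs one byte per value has to unpack first, which is the extra copy.
restored = unpack_bits(as_bits, len(flags))
```

The packed form is 8 times smaller, but code that expects one byte per value cannot read it directly and has to go through something like `unpack_bits` first.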
Got it. I had missed that one. Thanks for sharing, @mcrumiller.
I looked up Arrow support for sparse tensors. Only the C++ implementation seems to have it, and it's experimental for now: https://arrow.apache.org/docs/cpp/api/tensor.html
Once the Rust implementation of Arrow supports sparse tensors, is there anything that would speak against supporting them in Polars?
I would like to be able to create a TensorFlow SparseTensor from a sparse Polars tensor.
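For context, a TensorFlow SparseTensor is built from three pieces: `indices`, `values` and `dense_shape`. Below is a rough sketch in plain Python of how those pieces could be derived for a one-hot encoded column. `one_hot_coo` is a hypothetical helper written for this illustration; it is not part of Polars, Arrow or TensorFlow, and the column order (first appearance) is an assumption.

```python
def one_hot_coo(categories):
    """Return (indices, values, dense_shape) for the one-hot encoding of `categories`."""
    # Give each distinct value a column position, in order of first appearance.
    column_of = {}
    for value in categories:
        column_of.setdefault(value, len(column_of))

    # One [row, column] pair per input row: the only non-zero cell in that row.
    indices = [[row, column_of[value]] for row, value in enumerate(categories)]
    values = [1] * len(categories)
    dense_shape = [len(categories), len(column_of)]
    return indices, values, dense_shape


indices, values, dense_shape = one_hot_coo(["a", "b", "a", "c"])
print(indices)      # prints: [[0, 0], [1, 1], [2, 0], [3, 2]]
print(dense_shape)  # prints: [4, 3]
```

The three results are plain Python lists, so they would still have to be handed to TensorFlow separately.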
Yes, a lot of added complexity and bloat. All operations would need to be added for this specific array and all numeric datatypes.
Polars is not a tensor library, nor does it aim to be.
Is there any way to operate like in pandas? https://pandas.pydata.org/docs/user_guide/sparse.html
For us there is value in having NumPy-like data while keeping the metadata in indexes and columns. Can we adopt Polars, and if yes, how?
What
I want a `Series` whose values are mostly zero to require less main memory.
Why
Most columns of the dataset for my machine learning project are categorical (e.g. customer id for an order).
I turn each of these categorical columns into many columns using one-hot encoding (via the `to_dummies` method).
This causes the DataFrame to be much bigger than it should be.
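To illustrate the size problem, here is a small plain-Python sketch (it does not call Polars). `one_hot_dense` is a hypothetical stand-in for a dense one-hot encoding, and the customer ids are made-up sample data.

```python
def one_hot_dense(categories):
    """Return (column_names, rows) for a dense one-hot encoding of `categories`."""
    names = sorted(set(categories))
    rows = [[1 if value == name else 0 for name in names] for value in categories]
    return names, rows


customer_ids = ["c1", "c2", "c3", "c1", "c4", "c2"]
names, rows = one_hot_dense(customer_ids)

total_cells = len(rows) * len(names)           # rows x distinct values
nonzero_cells = sum(sum(row) for row in rows)  # exactly one per input row
print(total_cells, nonzero_cells)  # prints: 24 6
```

The dense result grows with the number of rows times the number of distinct values, while only one cell per row is non-zero.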
Example API
This should create a `DataFrame` with a sparse `Series` for each distinct value in `categorical_columns`.
Similar implementations
Pandas supports sparse data structures.
The above example is inspired by the Pandas `get_dummies` method.
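As a rough illustration of what a sparse `Series` could store, here is a minimal sketch in plain Python. `SparseColumn` and `sparse_dummies` are hypothetical names invented for this example, and the `<name>_<value>` column naming is an assumption. None of this is an existing or planned Polars API.

```python
from dataclasses import dataclass


@dataclass
class SparseColumn:
    """Hypothetical sparse 0/1 column: keeps only the row positions that hold a 1."""
    length: int
    nonzero_rows: list

    def to_dense(self):
        """Materialise the full column, one value per row."""
        dense = [0] * self.length
        for row in self.nonzero_rows:
            dense[row] = 1
        return dense


def sparse_dummies(name, categories):
    """Hypothetical helper: one SparseColumn per distinct value of `categories`."""
    columns = {}
    for row, value in enumerate(categories):
        key = f"{name}_{value}"
        column = columns.setdefault(key, SparseColumn(len(categories), []))
        column.nonzero_rows.append(row)
    return columns


dummies = sparse_dummies("customer", ["c1", "c2", "c1", "c3"])
stored = sum(len(col.nonzero_rows) for col in dummies.values())
print(stored, len(dummies) * 4)  # prints: 4 12
```

Here the three columns together keep 4 row positions instead of 12 dense cells, and `to_dense` recovers the usual one-hot column when a consumer needs it.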