pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support nested datatypes in `from_repr` #15842

Open wolliq opened 1 month ago

wolliq commented 1 month ago

Description

In many ML/NLP use cases it would be useful for `from_repr` to support list types, for example when building unit-test fixtures from a feature store that holds numerical representations such as embedding vectors. Today, if we run

import polars as pl
dfp = pl.from_repr("""
shape: (1, 1)
┌──────────────────────────────────────────────────┐
│ segment_ids                                      │
│ ---                                              │
│ list[i32]                                        │
╞══════════════════════════════════════════════════╡
│ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0] │
└──────────────────────────────────────────────────┘
""")

we get

...
raise NotImplementedError(msg)
NotImplementedError: `from_repr` does not support data type 'List'

Thanks

stinodego commented 1 month ago

Thanks for the issue. This would definitely be good to support.

tharunsuresh-code commented 1 month ago

Hey, can I take this up? I assume I would only need to support polars.datatypes.FLOAT_DTYPES and polars.datatypes.INTEGER_DTYPES inside the List, right?

I have opened a draft pull request and would appreciate any comments :) If you think I am heading in the right direction, I can work on test cases and the other functionality associated with this feature.

stinodego commented 1 month ago

> Hey, can I take this up? I assume I would need to support just polars.datatypes.FLOAT_DTYPES and polars.datatypes.INTEGER_DTYPES inside the List right?

Sure! Lists can contain anything though (also strings, decimals, ...). So it's not just constrained to floats/integers.
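
To illustrate the point that the inner type is not limited to numbers, here is a minimal sketch of parsing one printed list cell by dispatching on the inner dtype taken from the column header. The helper `parse_list_cell` and the `CONVERTERS` table are hypothetical names made up for this example and are not part of Polars. The sketch assumes that elements are separated by `", "`, that missing values print as `null`, and that strings print in double quotes.

```python
import re
from datetime import date

# Hypothetical converters keyed by the dtype names shown in a frame header.
# An illustrative subset only, not the mapping Polars uses internally.
CONVERTERS = {
    "i32": int,
    "i64": int,
    "f64": float,
    "str": lambda s: s.strip('"'),  # assumes strings are printed in double quotes
    "date": date.fromisoformat,
}


def parse_list_cell(dtype: str, cell: str) -> list:
    """Parse one printed list cell, e.g. dtype='list[i32]', cell='[0, 1]'."""
    match = re.fullmatch(r"list\[(.+)\]", dtype.strip())
    if match is None:
        raise ValueError(f"not a list dtype: {dtype!r}")
    inner = match.group(1)
    if inner not in CONVERTERS:
        raise NotImplementedError(f"unsupported inner dtype: {inner!r}")

    body = cell.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"not a complete list cell: {cell!r}")
    body = body[1:-1].strip()
    if not body:
        return []

    convert = CONVERTERS[inner]
    # Naive split: breaks on strings that themselves contain ", ".
    return [None if item == "null" else convert(item) for item in body.split(", ")]


print(parse_list_cell("list[i32]", "[0, 0, 1]"))
print(parse_list_cell("list[date]", "[2022-07-05, 2023-02-05]"))
```

A real implementation would also need nested lists, temporal types with time units, and a safer tokenizer for quoted strings.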

tharunsuresh-code commented 1 month ago

Got it, I'm working on it. I have a question about truncation in the string representation of a Polars DataFrame. Long column values are cut off as follows:

shape: (2, 3)
┌─────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ f                               ┆ g                               ┆ h                               │
│ ---                             ┆ ---                             ┆ ---                             │
│ list[date]                      ┆ list[time]                      ┆ list[datetime[ns]]              │
╞═════════════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ [2022-07-05, 2023-02-05, 2023-… ┆ [00:00:00.000001, 12:30:45, 23… ┆ [2022-07-05 10:30:45.004560, 2… │
│ [2022-07-05, 2023-02-05, 2023-… ┆ [00:00:00.000001, 12:30:45, 23… ┆ [2022-07-05 10:30:45.004560, 2… │
└─────────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Because of this, the data is truncated. Any suggestions on how I can handle this?

alexander-beedie commented 1 month ago

> Due to this, the data is truncated, any suggestion on how I can handle this?

The reasonable thing to do is load only the whole/valid data; truncated columns (when a frame has more cols than can be displayed) are similarly dropped. There is, after all, no way (at all) to reconstruct the truncated values, so...
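
One possible reading of this advice, sketched in plain Python: treat any cell that ends with the ellipsis character as truncated, and drop every column that contains such a cell. The helpers `is_truncated` and `drop_truncated_columns` are hypothetical names for this example, and dropping the whole column (rather than keeping the complete leading elements) is an assumption, not necessarily what the pull request does.

```python
ELLIPSIS = "…"  # the character shown in the output above where a value is cut off


def is_truncated(cell: str) -> bool:
    """Return True if a printed cell ends with the truncation marker."""
    return cell.rstrip().endswith(ELLIPSIS)


def drop_truncated_columns(columns: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only columns whose printed cells are all complete.

    `columns` maps a column name to its raw cell strings, as they would be
    extracted from the printed frame (hypothetical intermediate structure).
    """
    return {
        name: cells
        for name, cells in columns.items()
        if not any(is_truncated(cell) for cell in cells)
    }


cells = {
    "a": ["[1, 2]", "[3]"],
    "f": ["[2022-07-05, 2023-02-05, 2023-…", "[2022-07-05, 2023-02-05, 2023-…"],
}
print(list(drop_truncated_columns(cells)))
```

Note that this check cannot tell a truncated value from a string that genuinely ends with an ellipsis, so it errs on the side of dropping data.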

tharunsuresh-code commented 1 month ago

Got it, thanks! I have opened a pull request. Could you please review it and let me know if you have any suggestions?