pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Aliasing Column Names in map_elements() with Dataclass Fields #17234

Open starzar opened 1 week ago

starzar commented 1 week ago

Reproducible example

from dataclasses import dataclass
import polars as pl

@dataclass
class Long:
    trader_a: int
    trader_b: int
    trader_c: int

data = {
    "Ticker": ["AAPL", "GOOGL", "MSFT"],
    "Market_Long": [
        Long(trader_a=1, trader_b= 2, trader_c=3),
        Long(trader_a=1,trader_b=2, trader_c= 3),
        Long(trader_a=1,trader_b=2, trader_c=3)
    ]
}
Total_df = pl.DataFrame(data)
print("Total_df")
print(Total_df)
print(Total_df["Market_Long"][0])

def AnalyzePositions_Alias(col_name):
    # Each element of the struct column arrives as a dict keyed by field name
    sum_shares = col_name['trader_a']+col_name['trader_b']+col_name['trader_c']
    print("sum_shares")
    print(sum_shares)

    return sum_shares

# Try 1 (aliasing column name doesn't work as map_element on dataclass field returns a dict)
TotalAnalysis_df = Total_df.select(

    "Ticker",
    # AttributeError: 'dict' object has no attribute 'alias'
    # pl.col("Market_Long").map_elements(function=AnalyzePositions_Alias,return_dtype=pl.Int32,pass_name=True).alias("Sum_Market_Long")

    pl.col("Market_Long").map_elements(function=AnalyzePositions_Alias,return_dtype=pl.Int32,pass_name=True),

    )

# Try 2 & 3 (processing column entirely in a function)

# def AnalyzePositions_returnProcessedColumn( df,col_name):
#
#   return df.select(
#       "Ticker",
#       #Try2 - TypeError: 'Expr' object is not subscriptable"
#       pl.struct([pl.col(col_name)["trader_a"], pl.col(col_name)["trader_b"], pl.col(col_name)["trader_c"]]).sum().alias("Processed_Market_Long")
#
#       #Try3 - runs, but the calculated sum is WRONG: the total over all rows is calculated instead of one sum per row
#       (pl.col(col_name).struct.field("trader_a") + pl.col(col_name).struct.field("trader_b") + pl.col(
#           col_name).struct.field("trader_c")).sum().alias("Sum_Market_Long")
# )
#
# TotalAnalysis_df = AnalyzePositions_returnProcessedColumn(df= Total_df.clone(),col_name="Market_Long")
#
# print(TotalAnalysis_df)

Log output

Total_df
shape: (3, 2)
┌────────┬─────────────┐
│ Ticker ┆ Market_Long │
│ ---    ┆ ---         │
│ str    ┆ struct[3]   │
╞════════╪═════════════╡
│ AAPL   ┆ {1,2,3}     │
│ GOOGL  ┆ {1,2,3}     │
│ MSFT   ┆ {1,2,3}     │
└────────┴─────────────┘
{'trader_a': 1, 'trader_b': 2, 'trader_c': 3}
thread 'polars-0' panicked at py-polars\src\map\series.rs:222:19:
python function failed AttributeError: 'dict' object has no attribute 'alias'
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
Traceback (most recent call last):
  File "C:\Users\User_0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\expr\expr.py", line 4690, in __call__
    result = self.function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User_0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\expr\expr.py", line 5059, in wrap_f
    return x.map_elements(
           ^^^^^^^^^^^^^^^
  File "C:\Users\User_0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\series\series.py", line 5519, in map_elements
    self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
pyo3_runtime.PanicException: python function failed AttributeError: 'dict' object has no attribute 'alias'
Traceback (most recent call last):
  File "C:\Users\User_0\Documents\Code\Python\NseScraping\Others\Polars\tests\dataclassValues.py", line 32, in <module>
    TotalAnalysis_df = Total_df.select(
                       ^^^^^^^^^^^^^^^^
  File "C:\Users\User_0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\dataframe\frame.py", line 8461, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User_0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\lazyframe\frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: python function failed AttributeError: 'dict' object has no attribute 'alias'

Issue description

Polars 0.20.31: trying to alias the processed column returned by map_elements(custom_function) raises "AttributeError: 'dict' object has no attribute 'alias'" (see Try 1 in the code).

Also, processing the column directly inside a custom function and returning it does not give the correct result (see Try 2 and Try 3 in the code). How can I access nested values in dataclass fields?

I tried three different ways to access them, but each has some issue.

Expected behavior

Show nested values in data class fields

Installed versions

```
Polars(0.20.31)
```

alexander-beedie commented 1 week ago

Definitely don't do this in a Python function ;)

I think you just want this: unnest the struct column for simplicity, and then get the horizontal sum of everything except the first field (selectors[^1] make that kind of column selection pretty easy - adjust as needed):

import polars.selectors as cs

Total_df.unnest("Market_Long").with_columns(
    Sum_Market_Long=pl.sum_horizontal(~cs.first())
)
# shape: (3, 5)
# ┌────────┬──────────┬──────────┬──────────┬─────────────────┐
# │ Ticker ┆ trader_a ┆ trader_b ┆ trader_c ┆ Sum_Market_Long │
# │ ---    ┆ ---      ┆ ---      ┆ ---      ┆ ---             │
# │ str    ┆ i64      ┆ i64      ┆ i64      ┆ i64             │
# ╞════════╪══════════╪══════════╪══════════╪═════════════════╡
# │ AAPL   ┆ 1        ┆ 2        ┆ 3        ┆ 6               │
# │ GOOGL  ┆ 1        ┆ 2        ┆ 3        ┆ 6               │
# │ MSFT   ┆ 1        ┆ 2        ┆ 3        ┆ 6               │
# └────────┴──────────┴──────────┴──────────┴─────────────────┘

[^1]: Polars selectors https://docs.pola.rs/api/python/stable/reference/selectors.html

starzar commented 1 week ago

@alexander-beedie thanks for the reply. The selector works fine for a single nested dataclass, but multidimensional dataclasses require more functionality. Are there any other ways to access deeply nested fields of multidimensional dataclasses?

alexander-beedie commented 1 week ago

You should probably just use the Struct field access, optionally in conjunction with selectors, optionally unnesting to simplify the top-level (as shown). Once it's in Polars it's not a dataclass anymore (obviously ;) it is a Struct composed from the top-level fields of the input dataclass, so it's optimal to interact with it as such.

I'd suggest providing a more specific example that's closer to your real data and asking in the Discord channel for some suggestions (I can pick this up there if you do so) 👍