posit-dev / py-shiny

Shiny for Python
https://shiny.posit.co/py/
MIT License

feat(data frame): Support `polars` #1474

Closed machow closed 2 months ago

machow commented 3 months ago

This PR addresses #1439 by generalizing pandas.DataFrame-specific logic to also cover Polars. It adds a module for DataFrame-specific logic (_tbl_data.py) and simple tests for each piece.

From pairing with @schloerke, it seems like these next steps might be useful:

Notes:

schloerke commented 3 months ago

Notes:

  • It appears that shiny currently evaluates datetime columns to "unknown"

We currently do not have special datetime support in the browser, so we let the JSON to-string method handle it.

But we should definitely add support at some point!

machow commented 3 months ago

It seems like the JSON string method converts it to an int


opened issue here: https://github.com/posit-dev/py-shiny/issues/1477
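For reference, the int is pandas' default epoch serialization. A minimal sketch (using a hypothetical column name `x`) showing why the browser receives an integer rather than a date string:

```python
import pandas as pd

# pandas' .to_json() defaults to date_format="epoch": datetime values are
# encoded as integer milliseconds since the Unix epoch.
df = pd.DataFrame({"x": pd.to_datetime(["2020-01-01"])})
print(df.to_json())
# -> {"x":{"0":1577836800000}}
```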

MarcoGorelli commented 3 months ago

Evaluate subbing in narwhals for _tbl_data.py logic (cc @MarcoGorelli)

Thanks for the ping! Happy to help here, please do let me know if there's anything missing which you'd need or which would make your work easier

As a potential consumer, I'd be particularly interested in your thoughts on the StableAPI proposal here: https://github.com/narwhals-dev/narwhals/issues/259#issuecomment-2184941233. The objective is to guarantee that, as long as consumers use StableAPI pinned to a given version, we will never break any public function they use. The idea is inspired by Rust's Editions.

machow commented 3 months ago

From pairing w/ Barret, we're able to serialize pandas datetime series to ISO 8601 strings. There's a note in the code about possible results from pandas' infer_dtype() function. A key point is that, as far as I know, it's responsible for inferring types from lists, so some of the possible outputs are induced by list[some_type].

# investigating the different outputs from infer_dtype
import pandas as pd
from datetime import date, datetime

val_per = pd.Period("2012-1-1", freq="D")

col_per = pd.Series([val_per])
col_date = [date.today()]
col_datetime = [datetime.now()]
col_datetime64 = pd.Series(pd.to_datetime(["2020-01-01"]))

pd.api.types.infer_dtype(col_per)           # "period"
pd.api.types.infer_dtype(col_date)          # "date"
pd.api.types.infer_dtype(col_datetime)      # "datetime"
pd.api.types.infer_dtype(col_datetime64)    # "datetime64"
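One way to produce the ISO 8601 strings mentioned above is pandas' own `date_format="iso"` option on `.to_json()`; a minimal sketch (this may not be the exact mechanism the PR uses):

```python
import pandas as pd

df = pd.DataFrame({"x": pd.to_datetime(["2020-01-01"])})

# date_format="iso" switches the default epoch-millisecond encoding
# to ISO 8601 strings.
print(df.to_json(date_format="iso"))
# e.g. {"x":{"0":"2020-01-01T00:00:00.000"}}
```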
machow commented 3 months ago

From pairing w/ @schloerke, we discovered that pandas.DataFrame.to_json has an interesting serialization strategy for custom objects.

import pandas as pd
from dataclasses import dataclass

class C:
    x: int

    def __init__(self, x: int):
        self.x = x

    def __str__(self):
        return f"I am C({self.x})"

@dataclass
class D:
    y: int

df = pd.DataFrame({"x": [C(1), D(2)]})
df.to_json()
#>  {"x":{"0":{"x":1},"1":{"y":2}}}

Notice that it somehow serialized C(1) to {"x": 1}. This is because it appears to use C(1).__dict__. However, you can pass a default handler to override this behavior:

df.to_json(default_handler=str)
#> {"x":{"0":"I am C(1)","1":"D(y=2)"}}

Notice that the outputs are now the result of calling .__str__(). Critically, .to_json() passes things like built-in dictionaries and lists through, so default_handler does not affect them.

pd.DataFrame({"x": [{"A": 1}, [8, 9]]}).to_json()
#> {"x":{"0":{"A":1},"1":[8,9]}}
machow commented 3 months ago

Alright, handing off to @schloerke. There are two outstanding pieces:

More on swapping in narwhals

Currently, _tbl_data.py is meant to hold all DataFrame-specific logic. This makes it easy to communicate and test which DataFrame actions we need. That way, we can start refactoring onto narwhals, while manually adding extra bits of DataFrame logic as needed.

For example, @schloerke mentioned needing a get_column_names() function.

from functools import singledispatch
from typing import Any

import narwhals as nw
import pandas as pd
import polars as pl

DataFrameLike = Any  # placeholder for the union of supported frame types

@singledispatch
def get_column_names(data: DataFrameLike) -> list[str]:
    raise TypeError(f"Unsupported data type: {type(data)}")

@get_column_names.register
def _(data: pd.DataFrame) -> list[str]:
    # note that technically column names don't have to be strings in Pandas
    # so you might add validation, etc.. here
    return list(data.columns)

@get_column_names.register
def _(data: pl.DataFrame) -> list[str]:
    return data.columns

@get_column_names.register
def _(data: nw.DataFrame) -> list[str]:
    return data.columns

Notice that once narwhals is fully wired up everywhere, we can just always wrap inputs to these functions in narwhals.DataFrame. This is a classic refactoring approach (the strangler fig pattern): we wrap the old system and keep things rolling until we can fully switch the new one on (e.g. everything goes through narwhals).

The number 1 reason, IMO, for not going directly to narwhals is that the requirements on the shiny side still need to be fleshed out. I think it'll help to first flesh out support for pandas DataFrames, and add test cases against both Polars and pandas, to indicate when a refactor has succeeded (or is blocked).
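The strangler-fig wiring described above can be sketched without narwhals: dispatch on the native type, but funnel secondary registrations through one canonical implementation. In this illustrative sketch, pandas stands in for the eventual narwhals wrapper, and a dict-of-lists plays the "old" representation:

```python
from functools import singledispatch

import pandas as pd

@singledispatch
def get_column_names(data) -> list[str]:
    raise TypeError(f"Unsupported data type: {type(data)}")

@get_column_names.register
def _(data: pd.DataFrame) -> list[str]:
    # canonical implementation; in the eventual refactor, other frame
    # types would be wrapped (e.g. via narwhals) and routed here
    return list(data.columns)

@get_column_names.register
def _(data: dict) -> list[str]:
    # strangler-fig step: wrap the legacy representation in the
    # canonical one and delegate, instead of duplicating logic
    return get_column_names(pd.DataFrame(data))

print(get_column_names(pd.DataFrame({"a": [1], "b": [2]})))  # ['a', 'b']
print(get_column_names({"a": [1], "b": [2]}))                # ['a', 'b']
```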

machow commented 2 months ago

From chatting with @schloerke, I glanced over the code just now and it LGTM. I think the _pandas.py module could be moved into _tbl_data.py, so there's a very clear rule: all pandas and polars logic lives in one place. But IMO having an extra file is reasonable.

I didn't know enough about a lot of the shiny bits to have an opinion on things outside _tbl_data.py 😅