Closed machow closed 2 months ago
Notes:

- It appears that shiny currently evaluates datetime columns to "unknown". We currently do not have special support in the browser, so we let the JSON to-string method handle it. But we should definitely add support at some point!
- It seems like the JSON string method converts datetimes to an int. Opened an issue here: https://github.com/posit-dev/py-shiny/issues/1477
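For reference, the int likely comes from pandas itself: `DataFrame.to_json` defaults to `date_format="epoch"`, which renders datetimes as epoch milliseconds, while `date_format="iso"` opts in to ISO 8601 strings. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": pd.to_datetime(["2020-01-01"])})

# default: epoch milliseconds (the "int" seen in the browser)
print(df.to_json())
#> {"x":{"0":1577836800000}}

# opt in to ISO 8601 strings instead
print(df.to_json(date_format="iso"))
```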
Evaluate subbing in narwhals for `_tbl_data.py` logic (cc @MarcoGorelli)
Thanks for the ping! Happy to help here; please do let me know if there's anything missing that you'd need, or that would make your work easier.

As a potential consumer, I'd be particularly interested in your thoughts on the `StableAPI` proposal here: https://github.com/narwhals-dev/narwhals/issues/259#issuecomment-2184941233. The objective is to guarantee that, so long as consumers use `StableAPI` with a given version, we will never break any public function they use. The idea is inspired by Rust's Editions.
From pairing w/ Barret, we're able to serialize pandas datetime series to ISO 8601 strings. There's a note in the code about possible results from pandas' `infer_dtype()` function. A key point about this function is that (afaik) it's responsible for inferring types from lists, so some of its possible outputs are induced by `list[some_type]`.
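As a sketch of the ISO 8601 direction (illustrative only; the actual serialization code in the PR may differ, and the format string here is an assumption):

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2020-01-01 12:30:00"]))

# render each timestamp as an ISO 8601 string before JSON encoding
iso = ser.dt.strftime("%Y-%m-%dT%H:%M:%S").to_list()
print(iso)
#> ['2020-01-01T12:30:00']
```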
```python
# investigating the different outputs from infer_dtype
import pandas as pd
from datetime import date, datetime

val_dt = pd.to_datetime(["2020-01-01"])
val_per = pd.Period("2012-1-1", freq="D")

col_per = pd.Series([val_per])                              # period
col_date = [date.today()]                                   # date
col_datetime = [datetime.now()]                             # datetime
col_datetime64 = pd.Series(pd.to_datetime(["2020-01-01"]))  # datetime64

pd.api.types.infer_dtype(col_date)
```
From pairing w/ @schloerke, we discovered that pandas.DataFrame.to_json has an interesting serialization strategy for custom objects.
```python
import pandas as pd
from dataclasses import dataclass

class C:
    x: int

    def __init__(self, x: int):
        self.x = x

    def __str__(self):
        return f"I am C({self.x})"

@dataclass
class D:
    y: int

df = pd.DataFrame({"x": [C(1), D(2)]})
df.to_json()
#> {"x":{"0":{"x":1},"1":{"y":2}}}
```
Notice that it somehow serialized `C(1)` to `{"x": 1}`. This is because it appears to use `C(1).__dict__`. However, you can pass a `default_handler` to override this behavior:
```python
df.to_json(default_handler=str)
#> {"x":{"0":"I am C(1)","1":"D(y=2)"}}
```
Notice that the outputs are now the result of calling `.__str__()`. Critically, `.to_json()` passes things like built-in dictionaries and lists straight through, so `default_handler` does not affect them.
```python
pd.DataFrame({"x": [{"A": 1}, [8, 9]]}).to_json()
#> {"x":{"0":{"A":1},"1":[8,9]}}
```
Alright, handing off to @schloerke. There are two outstanding pieces:

- `_data_frame.py` (and the gridtable file). The functions in `_tbl_data` are stubbed out with basic tests, and should be rough-but-ready to replace them. We need renderer tests (playwright?) to validate they work as expected when swapped in.
- `_tbl_data.py`. Once all paths work with narwhals, we could always wrap dataframes as narwhals objects (then we can eventually remove other dataframe logic from `_tbl_data.py`, or delete it).

Currently, `_tbl_data.py` is meant to hold all DataFrame-specific logic. This makes it easy to communicate and test which DataFrame actions we need. That way, we can start refactoring onto narwhals while manually adding extra bits of DataFrame logic as needed.
For example, @schloerke mentioned needing a `get_column_names()` function:
```python
from functools import singledispatch

import narwhals as nw
import pandas as pd
import polars as pl

# DataFrameLike is assumed to be a union/protocol of supported frame types,
# defined elsewhere (e.g. in _tbl_data.py).

@singledispatch
def get_column_names(data: DataFrameLike) -> list[str]:
    raise TypeError(f"Unsupported type: {type(data)}")

@get_column_names.register
def _(data: pd.DataFrame) -> list[str]:
    # note that technically column names don't have to be strings in pandas,
    # so you might add validation, etc. here
    return list(data.columns)

@get_column_names.register
def _(data: pl.DataFrame) -> list[str]:
    return data.columns

@get_column_names.register
def _(data: nw.DataFrame) -> list[str]:
    return data.columns
```
Notice that once narwhals is fully wired up everywhere, we can just always wrap inputs to these functions in `narwhals.DataFrame`. This is a classic refactor approach (the strangler fig pattern): we wrap the old system and keep things rolling until we can fully flip the new one on (e.g. everything narwhals).

The number 1 reason, IMO, for not going directly to narwhals is that the requirements on the shiny side need to be fleshed out. I think it'll help to flesh out support for pandas DataFrames, and to add test cases against Polars and pandas, to indicate when a refactor has succeeded (or is blocked).
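To make the strangler fig idea concrete, here is a toy sketch. All class names are hypothetical stand-ins, not the real pandas/Polars/narwhals types: the point is that every entry point wraps its input in the new unified interface, so callers keep working while the internals migrate.

```python
# Toy stand-ins for two "old system" frame types and the unified wrapper
# (playing the role of narwhals.DataFrame). All names here are hypothetical.
class FakePandasFrame:
    def __init__(self, cols):
        self.cols = cols

class FakePolarsFrame:
    def __init__(self, cols):
        self.cols = cols

class UnifiedFrame:
    """The 'new system': one interface over any native frame."""
    def __init__(self, native):
        self.native = native

    @property
    def columns(self) -> list[str]:
        return list(self.native.cols)

def to_unified(data) -> UnifiedFrame:
    # strangler fig: wrap whatever the old system hands us
    return UnifiedFrame(data)

def get_column_names(data) -> list[str]:
    # callers are untouched; the internals now go through the wrapper
    return to_unified(data).columns

print(get_column_names(FakePandasFrame(["a", "b"])))  #> ['a', 'b']
print(get_column_names(FakePolarsFrame(["x"])))       #> ['x']
```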
From chatting with @schloerke, I glanced over the code just now and it LGTM. I think the `_pandas.py` module could be moved into `_tbl_data.py`, so there's a very clear rule: all pandas and Polars logic lives in one place. But IMO having an extra file is reasonable.

I didn't know enough about a lot of the shiny bits to have an opinion on things outside `_tbl_data.py` 😅
This PR addresses #1439 by generalizing pandas.DataFrame-specific logic to include Polars. It adds a module for DataFrame-specific logic (`_tbl_data.py`) and simple tests for each piece.

From pairing with @schloerke, it seems like these next steps might be useful:

- `_tbl_data.py`
- Evaluate subbing in narwhals for `_tbl_data.py` logic (cc @MarcoGorelli)

Notes:

- Polars converts subclasses of `str` back up to `str`, so certain htmltools tags can't be identified in a Polars Series.