Open cluhmann opened 1 month ago
Last time I checked we don't really depend on pandas, it's arviz that does
Does that mean that we can use polars dataframes when using PyMC?
Does that mean that we can use polars dataframes when using PyMC
As data? Not sure, there's some special logic for handling pandas.
But PyMC does not depend on pandas, so maybe you are requesting a new feature, not a change of dependency
Btw pandas special logic is:
I don't think we can replace that by narwhals since dispatch works on types at runtime. We would need to dispatch on polars as well.
pm.data
? and observed for detecting nans and triggering automatic imputation. This could possibly be backend agnostic although that code is pretty spaghetti, so someone would need to check. 2.1 It may actually already work because IIRC it's all based on duck typing
Yeah, was taking a peek at this today -- if we were using pandas-specific functionality (merging, etc.) then it would make sense to use narwhals. For the post part we are taking DataFrames (and Series) and turning them into ndarrays. The only exception may be in deriving dims from indexes, since polars does not use indexes.
Hey, thanks for looking into Narwhals π
First, thanks for opening https://github.com/pymc-devs/pymc/pull/7463, it's great to see Polars support come along - facilitating that was one of my goals with Narwhals, and if it can happen even without it, even better πͺ
I think #7463 is already a net-positive, I just wanted to leave some comments, in case they're of interest:
nw.from_native(..., eager_only=True)
on the input, pass that to dataframe_to_tensor_variable
, and then just do:
@_as_tensor_variable.register(nw.DataFrame)
@_as_tensor_variable.register(nw.Series)
def dataframe_to_tensor_variable(df: , *args, **kwargs) -> TensorVariable:
return pt.as_tensor_variable(df.to_numpy(), *args, **kwargs)
to_numpy
is part of the Narwhals API, and nw.from_native
is a practically-free operation. But it can of course also be done without Narwhals, as you've done π
if (pd := sys.modules.get('pandas', None)) is not None and isinstance(df, pd.DataFrame):
# pandas-specific logic goes here
In fact, that's what Altair do, using is_pandas_dataframe
No hard feeling of course if you keep the current approach, I was wanted to point out the possibilities βΎοΈ All the best, and it was really fun meeting some of you at PyData London!
EDIT: upon further inspection, I was wrong about the pytensor part, Narwhals wouldn't help there (unless you used it in PyTensor too), I think it would only potentially help in convert_data
?
Before
No response
After
No response
Context for the issue:
With the rise in popularity of packages such as polars, arrow, and others PyMC's dependence on pandas is looking less and less universally useful. Narwhals was developed for developers of python libraries that consumes dataframes who wish wishing to make their libraries completely dataframe-agnostic. So maybe we consider using that instead of pandas per se? Narwhals has no dependencies and has negligible overhead, so it seems relatively lightweight.
It will require some refactoring as it relies on (as subset of) the polars API.