pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

ENH: Replace pandas dependence/use with narwhals #7462

Open cluhmann opened 1 month ago

cluhmann commented 1 month ago

Before

No response

After

No response

Context for the issue:

With the rise in popularity of packages such as polars, arrow, and others, PyMC's dependence on pandas is looking less and less universally useful. Narwhals was developed for developers of Python libraries that consume dataframes and who wish to make their libraries completely dataframe-agnostic. So maybe we should consider using that instead of pandas per se? Narwhals has no dependencies and negligible overhead, so it seems relatively lightweight.

It will require some refactoring, as it relies on (a subset of) the polars API.

ricardoV94 commented 1 month ago

Last time I checked we don't really depend on pandas, it's arviz that does

cluhmann commented 1 month ago

Does that mean that we can use polars dataframes when using PyMC?

ricardoV94 commented 1 month ago

> Does that mean that we can use polars dataframes when using PyMC?

As data? Not sure, there's some special logic for handling pandas.

But PyMC does not depend on pandas, so maybe you are requesting a new feature, not a change of dependency

ricardoV94 commented 1 month ago

Btw pandas special logic is:

  1. Dispatch so pt.as_tensor accepts pandas series/matrices

I don't think we can replace that by narwhals since dispatch works on types at runtime. We would need to dispatch on polars as well.

  2. Special logic in pm.data? and observed for detecting nans and triggering automatic imputation. This could possibly be backend agnostic although that code is pretty spaghetti, so someone would need to check.

2.1 It may actually already work because IIRC it's all based on duck typing
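To illustrate the point about runtime dispatch, here is a minimal sketch using the standard library's `functools.singledispatch`. It assumes the conversion function dispatches on the concrete type of its argument; `as_tensor_sketch`, `PandasLikeSeries`, and `PolarsLikeSeries` are hypothetical stand-ins, not PyTensor, pandas, or polars APIs.

```python
from functools import singledispatch


class PandasLikeSeries:
    """Hypothetical stand-in for a pandas Series."""

    def __init__(self, values):
        self.values = list(values)


class PolarsLikeSeries:
    """Hypothetical stand-in for a polars Series."""

    def __init__(self, values):
        self._values = list(values)

    def to_list(self):
        return list(self._values)


@singledispatch
def as_tensor_sketch(x):
    # Fallback for types that have no registered handler.
    raise TypeError(f"Cannot convert {type(x).__name__}")


@as_tensor_sketch.register
def _(x: list):
    # Toy "tensor": a tagged tuple, just to make the result visible.
    return ("tensor", tuple(x))


@as_tensor_sketch.register
def _(x: PandasLikeSeries):
    return as_tensor_sketch(x.values)


def try_convert(x):
    """Return the converted value, or None if no handler is registered."""
    try:
        return as_tensor_sketch(x)
    except TypeError:
        return None


# The polars-like type is not registered yet, so dispatch falls through.
before = try_convert(PolarsLikeSeries([1, 2]))


@as_tensor_sketch.register
def _(x: PolarsLikeSeries):
    return as_tensor_sketch(x.to_list())


# After registering a handler for the new type, the same call succeeds.
after = try_convert(PolarsLikeSeries([1, 2]))
```

Under this assumption, each dataframe library's type needs its own registered handler, which is why a compatibility layer alone would not remove the need to dispatch on polars as well.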

fonnesbeck commented 4 weeks ago

Yeah, was taking a peek at this today -- if we were using pandas-specific functionality (merging, etc.) then it would make sense to use narwhals. For the most part we are taking DataFrames (and Series) and turning them into ndarrays. The only exception may be in deriving dims from indexes, since polars does not use indexes.
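The index caveat can be sketched with duck typing. This is only an illustration under stated assumptions: `IndexedFrame`, `PlainFrame`, and `row_coords` are hypothetical names, and the positional fallback is one possible choice, not what PyMC does.

```python
class IndexedFrame:
    """Hypothetical pandas-like frame: carries row labels in `.index`."""

    def __init__(self, rows, index):
        self.rows = rows
        self.index = index

    def __len__(self):
        return len(self.rows)


class PlainFrame:
    """Hypothetical polars-like frame: no index attribute at all."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)


def row_coords(frame):
    """Derive coordinate labels for the row dimension via duck typing."""
    index = getattr(frame, "index", None)
    if index is not None:
        # Index-bearing frames supply their own labels.
        return list(index)
    # Assumed fallback: plain positional labels when there is no index.
    return list(range(len(frame)))


labelled = row_coords(IndexedFrame([[1.0], [2.0]], index=["a", "b"]))
positional = row_coords(PlainFrame([[1.0], [2.0]]))
```

With these stand-ins, the index-bearing frame yields its own labels, while the index-free frame can only offer row positions.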

MarcoGorelli commented 4 weeks ago

Hey, thanks for looking into Narwhals 🙏

First, thanks for opening https://github.com/pymc-devs/pymc/pull/7463, it's great to see Polars support come along - facilitating that was one of my goals with Narwhals, and if it can happen even without it, even better 💪

I think #7463 is already a net-positive, I just wanted to leave some comments, in case they're of interest:

No hard feelings of course if you keep the current approach, I just wanted to point out the possibilities ♾️ All the best, and it was really fun meeting some of you at PyData London!

EDIT: upon further inspection, I was wrong about the pytensor part, Narwhals wouldn't help there (unless you used it in PyTensor too), I think it would only potentially help in convert_data?