narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License
440 stars 76 forks source link

Research / POC: would Narwhals work with Fairlearn? #204

Open MarcoGorelli opened 4 months ago

MarcoGorelli commented 4 months ago

Here's the project: https://github.com/fairlearn/fairlearn

I'd be curious to see if it could be Narwhalified

Important note: please do not make any PRs to fairlearn itself. This is just intended as an exercise to find missing things in Narwhals which we might want to add in order to make it more useful. We have not discussed anything with Fairlearn devs, so we should not make unsolicited PRs to them.

The task here is:

Once again, just to make sure there's no misunderstandings, please do not make any PRs to Fairlearn itself. This is just intended as an exploratory piece to see what we're missing in Narwhals - let's not disturb Fairlearn devs

EdAbati commented 4 months ago

Hey @MarcoGorelli :) this sounds fun, I'd like to help.

I'll go through fairlearn's modules and update you here

MarcoGorelli commented 4 months ago

Woohoo that's great! Multiple people can investigate this together and then we can share findings

EdAbati commented 4 months ago

Please see below the scripts where pandas is used (based on grep -rl --include=\*.py "import pandas" fairlearn/*)

I'll slowly update this comment with my findings. If anyone else is interested, feel free to help (maybe priotising the scripts with no comments) πŸ™‚

βœ… -> the script can be "narwhalified" ⚠️ -> missing feature

MarcoGorelli commented 4 months ago

wow, nice one @EdAbati ! just to check, for the ones you updated, do the tests still pass?

EdAbati commented 4 months ago

Your comment made me run all the tests to double check and I indeed found something missing.

πŸ’‘ Note to self (and to anyone who will help): always run all the tests to make sure nothing breaks. It looks like some methods/functions don't have a corresponding test. One can only test them by running tests for the other methods/functions that depend on them.

UPDATE: the below is not true. we can just use list(<series>)

Based on the notes on fairlearn/metrics/_annotated_metric_function.py, it could be useful to have a .to_list() for Series. See example for converting columns of nested lists to 2d arrays

>>> import pandas as pd
>>> import polars as pl
>>> import numpy as np
>>> pd_df = pd.DataFrame({
...     "d": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
... })
>>> pd_df["d"].to_numpy()
array([list([0.1, 0.2]), list([0.3, 0.4]), list([0.5, 0.6]),
       list([0.7, 0.8]), list([0.9, 1.0])], dtype=object)
>>> np.asarray(pd_df["d"].to_list())
array([[0.1, 0.2],
       [0.3, 0.4],
       [0.5, 0.6],
       [0.7, 0.8],
       [0.9, 1. ]])
>>> pl_df = pl.DataFrame(pd_df)
>>> pl_df["d"].to_numpy()
array([array([0.1, 0.2]), array([0.3, 0.4]), array([0.5, 0.6]),
       array([0.7, 0.8]), array([0.9, 1. ])], dtype=object)
>>> np.asarray(pl_df["d"].to_list())
array([[0.1, 0.2],
       [0.3, 0.4],
       [0.5, 0.6],
       [0.7, 0.8],
       [0.9, 1. ]])
MarcoGorelli commented 4 months ago

thanks @EdAbati ! does it not work to do np.asarray(list(df[data_arg_name])) like they currently do? I think that should work, even if df is a narwhals dataframe

EdAbati commented 4 months ago

Oh yeah you are right. πŸ˜… I don't know why my brain thought it was less "safe". .to_list is not need then. Updated the comment

MarcoGorelli commented 4 months ago

πŸ˜„ thanks! df[col_name] only works for eager dataframes (not lazy ones) so you'll need to pass eager_only=True to nw.from_native if you use that

cool stuff anyway, looks promising! thanks for spearheading this!

EdAbati commented 4 months ago

Couple of consideration regarding the metrics submodule. (I'm learning polars and fairlearn thanks to this issue so I may miss things, very happy to be corrected :) )

All changes in commit https://github.com/EdAbati/fairlearn/commit/7fa996178c207aa27be07f161a8c505166225b3c

EdAbati commented 4 months ago

Questions out of curiosity (it is not needed for this effort)

I was looking at the column names (because of this)

I noticed that a narwhals.Series name is always a str, even if the native Series is a pandas.Series with an int name. It doesn't seem the case for narwhals.DataFrames, .columns is not always a list[str]. What would be the best way to get a nw.Series by selecting a column with an integer name? Should it even be supported? Should narwhals.DataFrame convert column names to string too? or maybe raise a warning/error?

>>> import narwhals as nw
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], 1: [4, 5, 6]})
>>> df.columns
Index(['a', 1], dtype='object')
>>> nw.from_native(df[1], allow_series=True).name
'1'
>>> nw_df = nw.from_native(df)
>>> nw_df.columns
['a', 1]
>>> nw_df[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/edoardoabati/projects/fairlearn/.venv/lib/python3.11/site-packages/narwhals/dataframe.py", line 389, in __getitem__
    raise TypeError(msg)
TypeError: Expected str, range or slice, got: <class 'int'>
>>> nw_df.select(nw.col(1)) # this works, but is a DataFrame not a Series
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
| Narwhals DataFrame                            |
| Use `narwhals.to_native` to see native output |
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
MarcoGorelli commented 4 months ago

thanks @EdAbati - that looks like an issue, and probably something I'd added in the early days whilst still figuring out the design. I think we should just preserve the original Series' name

the type annotation for Series.name can say -> str, that's what Polars has and it's fine - anything outside of that is technically outside of Narwhals' API. but, we want to preserve as much as possible of what the user has, so let's just keep what they started with

the fix is probably just to replace

https://github.com/narwhals-dev/narwhals/blob/56f7eaa4028b82ee40e308ccc36944b574d0c4db/narwhals/_pandas_like/series.py#L81

with self._name = series.name

do you want to make a PR with a little test?

EdAbati commented 4 months ago

Still have to see if I can find a workaround, but one method missing is .transpose(). It is used here

EdAbati commented 4 months ago

Hi @ugohuche thanks for helping out πŸ™‚

I'll update my comment with your name.

I checked your commit and FYI I'm taking a different approach (maybe not the ideal one).

I think the goal of the task is to see how narwhals can be used so that a user can pass other dataframes types to the fairlearn methods/functions. Therefore, given that X and y can be numpy arrays, pandas, polars, etc DataFrames, I am converting them to narwhals and try to do data manipulation in the "narwhals world". If the class I am working with is creating DataFrames/Series internally, I create these objects based on the type of the input. Example: if X is polars, then everything the class creates internally should be a polar dataframe/series.

I'm taking inspiration from sklego, e.g. here or here, and I did something like this for fairlearn

What do you think? Please anyone correct me if I'm missing something.

I am currently checking other modules, and what I am trying to do is not always straightforward or as @MarcoGorelli said we may see something that narwhals does not yet support.

ugohuche commented 4 months ago

Hi @EdAbati, thanks for clarifying. I'll look at it from that approach.