Open MarcoGorelli opened 4 months ago
Hey @MarcoGorelli :) this sounds fun, I'd like to help.
I'll go through fairlearn
's modules and update you here
Woohoo that's great! Multiple people can investigate this together and then we can share findings
Please see below the scripts where pandas
is used (based on grep -rl --include=\*.py "import pandas" fairlearn/*
)
I'll slowly update this comment with my findings. If anyone else is interested, feel free to help (maybe priotising the scripts with no comments) π
β -> the script can be "narwhalified" β οΈ -> missing feature
.iloc
alternative in narwhals?DataFrame.sample
is requiredsklearn
when the matrix X and y are None
. These error messages may have to change too. If we pass None
to nw.from_native
we get an error from narwhals
.wow, nice one @EdAbati ! just to check, for the ones you updated, do the tests still pass?
Your comment made me run all the tests to double check and I indeed found something missing.
π‘ Note to self (and to anyone who will help): always run all the tests to make sure nothing breaks. It looks like some methods/functions don't have a corresponding test. One can only test them by running tests for the other methods/functions that depend on them.
UPDATE: the below is not true. we can just use list(<series>)
Based on the notes on fairlearn/metrics/_annotated_metric_function.py
, it could be useful to have a .to_list()
for Series.
See example for converting columns of nested lists to 2d arrays
>>> import pandas as pd
>>> import polars as pl
>>> import numpy as np
>>> pd_df = pd.DataFrame({
... "d": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
... })
>>> pd_df["d"].to_numpy()
array([list([0.1, 0.2]), list([0.3, 0.4]), list([0.5, 0.6]),
list([0.7, 0.8]), list([0.9, 1.0])], dtype=object)
>>> np.asarray(pd_df["d"].to_list())
array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6],
[0.7, 0.8],
[0.9, 1. ]])
>>> pl_df = pl.DataFrame(pd_df)
>>> pl_df["d"].to_numpy()
array([array([0.1, 0.2]), array([0.3, 0.4]), array([0.5, 0.6]),
array([0.7, 0.8]), array([0.9, 1. ])], dtype=object)
>>> np.asarray(pl_df["d"].to_list())
array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6],
[0.7, 0.8],
[0.9, 1. ]])
thanks @EdAbati ! does it not work to do np.asarray(list(df[data_arg_name]))
like they currently do? I think that should work, even if df
is a narwhals dataframe
Oh yeah you are right. π
I don't know why my brain thought it was less "safe".
.to_list
is not need then. Updated the comment
π thanks! df[col_name]
only works for eager dataframes (not lazy ones) so you'll need to pass eager_only=True
to nw.from_native
if you use that
cool stuff anyway, looks promising! thanks for spearheading this!
Couple of consideration regarding the metrics
submodule. (I'm learning polars
and fairlearn
thanks to this issue so I may miss things, very happy to be corrected :) )
pandas
because of MetricFrame
. It creates internally a pandas.DataFrame
that is used for example to store metrics values per group . This dataframe is also used for visualisation (example)
IMO we should leave this internal dataframe as-is, given that it is not provided by the user, and (I think) will tend to always be small.MetricFrame
that is provided by the user and could be coming from "non-pandas-dataframes" are sensitive_features
and control_features
. I think here we should use narwhals
to allow different inputs types, but then convert these into the pandas
world to match the "internal dataframe".All changes in commit https://github.com/EdAbati/fairlearn/commit/7fa996178c207aa27be07f161a8c505166225b3c
Questions out of curiosity (it is not needed for this effort)
I was looking at the column names (because of this)
I noticed that a narwhals.Series
name is always a str
, even if the native Series is a pandas.Series
with an int
name.
It doesn't seem the case for narwhals.DataFrame
s, .columns
is not always a list[str]
.
What would be the best way to get a nw.Series
by selecting a column with an integer name? Should it even be supported?
Should narwhals.DataFrame
convert column names to string too? or maybe raise a warning/error?
>>> import narwhals as nw
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], 1: [4, 5, 6]})
>>> df.columns
Index(['a', 1], dtype='object')
>>> nw.from_native(df[1], allow_series=True).name
'1'
>>> nw_df = nw.from_native(df)
>>> nw_df.columns
['a', 1]
>>> nw_df[1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/edoardoabati/projects/fairlearn/.venv/lib/python3.11/site-packages/narwhals/dataframe.py", line 389, in __getitem__
raise TypeError(msg)
TypeError: Expected str, range or slice, got: <class 'int'>
>>> nw_df.select(nw.col(1)) # this works, but is a DataFrame not a Series
βββββββββββββββββββββββββββββββββββββββββββββββββ
| Narwhals DataFrame |
| Use `narwhals.to_native` to see native output |
βββββββββββββββββββββββββββββββββββββββββββββββββ
thanks @EdAbati - that looks like an issue, and probably something I'd added in the early days whilst still figuring out the design. I think we should just preserve the original Series' name
the type annotation for Series.name
can say -> str
, that's what Polars has and it's fine - anything outside of that is technically outside of Narwhals' API. but, we want to preserve as much as possible of what the user has, so let's just keep what they started with
the fix is probably just to replace
with self._name = series.name
do you want to make a PR with a little test?
Still have to see if I can find a workaround, but one method missing is .transpose()
.
It is used here
Hi @ugohuche thanks for helping out π
I'll update my comment with your name.
I checked your commit and FYI I'm taking a different approach (maybe not the ideal one).
I think the goal of the task is to see how narwhals
can be used so that a user can pass other dataframes types to the fairlearn
methods/functions.
Therefore, given that X
and y
can be numpy arrays, pandas, polars, etc DataFrames, I am converting them to narwhals and try to do data manipulation in the "narwhals world".
If the class I am working with is creating DataFrames/Series internally, I create these objects based on the type of the input.
Example: if X is polars, then everything the class creates internally should be a polar dataframe/series.
I'm taking inspiration from sklego
, e.g. here or here, and I did something like this for fairlearn
What do you think? Please anyone correct me if I'm missing something.
I am currently checking other modules, and what I am trying to do is not always straightforward or as @MarcoGorelli said we may see something that narwhals
does not yet support.
Hi @EdAbati, thanks for clarifying. I'll look at it from that approach.
Here's the project: https://github.com/fairlearn/fairlearn
I'd be curious to see if it could be Narwhalified
Important note: please do not make any PRs to fairlearn itself. This is just intended as an exercise to find missing things in Narwhals which we might want to add in order to make it more useful. We have not discussed anything with Fairlearn devs, so we should not make unsolicited PRs to them.
The task here is:
Once again, just to make sure there's no misunderstandings, please do not make any PRs to Fairlearn itself. This is just intended as an exploratory piece to see what we're missing in Narwhals - let's not disturb Fairlearn devs