Question: idiomatic way of elegantly retrieving the underlying DataFrame type

narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

https://narwhals-dev.github.io/narwhals/

MIT License


Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

Open elephaint opened 13 hours ago

elephaint commented 13 hours ago

Currently I often have the following code:

import narwhals as nw

df_nw = nw.from_native(df)
is_pandas = nw.dependencies.is_pandas_dataframe(df)
...
if is_pandas:
    ...  # do something different
...

The difficulty is that I need to keep track of these is_pandas variables throughout the code, pass them to subfunctions, etc. If I have multiple DataFrames, I have multiple such is_pandas variables. Ideally, I'd be able to do something like:

is_pandas = df_nw.is_native_pandas

i.e., a boolean attribute on the Narwhals DataFrame that says whether the underlying dataframe is pandas. That would let me use df_nw everywhere, without the auxiliary variables and without first converting to native.

Of course, I know I can also do this everywhere: nw.dependencies.is_pandas_dataframe(df_nw.to_native()), but that feels convoluted.

What is the cleanest way to do this?

FBruzzesi commented 12 hours ago

Hey @elephaint, thanks for your request. This can certainly be a pain point for other libraries trying to adopt narwhals.

I would say that the answer is it depends.

We have a set of functions, namely maybe_align_index, maybe_get_index, maybe_set_index, maybe_reset_index and maybe_convert_dtypes, which are meant to help when working with pandas objects without having to check manually every time.

If that's not enough, nw.dependencies.is_pandas_dataframe(df_nw.to_native()) is an option. Another one would be df_nw._compliant_frame._implementation is Implementation.PANDAS, but it doesn't look less convoluted to me.

In plotly express, I had to do something similar, by adding a flag is_pd_like on the first encounter of the dataframe object, and passing that to various functions to branch out the logic.
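The flag-passing pattern described above could be sketched roughly as follows. This is not the plotly express implementation; `TaggedFrame`, `tag_frame` and `describe` are hypothetical names, and the detector is passed in as a parameter so the sketch runs without any dataframe library installed:

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class TaggedFrame:
    """Hypothetical wrapper: a frame plus a backend flag computed once."""

    frame: Any
    is_pandas: bool


def tag_frame(frame: Any, detect_pandas: Callable[[Any], bool]) -> TaggedFrame:
    # Compute the flag on the first encounter of the frame. In real code the
    # detector could be the public check from this thread, for example:
    #   lambda f: nw.dependencies.is_pandas_dataframe(nw.to_native(f))
    return TaggedFrame(frame=frame, is_pandas=detect_pandas(frame))


def describe(tagged: TaggedFrame) -> str:
    # Sub-functions branch on the flag carried by the wrapper instead of
    # receiving a separate is_pandas argument for every frame.
    return "pandas path" if tagged.is_pandas else "generic path"


# Stand-in detectors for demonstration only.
print(describe(tag_frame([1, 2, 3], lambda f: True)))
print(describe(tag_frame([1, 2, 3], lambda f: False)))
```

Bundling the flag with the frame means functions take one argument per dataframe, which addresses the "multiple is_pandas variables" concern at the cost of an extra wrapper type.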

MarcoGorelli commented 12 hours ago

Thanks for the request!

I think currently the two documented ways would be:

if nw.get_native_namespace(df) is nw.dependencies.get_pandas()
if nw.dependencies.is_pandas_dataframe(df.to_native())

I can see that it would be convenient to have something more ergonomic... 🤔 will think about this one. Thanks for having highlighted this

> another one would be df_nw._compliant_frame._implementation is Implementation.PANDAS but it doesn't look less convoluted to me.

Wait, this would be highly risky, as it relies on private attributes that may change at any time 😉 Better to stick with the public API, which we make some stability guarantees about.

elephaint commented 9 hours ago

Thanks for the discussion!

I think for now I'll go with nw.dependencies.is_pandas_dataframe(df.to_native())

Just to be clear - this is really a 'nice to have' but by no means very important to me, so don't make something crazy complex over this 😛

FBruzzesi commented 8 hours ago

Just out of curiosity for now, could you point to an example that requires a pandas-specific branch?