pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: New Name for "numpy_nullable" dtype_backend #59032

Open WillAyd opened 3 months ago

WillAyd commented 3 months ago

Feature Type

Problem Description

Many I/O methods today accept "numpy_nullable" as a value for the dtype_backend= parameter. Historically our extension arrays were backed exclusively by NumPy, but this is no longer true with the string dtype, so the name "numpy_nullable" is a misnomer.
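As a rough illustration of the current behavior (a minimal sketch, assuming pandas 2.0 or later; the CSV content is made up for the example), reading with dtype_backend="numpy_nullable" returns masked integer and float columns alongside the string dtype, whose storage is not necessarily NumPy:

```python
import io

import pandas as pd

# Made-up CSV with a missing value in every column.
csv_text = "a,b,c\n1,x,1.5\n,y,\n3,,2.5\n"

df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

# Integer and float columns come back as masked extension dtypes,
# so the missing entries are pd.NA rather than forcing a float upcast.
print(df.dtypes)

# The text column uses pandas' string dtype; depending on configuration
# its storage can be pyarrow rather than NumPy, hence the misnomer.
print(type(df["b"].dtype))
```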

Feature Description

To make for a less confusing API, I would suggest adding "pandas_nullable" or maybe even just "pandas" as an accepted value. It can have exactly the same behavior as "numpy_nullable" has today while correcting the semantics, and "numpy_nullable" can be slowly deprecated over time.
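One way the alias plus slow deprecation could look is sketched below. This is not pandas code: resolve_dtype_backend, the alias table, and the warning text are all hypothetical names invented for illustration, and the choice of DeprecationWarning is an assumption.

```python
import warnings

# Hypothetical alias table: proposed spelling -> backend name used today.
_ALIASES = {"pandas_nullable": "numpy_nullable", "pandas": "numpy_nullable"}
# Hypothetical set of spellings that would warn during the deprecation period.
_DEPRECATED = {"numpy_nullable"}
# Values that dtype_backend= accepts today.
_VALID = {"numpy_nullable", "pyarrow"}


def resolve_dtype_backend(value):
    """Map a user-supplied dtype_backend value to the existing backend name."""
    if value is None:
        # No backend requested: keep the default behavior.
        return None
    if value in _DEPRECATED:
        # Old spelling still works but nudges users toward the new one.
        warnings.warn(
            f"dtype_backend={value!r} is deprecated; use 'pandas_nullable'",
            DeprecationWarning,
            stacklevel=2,
        )
        return value
    canonical = _ALIASES.get(value, value)
    if canonical not in _VALID:
        raise ValueError(f"invalid dtype_backend: {value!r}")
    return canonical


print(resolve_dtype_backend("pandas_nullable"))
```

Because the new names resolve to the existing backend before any I/O code runs, the readers themselves would not need to change under this approach.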

Alternative Solutions

n/a

Additional Context

dtype_backend="pandas" would also make for a smoother transition into the logical type system proposed as part of PDEP-13 https://github.com/pandas-dev/pandas/pull/58455

...but even if that PDEP is not accepted, I still see value in changing the value "numpy_nullable" to something else

WillAyd commented 3 months ago

@jorisvandenbossche maybe a good follow up to the discussion we had as part of PDEP-14

WillAyd commented 2 months ago

@pandas-dev/pandas-core this wasn't major enough to include as part of PDEP-14, but I think it is a logical follow-up to clean up semantics. Curious what others may think

Dr-Irv commented 2 months ago

I think it should be pandas_nullable. Keeps options open with respect to the whole pd.NA/np.nan discussion

chaarvii commented 2 months ago

Hey! I’d like to work on this

chaarvii commented 2 months ago

Take

WillAyd commented 2 months ago

Any other team feedback on this? I think it would be good to use the new name starting with 3.0

simonjayhawkins commented 2 months ago

We have pandas.api.types.pandas_dtype, which is documented as "Convert input into a pandas only dtype object ..." and which returns either an np.dtype or a pandas dtype.

Given that the term “pandas dtype” already has a precedent, using dtype_backend="pandas" would indeed align well with existing conventions. It provides clarity and maintains consistency.
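For reference, a short sketch of what pandas_dtype returns for a NumPy-style name and for a nullable-style name (the two input strings are just examples):

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

# A NumPy-style name resolves to a plain np.dtype.
numpy_result = pandas_dtype("int64")

# The capitalized name resolves to the masked (nullable) extension dtype.
masked_result = pandas_dtype("Int64")

print(type(numpy_result), type(masked_result))
```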

WillAyd commented 2 months ago

I also have a slight preference for pandas because it is shorter, and I don't see us ever introducing a non-nullable type system, so "_nullable" is superfluous

jorisvandenbossche commented 2 months ago

On the other hand, when specifying dtype_backend="pyarrow", you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.

So I don't think dtype_backend="pandas" is an ideal naming, but I also don't have any better suggestion ..
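This can be checked directly. The sketch below assumes pandas 2.0 or later, and as far as I know referencing the pd.ArrowDtype class itself (without instantiating it) does not require pyarrow to be installed:

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

# The dtype class behind dtype_backend="pyarrow" is a pandas ExtensionDtype.
arrow_is_pandas_dtype = issubclass(pd.ArrowDtype, ExtensionDtype)

# So is the masked integer dtype behind dtype_backend="numpy_nullable".
masked_is_pandas_dtype = issubclass(pd.Int64Dtype, ExtensionDtype)

# And so are some dtypes used by default, outside either backend.
categorical_is_pandas_dtype = issubclass(pd.CategoricalDtype, ExtensionDtype)
datetimetz_is_pandas_dtype = issubclass(pd.DatetimeTZDtype, ExtensionDtype)

print(
    arrow_is_pandas_dtype,
    masked_is_pandas_dtype,
    categorical_is_pandas_dtype,
    datetimetz_is_pandas_dtype,
)
```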

jbrockmendel commented 2 months ago

masked

WillAyd commented 2 months ago

> On the other hand, when specifying dtype_backend="pyarrow", you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.

That's true as a matter of implementation, but I don't think end users are going to know that

Dr-Irv commented 2 months ago

> So I don't think dtype_backend="pandas" is an ideal naming, but I also don't have any better suggestion ..

I did suggest pandas_nullable above. I think I may have been the one to introduce the word "nullable" into our lexicon. So if we use pandas_nullable, it's clear that we are storing a pandas rep of missing values in the backend. I'm concerned that just using pandas could prevent some other usage that we don't see now, but want to introduce in the future.

WillAyd commented 2 months ago

That's a fair point, though I'm not sure that adding _nullable prevents that. I think that would only prevent an issue if we decided to offer non-nullable types

Dr-Irv commented 2 months ago

> That's a fair point, though I'm not sure that adding _nullable prevents that. I think that would only prevent an issue if we decided to offer non-nullable types

Or offer something else that we can't foresee today

simonjayhawkins commented 2 months ago

> On the other hand, when specifying dtype_backend="pyarrow", you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.
>
> So I don't think dtype_backend="pandas" is an ideal naming, but I also don't have any better suggestion ..

PyArrow types indeed are pandas extension types, enhancing the functionality of the base PyArrow library to suit our use case of backing DataFrames or Series.

We don't always rigidly adhere to the behavior of NumPy arrays for a Series with a NumPy dtype. We allow expansion, upcasting, and other conversions that may diverge from NumPy behavior, even though we return a NumPy type as the dtype.

But I see no problems when we use the terms "pyarrow" or "numpy" when we talk about the backend. So it would seem reasonable to me to use the term "pandas" to describe the pandas nullable extension types.

> I did suggest pandas_nullable above. I think I may have been the one to introduce the word "nullable" into our lexicon. So if we use pandas_nullable, it's clear that we are storing a pandas rep of missing values in the backend. I'm concerned that just using pandas could prevent some other usage that we don't see now, but want to introduce in the future.

The dtype_backend argument is forward-thinking, enabling early adoption of experimental data types that aren't currently the default.

Presently, the available options for dtype_backend in I/O methods and .convert_dtypes are limited to 'numpy_nullable' and 'pyarrow'.
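For the .convert_dtypes side, a minimal sketch of the 'numpy_nullable' option (assuming pandas 2.0 or later; the frame contents are made up, and the 'pyarrow' option is left out so the example does not need pyarrow installed):

```python
import pandas as pd

# Made-up frame using the default NumPy dtypes (int64 and float64 with NaN).
df = pd.DataFrame({"n": [1, 2, 3], "x": [1.5, None, 2.5]})

# Opt in to the nullable extension dtypes for the same data.
converted = df.convert_dtypes(dtype_backend="numpy_nullable")

print(df.dtypes)
print(converted.dtypes)
```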

If we aim to allow users to continue using legacy types even when nullable types become the default, introducing an additional argument makes sense.

Considering package names, wouldn't options like pyarrow, pandas, and numpy be meaningful, clear, concise, and consistent choices?

WillAyd commented 2 months ago

I'm on board with what @simonjayhawkins is suggesting - pyarrow, pandas, and numpy as arguments reflect the core of the type system evolution, even if they may not be 100% technically accurate

WillAyd commented 2 months ago

If we do decide on those terms, I also wonder if we should change the default value of None to "numpy"