Open WillAyd opened 3 months ago
@jorisvandenbossche maybe a good follow up to the discussion we had as part of PDEP-14
@pandas-dev/pandas-core this wasn't major enough to include as part of PDEP-14, but I think it is a logical follow-up to clean up semantics. Curious what others may think
I think it should be `pandas_nullable`. Keeps options open with respect to the whole pd.NA/np.nan discussion
Hey! I’d like to work on this
Take
Any other team feedback on this? I think it would be good to use the new name starting with 3.0
We have `pandas.api.types.pandas_dtype`, where we "Convert input into a pandas only dtype object" ... and this returns an `np.dtype` or a pandas dtype.
Given that the term "pandas dtype" already has a precedent, using `dtype_backend="pandas"` would indeed align well with existing conventions. It provides clarity and maintains consistency.
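To illustrate the precedent mentioned above, here is a small sketch (assuming pandas and NumPy are installed) showing that `pandas_dtype` can return either kind of dtype object:

```python
# pandas.api.types.pandas_dtype accepts a string (or type) and returns
# either a plain NumPy dtype or a pandas ExtensionDtype, depending on
# the input -- both are "pandas dtypes" in the sense discussed above.
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

np_result = pandas_dtype("int64")  # lowercase "int64" -> NumPy dtype
pd_result = pandas_dtype("Int64")  # capitalized "Int64" -> nullable extension dtype

assert isinstance(np_result, np.dtype)
assert isinstance(pd_result, pd.api.extensions.ExtensionDtype)
assert pd_result == pd.Int64Dtype()
```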
I also have a slight preference for `pandas` because it is shorter, and I don't see us ever introducing a non-nullable type system, so "_nullable" is superfluous
On the other hand, when specifying `dtype_backend="pyarrow"`, you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.
So I don't think `dtype_backend="pandas"` is an ideal naming, but I also don't have any better suggestion ..
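The point above can be checked directly. A minimal sketch (assuming pandas 2.x, where `pd.ArrowDtype` exists) showing that all three type systems expose pandas ExtensionDtype subclasses:

```python
# Both the masked ("numpy_nullable") dtypes and the Arrow-backed dtypes
# are subclasses of pandas' ExtensionDtype, so "a pandas dtype" does not
# uniquely identify the masked type system.
import pandas as pd
from pandas.api.extensions import ExtensionDtype

assert issubclass(pd.Int64Dtype, ExtensionDtype)   # "numpy_nullable" backend
assert issubclass(pd.ArrowDtype, ExtensionDtype)   # "pyarrow" backend
assert issubclass(pd.StringDtype, ExtensionDtype)  # string dtype
```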
masked
> On the other hand, when specifying `dtype_backend="pyarrow"`, you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.

That's true as a matter of implementation, but I don't think end users are going to know that

> So I don't think `dtype_backend="pandas"` is an ideal naming, but I also don't have any better suggestion ..
I did suggest `pandas_nullable` above. I think I may have been the one to introduce the word "nullable" into our lexicon. So if we use `pandas_nullable`, it's clear that we are storing a pandas representation of missing values in the backend. I'm concerned that just using `pandas` could prevent some other usage that we don't see now, but want to introduce in the future.
That's a fair point, though I'm not sure that adding _nullable prevents that. I think that would only prevent an issue if we decided to offer non-nullable types
Or offer something else that we can't foresee today
> On the other hand, when specifying `dtype_backend="pyarrow"`, you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes. So I don't think `dtype_backend="pandas"` is an ideal naming, but I also don't have any better suggestion ..
PyArrow types indeed are pandas extension types, enhancing the functionality of the base PyArrow library to suit our use case of backing DataFrames or Series.
We don't always rigidly adhere to the behavior of NumPy arrays for a Series with a NumPy dtype. We allow expansion, upcasting, and other conversions that may diverge from NumPy behavior, even though we return a NumPy type as the dtype.
But I see no problems when we use the terms "pyarrow" or "numpy" when we talk about the backend. So it would seem reasonable to me to use the term "pandas" to describe the pandas nullable extension types.
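As a concrete illustration of that divergence (a sketch, assuming only pandas and NumPy are installed): reindexing an int64 Series introduces missing values and silently upcasts the result to float64, which a bare NumPy array would never do on its own:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")
expanded = s.reindex([0, 1, 2, 3])  # label 3 is new -> NaN is inserted

assert s.dtype == np.dtype("int64")
# The result is upcast to float64 so it can hold NaN, even though
# the original Series advertised a NumPy int64 dtype.
assert expanded.dtype == np.dtype("float64")
assert expanded.isna().sum() == 1
```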
> I did suggest `pandas_nullable` above. I think I may have been the one to introduce the word "nullable" into our lexicon. So if we use `pandas_nullable`, it's clear that we are storing a pandas rep of missing values in the backend. I'm concerned that just using `pandas` could prevent some other usage that we don't see now, but want to introduce in the future.
The `dtype_backend` argument is forward-thinking, enabling early adoption of experimental data types that aren't currently the default.
Presently, the available options for `dtype_backend` in I/O methods and `.convert_dtypes` are limited to `'numpy_nullable'` and `'pyarrow'`.
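For reference, a minimal sketch of the current API (assuming pandas ≥ 2.0; the `"pyarrow"` value additionally requires the pyarrow package, so only `"numpy_nullable"` is exercised here):

```python
import pandas as pd

s = pd.Series([1, 2, None])  # inferred as float64, with NaN for the missing value
masked = s.convert_dtypes(dtype_backend="numpy_nullable")

assert str(masked.dtype) == "Int64"  # masked nullable integer dtype
assert masked.isna().sum() == 1      # the missing value is preserved (as pd.NA)
```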
If we aim to allow users to continue using legacy types even when nullable types become the default, introducing an additional argument makes sense.
Considering package names, wouldn't options like `pyarrow`, `pandas`, and `numpy` be meaningful, clear, concise, and consistent choices?
I'm on board with what @simonjayhawkins is suggesting - pyarrow, pandas, and numpy as arguments reflect the core of the type system evolution, even if they may not be 100% technically accurate
If we do decide on those terms, I also wonder if we should change the default value of `None` to `"numpy"`
Feature Type
[ ] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
Many I/O methods today accept a "numpy_nullable" argument for the dtype_backend= parameter. While historically our extension arrays exclusively used NumPy, this is no longer true with the string dtype, so the name "numpy_nullable" is a misnomer.
Feature Description
To make for a less confusing API, I would suggest adding "pandas_nullable" or maybe even just "pandas" as an argument. This can have the exact same behavior as "numpy_nullable" today but abstracts and corrects the semantics. "numpy_nullable" can be slowly deprecated over time.
Alternative Solutions
n/a
Additional Context
dtype_backend="pandas" would also make for a smoother transition into the logical type system proposed as part of PDEP-13 https://github.com/pandas-dev/pandas/pull/58455
...but even if that PDEP is not accepted, I still see value in changing the value "numpy_nullable" to something else