pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Consistent naming conventions for string dtype aliases #58141

Open WillAyd opened 5 months ago

WillAyd commented 5 months ago

Problem Description

Right now the string aliases for our types are inconsistent:

>>> import pandas as pd
>>> pd.Series(range(3), dtype="int8")  # NumPy type
>>> pd.Series(range(3), dtype="Int8")  # Pandas extension type
>>> pd.Series(range(3), dtype="int8[pyarrow]") # Arrow type

Strings have a similar inconsistency with "string", "string[pyarrow]" and "string[pyarrow_numpy]"
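
For illustration, the three string spellings in question (all accepted in pandas 2.x):

>>> pd.Series(["a", "b"], dtype="string")                 # pd.StringDtype, Python-object storage
>>> pd.Series(["a", "b"], dtype="string[pyarrow]")        # pd.StringDtype, pyarrow storage
>>> pd.Series(["a", "b"], dtype="string[pyarrow_numpy]")  # pyarrow storage, NumPy (NaN) nullability semantics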

Feature Description

I think we should create "int8[numpy]" and "int8[pandas]" aliases to stay consistent with pyarrow. This also has the advantage of decoupling "int8" from NumPy, so perhaps in the future we could let the backend setting determine whether NumPy or pyarrow types are returned. A sketch of how that could look is below.
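
Note that "int8[numpy]" and "int8[pandas]" are hypothetical aliases that do not exist today:

>>> pd.Series(range(3), dtype="int8[numpy]")    # hypothetical: NumPy-backed dtype
>>> pd.Series(range(3), dtype="int8[pandas]")   # hypothetical: masked extension dtype
>>> pd.Series(range(3), dtype="int8[pyarrow]")  # already works today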

The pattern thus becomes "data_type[backend]", with the exception of "string[pyarrow_numpy]", which combines the backend and nullability semantics. I am less sure what to do in that case - maybe even that should be called "string[pyarrow, numpy]", where the second argument is the nullability?

In any case I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.

@phofl

Alternative Solutions

n/a

Additional Context

No response

WillAyd commented 5 months ago

Meant to tag @jorisvandenbossche

jbrockmendel commented 5 months ago

i like this idea, though as i mentioned at the sprint i think we should avoid "backend". maybe dtype "family"?

WillAyd commented 5 months ago

Maybe "type provider"?

WillAyd commented 4 months ago

Thinking through this some more, the type_category[type_provider, nullability_provider] pattern I suggested above won't always work, because there are still types that accept more arguments, e.g. datetime, pa.list_, pa.dictionary, etc.

I am wondering now if it is even worth trying to support string aliases or if we should push users towards using a more explicit dtype construction. This would be a change from where we are today but could be better in the long run (?)
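
For reference, this is roughly what the more explicit constructions look like with today's public API, including a parameterized type where a flat "type[provider]" alias stops scaling:

import numpy as np
import pandas as pd
import pyarrow as pa

pd.Series(range(3), dtype=np.dtype("int8"))          # NumPy dtype object
pd.Series(range(3), dtype=pd.Int8Dtype())            # masked extension dtype
pd.Series(range(3), dtype=pd.ArrowDtype(pa.int8()))  # pyarrow-backed dtype

# Parameterized types have no obvious flat alias spelling
pd.Series([["a"], ["b"]], dtype=pd.ArrowDtype(pa.list_(pa.string())))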

WillAyd commented 4 months ago

As an exercise I tried to map out all of the types that pandas supports today (or reasonably could in the near term) and place them in a hierarchy. Here is what I was able to come up with:

[image: pandas type system hierarchy]

Tagging @pandas-dev/pandas-core in case this is of use to the larger team

graphviz used to build this:

digraph type_graph {
  node [shape=box];
  "type"
  "type" -> "scalar"
  "scalar" -> "numeric"
  "numeric" -> "integral"
  "integral" -> "signed"
  subgraph cluster_signed {
    edge [style=invis]
    node [fillcolor="lightgreen" style=filled] "np.int8";
    node [fillcolor="lightgreen" style=filled] "np.int16";
    node [fillcolor="lightgreen" style=filled] "np.int32";
    node [fillcolor="lightgreen" style=filled] "np.int64";
    node [fillcolor="lightblue" style=filled] "pd.Int8Dtype";
    node [fillcolor="lightblue" style=filled] "pd.Int16Dtype";
    node [fillcolor="lightblue" style=filled] "pd.Int32Dtype";
    node [fillcolor="lightblue" style=filled] "pd.Int64Dtype";
    node [fillcolor="lightgray" style=filled] "pa.int8";
    node [fillcolor="lightgray" style=filled] "pa.int16";
    node [fillcolor="lightgray" style=filled] "pa.int32";
    node [fillcolor="lightgray" style=filled] "pa.int64";
    "np.int8" -> "np.int16" -> "np.int32" -> "np.int64"
    "pd.Int8Dtype" -> "pd.Int16Dtype" -> "pd.Int32Dtype" -> "pd.Int64Dtype"
    "pa.int8" -> "pa.int16" -> "pa.int32" -> "pa.int64"
  }
  "signed" -> "pd.Int8Dtype" [arrowsize=0]
  "integral" -> "unsigned"
  subgraph cluster_unsigned {
    edge [style=invis]
    node [fillcolor="lightgreen" style=filled] "np.uint8";
    node [fillcolor="lightgreen" style=filled] "np.uint16";
    node [fillcolor="lightgreen" style=filled] "np.uint32";
    node [fillcolor="lightgreen" style=filled] "np.uint64";
    node [fillcolor="lightblue" style=filled] "pd.UInt8Dtype";
    node [fillcolor="lightblue" style=filled] "pd.UInt16Dtype";
    node [fillcolor="lightblue" style=filled] "pd.UInt32Dtype";
    node [fillcolor="lightblue" style=filled] "pd.UInt64Dtype";
    node [fillcolor="lightgray" style=filled] "pa.uint8";
    node [fillcolor="lightgray" style=filled] "pa.uint16";
    node [fillcolor="lightgray" style=filled] "pa.uint32";
    node [fillcolor="lightgray" style=filled] "pa.uint64";
    "np.uint8" -> "np.uint16" -> "np.uint32" -> "np.uint64"
    "pd.UInt8Dtype" -> "pd.UInt16Dtype" -> "pd.UInt32Dtype" -> "pd.UInt64Dtype"
    "pa.uint8" -> "pa.uint16" -> "pa.uint32" -> "pa.uint64"
  }
  "unsigned" -> "pd.UInt8Dtype" [arrowsize=0]
  "numeric" -> "floating point"
  subgraph cluster_floating {
    edge [style=invis]
    node [fillcolor="lightgreen" style=filled] "np.float32";
    node [fillcolor="lightgreen" style=filled] "np.float64";
    node [fillcolor="lightblue" style=filled] "pd.Float32Dtype";
    node [fillcolor="lightblue" style=filled] "pd.Float64Dtype";
    node [fillcolor="lightgray" style=filled] "pa.float32";
    node [fillcolor="lightgray" style=filled] "pa.float64";
    "np.float32" -> "np.float64"
    "pd.Float32Dtype" -> "pd.Float64Dtype"
    "pa.float32" -> "pa.float64"
  }
  "floating point" -> "pd.Float32Dtype" [arrowsize=0]
  "numeric" -> "fixed point"
  subgraph cluster_fixed {
    edge [style=invis]
    node [fillcolor="lightgray" style=filled] "pa.decimal128";
    node [fillcolor="lightgray" style=filled] "pa.decimal256";
    "pa.decimal128" -> "pa.decimal256"
  }
  "fixed point" -> "pa.decimal128" [arrowsize=0]
  "scalar" -> "boolean"
  subgraph cluster_boolean {
    edge [style=invis]
    node [fillcolor="lightgreen" style=filled] "np.bool_";
    node [fillcolor="lightblue" style=filled] "pd.BooleanDtype";
    node [fillcolor="lightgray" style=filled] "pa.bool_";
  }
  "boolean" -> "pd.BooleanDtype" [arrowsize=0]
  "scalar" -> "temporal"
  "temporal" -> "date"
  subgraph cluster_date {
    edge [style=invis]
    node [fillcolor="lightgray" style=filled] "pa.date32"
    node [fillcolor="lightgray" style=filled] "pa.date64"
    "pa.date32" -> "pa.date64"
  }
  "date" -> "pa.date32" [arrowsize=0]
  "temporal" -> "datetime"
  subgraph cluster_timestamp {
    edge [style=invis]
    node [fillcolor="lightblue" style=filled] "datetime64[unit, tz]";
    node [fillcolor="lightgray" style=filled] "pa.timestamp(unit, tz)";
    "datetime64[unit, tz]" -> "pa.timestamp(unit, tz)" [style=invis]
  }
  "datetime" -> "datetime64[unit, tz]" [arrowsize=0]
  "temporal" -> "duration"
  subgraph cluster_duration {
    edge [style=invis]
    node [fillcolor="lightblue" style=filled] "timedelta64[unit]";
    node [fillcolor="lightgray" style=filled] "pa.duration(unit)";
    "timedelta64[unit]" -> "pa.duration(unit)" [style=invis]
  }
  "duration" -> "timedelta64[unit]" [arrowsize=0]
  "temporal" -> "interval"
  "pa.month_day_nano_interval" [fillcolor="lightgray" style=filled]
  "interval" -> "pa.month_day_nano_interval"
  "scalar" -> "binary"
  subgraph cluster_binary {
    edge [style=invis]
    node [fillcolor="lightgray" style=filled] "pa.binary";
    node [fillcolor="lightgray" style=filled] "pa.large_binary";
    "pa.binary" -> "pa.large_binary"
  }
  "binary" -> "pa.binary"
  "binary" -> "string"
  subgraph cluster_string {
    edge [style=invis]
    node [fillcolor="lightgreen" style=filled] "object";
    node [fillcolor="lightgreen" style=filled] "np.StringDType";
    node [fillcolor="lightblue" style=filled] "pd.StringDtype";
    node [fillcolor="lightgray" style=filled] "pa.string";
    node [fillcolor="lightgray" style=filled] "pa.large_string";
    node [fillcolor="lightgray:lightgreen" style=filled] "string[pyarrow_numpy]";
    "object" -> "np.StringDType"
    "pa.string" -> "pa.large_string"
  }
  "string" -> "pa.string" [arrowsize=0]
  "scalar" -> "categorical"
  subgraph cluster_categorical {
    edge [style=invis]
    node [fillcolor="lightblue" style=filled] "pd.CategoricalDtype";
    node [fillcolor="lightgray" style=filled] "pa.dictionary(index_type, value_type)";
    "pd.CategoricalDtype" -> "pa.dictionary(index_type, value_type)"
  }
  "categorical" -> "pd.CategoricalDtype" [arrowsize=0]
  "scalar" -> "sparse"
  "pd.SparseDtype(dtype)" [fillcolor="lightblue" style=filled];
  "sparse" -> "pd.SparseDtype(dtype)" [arrowsize=0]
  "type" -> "aggregate"
  "aggregate" -> "list"
  subgraph cluster_list {
    edge [style=invis]
    node [fillcolor="lightgray" style=filled] "pa.list_(value_type)";
    node [fillcolor="lightgray" style=filled] "pa.large_list(value_type)";
    "pa.list_(value_type)" -> "pa.large_list(value_type)"
  }
  "list" -> "pa.list_(value_type)" [arrowsize=0]
  "aggregate" -> "struct"
  "pa.struct(fields)" [fillcolor="lightgray" style=filled]
  "struct" -> "pa.struct(fields)" [arrowsize=0]
  "aggregate" -> "dictionary"
  "dictionary" -> "pa.dictionary(index_type, value_type)" [arrowsize=0]
  "pa.map(index_type, value_type)" [fillcolor="lightgray" style=filled]
  "dictionary" -> "pa.map(index_type, value_type)" [arrowsize=0]
}

Dr-Irv commented 4 months ago

I am wondering now if it is even worth trying to support string aliases or if we should push users towards using a more explicit dtype construction. This would be a change from where we are today but could be better in the long run (?)

From a typing perspective, supporting all the different string versions of valid types for dtype is a PITA in pandas-stubs. So I'd be supportive of just having a class hierarchy to represent valid dtypes.

Having said that, if we are to deprecate the strings, we'd probably need a PDEP for that....

mroeschke commented 4 months ago

I am wondering now if it is even worth trying to support string aliases or if we should push users towards using a more explicit dtype construction.

I would be supportive of this as well. Especially for dtypes specified as strings that take parameters (timezone types, decimal types), it would be great to avoid having to parse strings into dtype objects.
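
A minimal illustration of the difference, using only constructs that already exist today; the string spelling has to be parsed, while the explicit spelling passes the parameters directly:

import pandas as pd
import pyarrow as pa

# String alias: the unit has to be parsed back out of the string
pd.Series([], dtype="timestamp[us][pyarrow]")

# Explicit construction: parameters are passed as arguments, nothing to parse
pd.Series([], dtype=pd.ArrowDtype(pa.timestamp("us", tz="UTC")))
pd.Series([], dtype=pd.ArrowDtype(pa.decimal128(7, 3)))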

jorisvandenbossche commented 4 months ago

In any case I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.

To your original point, I very much agree with this (at least for the physical storage, not necessarily for nullability semantics because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP)

This is a topic that I brought up last summer during the sprint, but never got around to writing up publicly. The summary is that I would like to see us move to just having "pandas" dtypes, at least for the majority of users who don't need to know the lower-level details. Most users just need to know they have e.g. an "int64" or "string" column, and don't have to care whether that is stored under the hood using a single numpy array, a combo of numpy arrays (our masked arrays), or a pyarrow array.

The current string aliases for non-default dtypes are, I think, mostly a band-aid to let people more easily specify those dtypes, and I fully agree those aren't very pretty. I do think it will be hard (and perhaps not even desirable) to fully do away with string aliases, at least for the default data types, because they are so widespread. But IMO we should at least make the alternative to string aliases (constructing dtypes programmatically) better supported and more consistent, e.g. so a user can just do pd.Series(..., dtype=pd.int64()) or pd.Series(..., dtype=pd.Int64Dtype()) and get the default int64 dtype based on their settings (which currently is the numpy dtype, but could also be a masked or pyarrow dtype, depending on those settings).

WillAyd commented 4 months ago

So maybe then for each category in the type hierarchy above we have wrappers with signatures like:

class pd.int8(dtype_backend="pyarrow"): ...
class pd.string(dtype_backend="pyarrow", nullability="numpy"): ...
class pd.datetime(dtype_backend="pyarrow", unit="us", tz=None): ...
class pd.list(value_type, dtype_backend="pyarrow"): ...
class pd.categorical(key_type="infer", value_type="infer", dtype_backend="pandas"): ...
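
To make this concrete, a rough and purely hypothetical sketch (none of these wrappers exist in pandas today) of how such a pd.string wrapper could dispatch to the dtypes that already exist:

import pandas as pd

# Hypothetical wrapper, not pandas API: dispatch to existing dtypes by backend.
def string(dtype_backend: str = "pyarrow", nullability: str = "pandas"):
    if dtype_backend == "pyarrow" and nullability == "numpy":
        return "string[pyarrow_numpy]"            # pyarrow storage, NaN semantics
    if dtype_backend == "pyarrow":
        return pd.StringDtype(storage="pyarrow")  # pyarrow storage, pd.NA semantics
    if dtype_backend == "numpy":
        return pd.StringDtype(storage="python")   # object storage, pd.NA semantics
    raise ValueError(f"unsupported dtype_backend: {dtype_backend}")

ser = pd.Series(["a", "b"], dtype=string(dtype_backend="pyarrow"))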

I know @jbrockmendel prefers something besides dtype_backend but keeping that now for consistency with the I/O methods.

Having said that, if we are to deprecate the strings, we'd probably need a PDEP for that....

I was thinking this as well

WillAyd commented 4 months ago

The current string aliases for non-default dtypes are, I think, mostly a band-aid to let people more easily specify those dtypes, and I fully agree those aren't very pretty. I do think it will be hard (and perhaps not even desirable) to fully do away with string aliases, at least for the default data types, because they are so widespread.

Yea this would be a long process. I think what's hard about the string alias is that it only works for very basic types. It definitely has been and would continue to be relatively easy for users to just say "int64" and get a 64 bit integer irrespective of what that is backed by, but if the user wants to then create a list column they can't just do "list".

I think users will end up with a Frankenstein mix of string aliases alongside arguments like dtype=pd.ArrowDtype(pa.list_(pa.string())), which I find confusing.
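
For example, this is the kind of mixed spelling users already end up with today (a string alias and an explicit dtype object side by side):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2], "b": [["x"], ["y"]]})
df = df.astype({"a": "int64[pyarrow]", "b": pd.ArrowDtype(pa.list_(pa.string()))})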

Dr-Irv commented 4 months ago

Yea this would be a long process. I think what's hard about the string alias is that it only works for very basic types. It definitely has been and would continue to be relatively easy for users to just say "int64" and get a 64 bit integer irrespective of what that is backed by, but if the user wants to then create a list column they can't just do "list".

I think users will end up with a Frankenstein mix of string aliases alongside arguments like dtype=pd.ArrowDtype(pa.list_(pa.string())), which I find confusing.

I agree. One possibility to consider is to limit the number of string aliases to simple types ("int", "float", "string", "object", "datetime", "timedelta"), which would default to a backend, and even a size (e.g., "int" means "int64"), based on the configured defaults, as I'd guess only a few of the strings are really used that often.

jorisvandenbossche commented 4 months ago

I found the notebook that I presented at the sprint last summer. It's a little bit off topic for a discussion just about string aliases, but I think it is relevant for the bigger picture (which we need to look at anyway if we are considering moving away from string aliases), so just dumping the content here (updated a little bit).


I'd like to have "pandas data types" with a consistent interface:

For example, for datetime-like data, we currently have:

# current options
ser.astype("datetime64[ms]") 
# vs
ser.astype("timestamp[us][pyarrow]")

# How to specify a "datetime" dtype being agnostic about the exact backend you are using?
# -> should use a single name and will pick the default backend based on your settings
ser.astype(pd.datetime("ns"))
# or
ser.astype(pd.timestamp("ns"))
# for users that want to be explicit
ser.astype(pd.datetime("ns", backend=".."))

Another example: we currently have pd.ArrowDtype(pa.date64()) or "date64[pyarrow]", but if we want to enable a date dtype by default, users shouldn't need to know this is stored using pyarrow under the hood, so this could be pd.date() or "date"?


Logical data types vs physical data types:

For pandas, I think most users should care about logical data types, and not too much about the physical data type (and we can choose the best default, and advanced users can give hints which to use for performance optimizations)


Assuming we want a single pandas interface to all dtypes, we need to decide:

Either we can use "backend-parametrized" classes, or we can hide the classes a bit more and use dtype constructor factory functions:

pd.StringDtype(), pd.StringDtype(backend="arrow"), pd.StringDtype(backend="numpy")
isinstance(dtype, pd.StringDtype)

-> but that means choosing the approach of the current StringDtype with different backends, instead of the ArrowDtype("string") approach

or we could have different classes but then we definitely need the functional interface and dtype-checking helpers (because isinstance then doesn't work):

pd.string(), pd.string(backend="arrow"), pd.string(backend="numpy")
pd.api.types.is_string(..)

(and maybe pd.string(backend="arrow", storage="string_view")?)

In this case we are more free to keep whatever class structure we want under the hood.
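
A rough sketch with illustrative classes only (not pandas API) of why the second option needs dtype-checking helpers, while the first keeps isinstance working:

# Option 1: one backend-parametrized class, so isinstance checks keep working
class StringDtype:
    def __init__(self, backend: str = "numpy"):
        self.backend = backend

assert isinstance(StringDtype(backend="arrow"), StringDtype)

# Option 2: separate classes per backend, so a functional constructor plus an
# is_string helper replace the isinstance check
class NumpyStringDtype: ...
class ArrowStringDtype: ...

def string(backend: str = "numpy"):
    return {"numpy": NumpyStringDtype, "arrow": ArrowStringDtype}[backend]()

def is_string(dtype) -> bool:
    return isinstance(dtype, (NumpyStringDtype, ArrowStringDtype))

assert is_string(string(backend="arrow"))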

jbrockmendel commented 4 months ago

I forget the details, but remember finding Joris's presentation at the sprint compelling.

WillAyd commented 4 months ago

Another example: we currently have pd.ArrowDtype(pa.date64()) or "date64[pyarrow]", but if we want to enable a date dtype by default, users shouldn't need to know this is stored using pyarrow under the hood, so this could be pd.date() or "date"?

This is an interesting example, but do we even need to support the pyarrow date64? I'm not really clear what advantages that has over date32. Per the hierarchy above I would just abstract this as pd.date() which under the hood would only use pyarrow's date32. It would be a suboptimal API if we had to do something like pd.date(backend="pyarrow", size=32) but I'm not sure how likely that is.

Outside of date types, I do see that issue with strings, where dtype_backend="pyarrow" would leave it open to interpretation whether you wanted pa.string(), pa.large_string(), or any of the other pyarrow string types you already mentioned.

Either we can use "backend-parametrized" classes, or we can hide the classes a bit more and use dtype constructor factory functions:

In an ideal world I would be indifferent, but the problem with the class constructors is that they already exist (pd.StringDtype, pd.Int64Dtype, etc...). Repurposing them might only add to the confusion

Overall though I agree with your sentiment of starting at a place where we think in terms of logical data types foremost, which should cover the majority of use cases, and then giving some control over the physical data types via keyword arguments or options

WillAyd commented 4 months ago

To your original point, I very much agree with this (at least for the physical storage, not necessarily for nullability semantics because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP)

Is this in reference to how nulls are stored or how they are expressed to the end user? Storage-wise I feel like it would be a mistake to stray from the Arrow implementation

jorisvandenbossche commented 4 months ago

Either we can use "backend-parametrized" classes, or we can hide the classes a bit more and use dtype constructor factory functions:

In an ideal world I would be indifferent, but the problem with the class constructors is that they already exist (pd.StringDtype, pd.Int64Dtype, etc...). Repurposing them might only add to the confusion

I think we already have both types somewhat, so we will need to clean this up to a certain extent whichever choice we make:

So while those class constructors indeed already exist, I think we have to repurpose or change the existing ones (and add new ones) to some extent anyway. And just because we have those classes right now doesn't mean we can't decide to hide them more from the user by providing an alternative. I don't think there are that many users who use pd.Int64Dtype() directly yet, and (if we would prefer that interface) there is certainly still room to start pushing a functional constructor interface.

jorisvandenbossche commented 4 months ago

To your original point, I very much agree with this (at least for the physical storage, not necessarily for nullability semantics because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP)

Is this in reference to how nulls are stored or how they are expressed to the end user? Storage-wise I feel like it would be a mistake to stray from the Arrow implementation

In the first place, to how they are expressed to the end user, because IMO that's the most important aspect (since we are talking about the user interface for how dtypes are specified / presented). Personally I would also prefer to use a consistent implementation storage-wise, but that's more of an implementation detail that we could discuss / compromise on per dtype.