pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Need API support and __repr__ to discover the storage used for strings #59342

Open arnaudlegout opened 1 month ago

arnaudlegout commented 1 month ago

Originally raised in https://github.com/pandas-dev/pandas/pull/58551#discussion_r1662680953

Problem Description

With PDEP-14, developers need to be aware of the storage used for strings. Indeed, the storage can have a large impact on performance, for instance

Feature Description

I would like to have two ways to discover the storage

Alternative Solutions

.

Additional Context

No response

jorisvandenbossche commented 1 month ago

@arnaudlegout thanks for opening the issue!

First quick note: the numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future), so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype).

Then, the API to inspect and discover the storage is actually already available, as the .storage attribute on the StringDtype instance:

>>> import pandas as pd
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'

So I think the main discussion is what the __repr__ should look like.

WillAyd commented 1 month ago

I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. Putting it into the repr instead makes it a little clearer to developers what they might need to consider.
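
To make the suggestion concrete, here is a minimal sketch of what a more verbose description could look like. `describe_string_dtype` is a hypothetical helper, not a pandas API, and the output format is only an assumption for discussion. It reads the `.storage` and `.na_value` attributes mentioned above, and uses a stand-in object so it runs without pandas installed.

```python
from types import SimpleNamespace


def describe_string_dtype(dtype) -> str:
    """Build a verbose, repr-like description of a string dtype.

    Hypothetical helper, not part of pandas: it only reads the
    ``storage`` and ``na_value`` attributes when they are present.
    """
    name = getattr(dtype, "name", type(dtype).__name__)
    storage = getattr(dtype, "storage", None)
    if storage is None:
        # The dtype does not expose a storage (e.g. plain object dtype).
        return str(name)
    na_value = getattr(dtype, "na_value", None)
    return f"{name}(storage={storage!r}, na_value={na_value!r})"


# Stand-in for a real dtype so the sketch runs without pandas installed.
fake_dtype = SimpleNamespace(name="str", storage="pyarrow", na_value=float("nan"))
print(describe_string_dtype(fake_dtype))  # str(storage='pyarrow', na_value=nan)
```

With pandas available, the same helper could presumably be called as `describe_string_dtype(ser.dtype)` on a string Series.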

arnaudlegout commented 2 weeks ago

@WillAyd right, I was not aware of the .storage attribute and indeed getting information on the na_value is interesting.

I did not find the .storage attribute in the pandas documentation, so it would be great to also extend the documentation to show the attributes available for inspecting the storage properties.

pantheraleo-7 commented 3 hours ago

> so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype)

I think the name "python" for the fallback storage option is not future proof? If I'm reading PDEP-14 right, the fallback is a numpy array of python str objects. So the fallback storage option name should be "numpy".

> numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future)

Do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. Forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will in practice lessen the importance of the numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0.

pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would most likely also shift to a newer major version of numpy.

But if we still don't want to force numpy 2.0, we could have an intermediate fallback, no? Use pyarrow if installed, else use the numpy 2.0 str dtype if numpy>=2.0 is installed, else use the numpy object dtype.
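
The fallback order proposed above can be sketched roughly as follows. This is only an illustration of the idea, not something pandas implements: `choose_string_storage` and `detect_environment` are hypothetical names, and the `"numpy-string"` label is an assumption (the thread only names `"pyarrow"` and `"python"`).

```python
from importlib import metadata, util
from typing import Optional, Tuple


def choose_string_storage(has_pyarrow: bool, numpy_version: Optional[str]) -> str:
    """Pick a storage name following the proposed three-step fallback."""
    if has_pyarrow:
        return "pyarrow"
    if numpy_version is not None:
        # Only the major version matters for this check.
        major = int(numpy_version.split(".")[0])
        if major >= 2:
            return "numpy-string"  # hypothetical label, not a pandas option
    # Last resort: numpy object array of Python str ("python" in the thread).
    return "python"


def detect_environment() -> Tuple[bool, Optional[str]]:
    """Report whether pyarrow is importable and which numpy is installed."""
    has_pyarrow = util.find_spec("pyarrow") is not None
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        numpy_version = None
    return has_pyarrow, numpy_version


print(choose_string_storage(*detect_environment()))
```

Keeping the decision in a pure function separate from the environment probing makes each branch of the chain easy to check in isolation.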

basically I'm saying we should fast track numpy 2.0 string implementation xD

WillAyd commented 3 hours ago

I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another, and to define what its semantics are.