ENH: Need API support and __repr__ to discover the storage used for strings

arnaudlegout commented 1 month ago

Originally raised in https://github.com/pandas-dev/pandas/pull/58551#discussion_r1662680953

Problem Description

With PDEP-14, developers need to be aware of the storage used for strings, because the storage can have a large impact on performance. For instance:

pyarrow storage
- pros: compact (optimal memory footprint), fast (vectorization)
- cons: immutable (so any modification creates a new pyarrow ChunkedArray of strings)
python storage
- pros: mutable
- cons: highest memory footprint (each string is a different Python object), slow (no vectorization)
numpy 2.0 strings storage (I don't have a good knowledge of these new strings, and never tested them)
- pros: compact, vectorization, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
- cons: different representations depending on the string size, which makes understanding performance harder

Feature Description

I would like to have two ways to discover the storage:

__repr__: its goal is to give information on the internals of an object. One option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage].
.get_storage: a method that returns the storage (I am not sure what is possible with the current implementation; a class would be best, otherwise a string). This API is useful to check that we have the correct storage before running time-consuming code.

Alternative Solutions

.

Additional Context

No response

jorisvandenbossche commented 1 month ago

@arnaudlegout thanks for opening the issue!

First quick note: the numpy 2.0 string dtype is not supported in pd.StringDtype at the moment (but could be in the future), so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype).

Then, the API to inspect and discover the storage is actually already available, as the .storage attribute on the StringDtype instance:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'
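As a minimal sketch of the "check before running time-consuming code" use case from the original request, the .storage attribute can be wrapped in a small guard. The helper name `require_storage` is hypothetical and not part of pandas; the example uses the "python" storage so that it runs without pyarrow installed.

```python
import pandas as pd


def require_storage(ser: pd.Series, expected: str) -> None:
    """Raise early if `ser` is not a string Series backed by `expected` storage.

    Hypothetical helper for illustration; pandas does not ship this function.
    """
    # Dtypes other than StringDtype have no `.storage` attribute, so fall back to None.
    storage = getattr(ser.dtype, "storage", None)
    if storage != expected:
        raise TypeError(
            f"expected string storage {expected!r}, got {storage!r} "
            f"(dtype={ser.dtype!r})"
        )


# Explicitly request the object-backed storage, then verify it before heavy work.
ser = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
require_storage(ser, "python")  # passes silently
```

Using getattr keeps the guard usable on any Series, since a non-string dtype simply fails the comparison instead of raising AttributeError.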

So I think the main discussion is what the __repr__ should look like.

WillAyd commented 1 month ago

I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. By putting it into the repr instead it becomes a little clearer to developers what they might need to consider
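To make the proposal concrete, here is a rough sketch of what a repr carrying both storage and na_value could look like, built only from the existing .storage and .na_value attributes. The function name `verbose_dtype_repr` and the exact output format are assumptions for illustration, not the repr pandas actually implements.

```python
import pandas as pd


def verbose_dtype_repr(dtype) -> str:
    """Build a more explicit description of a string dtype.

    Hypothetical helper; the format only mirrors the suggestion in this thread.
    """
    storage = getattr(dtype, "storage", None)
    if storage is None:
        # Not a StringDtype-like object: keep the regular repr.
        return repr(dtype)
    na_value = getattr(dtype, "na_value", None)
    return f"<pandas.StringDtype(storage={storage!r}, na_value={na_value!r})>"


dtype = pd.StringDtype(storage="python")
print(verbose_dtype_repr(dtype))
```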

arnaudlegout commented 2 weeks ago

@WillAyd right, I was not aware of the .storage attribute and indeed getting information on the na_value is interesting.

I did not find the .storage in the pandas documentation, so it would be great to also complement the documentation to show the available attributes to inspect the storage properties.

pantheraleo-7 commented 3 hours ago

so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype)

I think the name "python" for the fallback storage option is not future proof? If I'm reading the PDEP-14 right, the fallback is a numpy array of python str objects. So the fallback storage option name should be "numpy".

when numpy 2.0 strings are implemented as a fallback, the name "python" won't make sense anymore
it kinda doesn't make sense even right now, because we are storing those objects in a numpy array anyway
also, the names "pyarrow" and "numpy" would complement each other better ig

numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future)

do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. Forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will practically lessen the importance of a numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0.

pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would most likely also shift to a newer major version of numpy.

but if we still don't want to force numpy 2.0, we could have an intermediate fallback no? use pyarrow if installed >>> use numpy 2.0 str dtype if numpy>=2.0 is installed >>> use numpy object dtype
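The three-step fallback suggested above could be sketched roughly as follows. This is not how pandas selects its storage; the function names and the backend labels "numpy-string" and "numpy-object" are made up for illustration, and only the standard library is used for detection.

```python
from importlib import metadata
from importlib.util import find_spec


def choose_backend(has_pyarrow: bool, numpy_major: int) -> str:
    """Pure decision logic for the proposed fallback chain (labels are hypothetical)."""
    if has_pyarrow:
        return "pyarrow"
    if numpy_major >= 2:
        return "numpy-string"  # numpy 2.0 string dtype
    return "numpy-object"  # numpy array of Python str objects


def detect_backend() -> str:
    """Inspect the current environment and apply the chain above."""
    has_pyarrow = find_spec("pyarrow") is not None
    try:
        numpy_major = int(metadata.version("numpy").split(".")[0])
    except metadata.PackageNotFoundError:
        numpy_major = 0  # numpy not installed at all
    return choose_backend(has_pyarrow, numpy_major)


print(detect_backend())
```

Keeping the decision in a pure function separates the policy from the environment probing, which makes the ordering easy to test.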

basically I'm saying we should fast track numpy 2.0 string implementation xD

WillAyd commented 3 hours ago

I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another, and to define what its semantics are.
