Open arnaudlegout opened 1 month ago
@arnaudlegout thanks for opening the issue!
First quick note: numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future), so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype).
Then, the API to inspect and discover the storage is actually already available, as the .storage
attribute on the StringDtype instance:
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'
So I think the main discussion is what the __repr__ should look like.
I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. Putting it into the repr instead makes it a little clearer to developers what they might need to consider.
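To make the repr idea concrete, here is a minimal sketch of what a repr carrying both properties could look like. StringDtypeSketch is a hypothetical stand-in written only for illustration; it is not the pandas class, and the exact output format is an assumption, not a decided design.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class StringDtypeSketch:
    """Hypothetical stand-in for pd.StringDtype; not the pandas class."""

    storage: str = "python"
    na_value: object = math.nan

    def __repr__(self) -> str:
        # Spell out both properties so readers do not need to know
        # the attribute names in advance.
        return (
            f"<StringDtype(storage={self.storage!r}, "
            f"na_value={self.na_value!r})>"
        )


print(repr(StringDtypeSketch(storage="pyarrow")))
# <StringDtype(storage='pyarrow', na_value=nan)>
```

With a repr like this, the storage and the missing-value sentinel are visible as soon as the dtype is printed.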
@WillAyd right, I was not aware of the .storage attribute, and indeed getting information on the na_value is interesting.
I did not find .storage in the pandas documentation, so it would be great to also extend the documentation to show the available attributes for inspecting the storage properties.
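As a rough sketch of how the .storage attribute mentioned above could be used to guard a slow code path, something like the following might work. require_storage is a hypothetical helper name, not a pandas function, and the SimpleNamespace object only stands in for a real dtype so the snippet runs without pandas installed.

```python
from types import SimpleNamespace


def require_storage(dtype, expected: str) -> None:
    """Raise if `dtype` does not report the expected string storage."""
    # .storage is the attribute shown earlier in this thread; a dtype
    # object that lacks the attribute yields None here.
    actual = getattr(dtype, "storage", None)
    if actual != expected:
        raise TypeError(
            f"expected {expected!r} string storage, got {actual!r}"
        )


# Stand-in object (an assumption for illustration only).
fake_dtype = SimpleNamespace(storage="pyarrow")
require_storage(fake_dtype, "pyarrow")  # passes silently
```

With pandas available, the same call would presumably be require_storage(ser.dtype, "pyarrow"), placed before the time-consuming part of the code.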
so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype)
I think the name "python" for the fallback storage option is not future-proof? If I'm reading PDEP-14 right, the fallback is a numpy array of python str objects. So the fallback storage option name should be "numpy".
numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future)
Do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. Forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will, in practice, lessen the importance of a numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0.
pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would most likely also shift to a newer major version of numpy.
But if we still don't want to force numpy 2.0, we could have an intermediate fallback, no? use pyarrow if installed >>> use numpy 2.0 str dtype if numpy>=2.0 is installed >>> use numpy object dtype
basically I'm saying we should fast track numpy 2.0 string implementation xD
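The fallback order proposed above could be sketched roughly as follows. This is only an illustration of the suggested priority, not how pandas selects its storage; pick_string_backend and the returned labels are hypothetical names.

```python
from typing import Optional, Tuple


def pick_string_backend(
    has_pyarrow: bool, numpy_version: Optional[Tuple[int, int]]
) -> str:
    """Return the first available backend in the proposed order."""
    if has_pyarrow:
        return "pyarrow"
    if numpy_version is not None and numpy_version >= (2, 0):
        # Hypothetical label for the numpy 2.0 string dtype.
        return "numpy-string"
    # Hypothetical label for the numpy object-dtype fallback.
    return "numpy-object"


print(pick_string_backend(True, (1, 26)))   # pyarrow
print(pick_string_backend(False, (2, 0)))   # numpy-string
print(pick_string_backend(False, (1, 26)))  # numpy-object
```

Passing the availability flags in as arguments keeps the priority logic separate from how installed packages are detected.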
I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another, and to define what its semantics are.
Originally raised in https://github.com/pandas-dev/pandas/pull/58551#discussion_r1662680953
Problem Description
With PDEP-14 there is the need for developers to be aware of the storage used for strings. Indeed, the storage might have a lot of impact on performance, for instance
- pyarrow storage (ChunkedArray)
- python storage
- numpy 2.0 strings storage (I don't have a good knowledge of these new strings, and never tested them)
Feature Description
I would like to have two ways to discover the storage
- __repr__: its goal is to give information on the internals of an object; one option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage]
- an API such as .get_storage that returns the storage (not sure what is possible with the current implementation; a class would be best, otherwise a string). The API is useful to check that we have the correct storage before running time-consuming code.
Alternative Solutions
.
Additional Context
No response