pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: public way to get the "best values" of a Series? #29267

Open jorisvandenbossche opened 4 years ago

jorisvandenbossche commented 4 years ago

For a use case in pyarrow, I need to get the underlying values of a Series: an ExtensionArray if it is backed by one, or otherwise the numpy array.

Is there a public API to get this? We have Series.array, but this always returns an ExtensionArray, and we have Series.values, but this returns a numpy array for e.g. periods (for historical reasons).
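As a rough illustration of that difference (a minimal sketch, assuming a reasonably recent pandas; the variable names are only illustrative), this is what the two public accessors return for a period column versus a plain integer column:

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

periods = pd.Series(pd.period_range("2019-01", periods=3, freq="M"))
ints = pd.Series([1, 2, 3])

# .array always hands back an ExtensionArray, even for NumPy-backed data
# (in that case a thin wrapper around the ndarray).
period_array = periods.array
int_array = ints.array

# .values returns an ndarray in both cases; for periods it is an
# object-dtype array of Period scalars, so the period dtype is lost.
period_values = periods.values
int_values = ints.values

print(type(period_array).__name__, type(int_array).__name__)
print(type(period_values).__name__, period_values.dtype)
print(type(int_values).__name__, int_values.dtype)
```

Neither accessor gives "ExtensionArray if there is one, plain ndarray otherwise" for both columns, which is the gap described above.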

For Index, we have the private Index._values described in the docstring as the "best array representation". And I think Series._values is somewhat similar.

But the question is: do we want a public way to get to this? I am personally not sure we should, as there are still dubious cases (like datetime64/timedelta64; for this one, Index._values and Series._values are actually different ...).

But if we don't add it, do we have a recommended way for external projects to do this? (Basically something like extract_array(..., extract_numpy=True)?)
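For reference, a small sketch of what that call does today. Note the assumption: `extract_array` lives in the internal module `pandas.core.construction`, so this import path is not public API and may move or change between pandas versions.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Assumption: internal import path, not part of the public pandas API.
from pandas.core.construction import extract_array

periods = pd.Series(pd.period_range("2019-01", periods=3, freq="M"))
ints = pd.Series([1, 2, 3])

# Extension-backed Series: the ExtensionArray comes back with its dtype intact.
period_result = extract_array(periods, extract_numpy=True)

# NumPy-backed Series: extract_numpy=True unwraps to the plain ndarray
# instead of the ExtensionArray wrapper that .array would return.
int_result = extract_array(ints, extract_numpy=True)

print(type(period_result).__name__, period_result.dtype)
print(type(int_result).__name__, int_result.dtype)
```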

cc @TomAugspurger @jbrockmendel @jreback

TomAugspurger commented 4 years ago

I think "best values" is a bit too fuzzy of a concept for a public API. It depends too much on the context of what you're doing with the array (Do you need an ndarray? Do you need zero-copy? etc.)

jorisvandenbossche commented 4 years ago

Yeah, I know. It's just that, while doing my arrow PR, I realized that the cause of a bug was us doing series.values, but that neither series.array nor series._values would solve it (neither of them gives the specific behaviour that I want in this case), and I was thinking "what a mess we made in pandas" ;-)

But indeed, given the specific requirements, probably best that everybody handles this in a custom way. I will probably end up with something like (but then much more complex to handle different pandas versions):

# we already know obj is a series
if isinstance(obj.dtype, (pd.PeriodDtype, pd.IntervalDtype)):
    return obj.array
else:
    return obj.values

because "new" EAs (integer, string) already get returned from .values, while the "old" datetimetz was already handled in a special case anyway (and the ndarray only loses the tz information). So in the end it was only for periods/intervals that I lost the dtype.
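A self-contained sketch of that dispatch, to show what each branch returns. The helper name `values_for_conversion` is hypothetical, and the version handling mentioned above is left out:

```python
import numpy as np
import pandas as pd


def values_for_conversion(ser: pd.Series):
    """Hypothetical helper: use .array only where .values drops the dtype."""
    if isinstance(ser.dtype, (pd.PeriodDtype, pd.IntervalDtype)):
        return ser.array
    return ser.values


period_result = values_for_conversion(
    pd.Series(pd.period_range("2019-01", periods=2, freq="M"))
)
interval_result = values_for_conversion(pd.Series(pd.interval_range(0, 2)))
nullable_result = values_for_conversion(pd.Series([1, None], dtype="Int64"))
tz_result = values_for_conversion(
    pd.Series(pd.date_range("2019-01-01", periods=2, tz="Europe/Brussels"))
)

for result in (period_result, interval_result, nullable_result, tz_result):
    print(type(result).__name__, result.dtype)
```

Periods and intervals keep their ExtensionArray, the nullable integer column is already an ExtensionArray via .values, and the tz-aware column comes back as a datetime64 ndarray without the timezone.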

jbrockmendel commented 4 years ago

I think of extract_array as being the best option, at least internally, since it is non-lossy and non-costly. But I think it's pretty similar to ser._values, so not exactly what you're looking for.

WillAyd commented 4 years ago

> Series.values, but this returns a numpy array for e.g. periods (for historical reasons).

What are those historical reasons? Wondering if from an API perspective it might work to make array -> ExtensionArray, to_numpy -> NumPy array and .values one or the other. Adding another item here might just make things even more confusing.

jorisvandenbossche commented 4 years ago

> What are those historical reasons?

We had Periods before we had ExtensionArrays, so back then, if you stored periods in a column, .values would give an object array. When we introduced EAs, we decided to keep that behaviour (and that is one of the reasons we have .array).

> Wondering if from an API perspective it might work to make array -> ExtensionArray, to_numpy -> NumPy array and .values one or the other.

That's already the case right now. The main problem is that the ".values one or the other" is a bit inconsistent (due to the historical reasons above) about when it returns one and when the other.
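To see where the inconsistency shows up, here is a small sketch (assuming a recent pandas; the labels and the `values_is_ndarray` name are only illustrative) that records, per dtype, whether .values is an ndarray or an ExtensionArray, next to the always-consistent .array and .to_numpy():

```python
import numpy as np
import pandas as pd

examples = {
    "int64": pd.Series([1, 2]),
    "Int64 (nullable)": pd.Series([1, None], dtype="Int64"),
    "category": pd.Series(["a", "b"], dtype="category"),
    "period": pd.Series(pd.period_range("2019-01", periods=2, freq="M")),
    "datetime64 with tz": pd.Series(
        pd.date_range("2019-01-01", periods=2, tz="UTC")
    ),
}

# Per dtype: is .values an ndarray (True) or an ExtensionArray (False)?
values_is_ndarray = {
    label: isinstance(ser.values, np.ndarray) for label, ser in examples.items()
}

for label, ser in examples.items():
    print(
        f"{label:>20}: .array={type(ser.array).__name__}, "
        f".to_numpy()={type(ser.to_numpy()).__name__}, "
        f".values={type(ser.values).__name__}"
    )
```

Period and tz-aware datetime columns are extension-backed yet still give an ndarray from .values, while nullable integer and categorical columns give an ExtensionArray.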

> Adding another item here might just make things even more confusing

Yes, that's for sure. But a public API does not necessarily need to be an attribute on the object. We could also expose e.g. something like extract_array in pandas.api.extensions.

Now, given the above discussion, I agree that exactly which one you want is probably rather application dependent. So maybe it is fine to leave it as is.

jbrockmendel commented 4 years ago

> Adding another item here might just make things even more confusing

This is the thing I agree with the most. We've also got _ndarray_values and values_from_object.

jorisvandenbossche commented 4 years ago

> But I think it's pretty similar to ser._values, so not exactly what you're looking for

That's not a public method.

> This is the thing I agree with the most. We've also got _ndarray_values and values_from_object

To repeat from above: it doesn't need to be an attribute. We could also expose extract_array publicly in pandas.api.types or pandas.api.extensions.