pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.42k stars 17.84k forks source link

DISC: Consider not requiring PyArrow in 3.0 #57073

Open MarcoGorelli opened 8 months ago

MarcoGorelli commented 8 months ago

TL;DR: Don't make PyArrow required - instead, set minimum NumPy version to 2.0 and use NumPy's StringDType.

Background

In PDEP-10, it was proposed that PyArrow become a required dependency. Several reasons were given, but the most significant reason was to adopt a proper string data type, as opposed to object. This was voted on and agreed upon, but there have been some important developments since then, so I think it's warranted to reconsider.

StringDType in NumPy

There's a proposal in NumPy to add a StringDType to NumPy itself. This was brought up in the PDEP-10 discussion, but at the time was not considered significant enough to delay the PyArrow requirement because:

  1. NumPy itself might not accept its StringDType proposal.
  2. NumPy's StringDType might not come with the algorithms pandas needs.
  3. pyarrow's strings might still be significantly faster.
  4. because pandas typically supports older NumPy versions (in addition to the latest release), it would be 2+ years until pandas could use NumPy's strings.

Let's tackle these in turn:

  1. I caught up with Nathan Goldbaum (author of the StringDType proposal) today, and he's said that NEP55 will be accepted (although technically still in draft status, it has several supporters and no objectors and so realistically is going to change to "accepted" very soon).

  2. The second concern was the algorithms. Here's an excerpt of the NEP I'd like to draw attention to:

    In addition, we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs [universal functions] that will be newly available in NumPy 2.0.

    So, NEP55 not only provides a NumPy StringDType, but also efficient string algorithms.

    There's a pandas fork implementing this in pandas, which Nathan has been keeping up-to-date. Once the NumPy StringDType is merged into NumPy main (likely next week) it'll be much easier for pandas devs to test it out. Note: some parts of the fork don't yet use the ufuncs, but they will do soon, it's just a matter of updating things.

    For any ufunc that's missing, Nathan's said that now that the string ufuncs framework exists in NumPy, it's relatively straightforward to add new ones (e.g. for .str.partition). There is real funding behind this work, so it's likely to keep moving quite fast.

  3. Nathan's said he doesn't have timings to hand for this comparison, and is about to go on holiday 🌴 He'll be able to provide timings in 1-2 weeks' time though.

  4. Personally, I'd be fine with requiring NumPy 2.0 as the minimum NumPy version for pandas, if it means efficient string handling by default without the need for PyArrow. Also, Nathan Goldbaum's fork already implements this for pandas. So, no need to wait 2 years, it should just be a matter of months.

Feedback

The feedback issue makes for an interesting read: https://github.com/pandas-dev/pandas/issues/54466. Complaints seem to come mostly (as far as I can tell) from other package maintainers who are considering moving away from pandas (e.g. fairlearn).

This one surprised me, I don't think anyone had considered this one before? One could argue that it's VirusTotal's issue, but still, just wanted to bring visibility to it.

Tradeoffs

In the PDEP-10 PR it was mentioned that PyArrow could help reduce some maintenance work (which, despite some funding, still seems to be mostly volunteer-driven). Has this been investigated further? Is it still likely to be the case?

Furthermore, not requiring PyArrow would mean not being able to infer list and struct dtypes by default (at least, not without significant further work).

"No is temporary, yes is forever"

I'm not saying "never require PyArrow". I'm just saying, at this point in time, I don't think the requirement is justified. Of the proposed benefits, the most salient one is strings, and now there's a realistic alternative which doesn't require taking on an extra massive dependency.

I acknowledge that lately I've been more focused on other projects, and so don't want to come across as "I'm telling pandas what to do because I know best!" (I certainly don't).

Circumstances have changed since the PDEP-10 PR and vote, and personally I regret voting the way I did. Does anyone else feel the same?

simonjayhawkins commented 5 months ago

and the PDEP text itself also just says to use "string[pyarrow]" / "the more efficient pyarrow string type" that has been available since 1.2

That was my understanding also that we would use the dtype that had be tried and tested, had been available to users for a while and that conformed to what i like to call the pandas string API (distinguishing the python backed and pyarrow backed extension arrays from the default object dtype) which included the pandas missing value indicator and returned pandas nullable dtypes where possible.

without mentioning anything about the consequences of this choice

yes, I can see that this omission is maybe another reason why we may need to revise (re-vote on) PDEP-10 if people think they would have voted differently with this being more explicitly outlined.

For lack of specifying this, preserving the default null semantics seems the better default position to me.

Maintaining the status quo is probably a good position when there isn't consensus how to proceed. However, if we are not requiring PyArrow for 3.0 then maybe we are not yet ready to change the default string dtype.

It is also better support for other dtypes that requiring PyArrow would have given. I think that PDEP-13 maybe expects PyArrow to be required?

jorisvandenbossche commented 5 months ago

maybe we are not yet ready to change the default string dtype

Can you explain why you think this is the case?

simonjayhawkins commented 5 months ago

Well, where on the roadmap do we transition to the pandas string API or is this new string dtype considered a long term solution. Also, why do we need a fallback #58451? I assume this is only needed if PyArrow is not a required dependency and if that is the case PDEP-10 is voided, including the change in string dtype, as we cannot deliver the other dtype improvements either?

jorisvandenbossche commented 5 months ago

The current StringDtype enabled by pd.options.future.infer_string = True follows the "pandas string API" if you understand this as all the string-specific functionality (mainly the .str accessor) we have and document: https://pandas.pydata.org/docs/dev/user_guide/text.html

simonjayhawkins commented 5 months ago

So this new string dtype (with NumPy semantics) is now the long term solution for pyarrow backed arrays in pandas?

lithomas1 commented 5 months ago

I would personally prefer that the numpy string semantics be (eventually) left to the numpy StringDType and that Arrow strings should have Arrow semantics.

jorisvandenbossche commented 5 months ago

So this new string dtype (with NumPy semantics) is now the long term solution for pyarrow backed arrays in pandas?

No, it is the current, potentially temporary solution for pyarrow-backed arrays which are enabled by default (so right now this is only the pyarrow-backed string dtype). And the "potentially temporary" part is dependent on a separate discussion about how to move forward to nullable dtypes in general (not specific to strings).

I would personally prefer that the numpy string semantics be (eventually) left to the numpy StringDType and that Arrow strings should have Arrow semantics.

To be clear this is not about "string semantics". The string semantics for the different variants of our StringDtype are all the same. It is about the missing value sentinel, which is not string specific. And I will argue that long term we want to have the same missing value semantics regardless of whether a dtype is backed by numpy or pyarrow. But again that is for a separate discussion (beyond 3.0). It is unfortunate that we haven't resolved the missing values discussion before doing the string dtype, causing those additional dtypes and not-completely-orthogonal discussions that are hard to separate, but that is the reality we have to live with.

lithomas1 commented 5 months ago

To be clear this is not about "string semantics". The string semantics for the different variants of our StringDtype are all the same. It is about the missing value sentinel, which is not string specific. And I will argue that long term we want to have the same missing value semantics regardless of whether a dtype is backed by numpy or pyarrow. But again that is for a separate discussion (beyond 3.0). It is unfortunate that we haven't resolved the missing values discussion before doing the string dtype, causing those additional dtypes and not-completely-orthogonal discussions that are hard to separate, but that is the reality we have to live with.

Ah ok.

So the plan is still to move towards a StringDtype backed by pd.NA if I'm understanding you correctly?

Is part of the hesitation to go directly to a StringDtype with pd.NA as the missing value because we still support doing string operations (e.g. using the .str accessor) on object columns?

If I'm following correctly, right now we have the in terms of string arrays/dtypes,

Existing

StringArray (uses object dtype - essentially "python" backed,uses pd.NA, returns nullable boolean/integer arrays in some operations) pyarrow_numpy StringArray (pyarrow backed, uses np.nan and returns numpy boolean/integer arrays in some operations)

Proposed

python_numpy StringArray (uses object dtype - essentially "python" backed, uses np.nan and returns numpy boolean/integers arrays in some operations)

where pyarrow_numpy and python_numpy are temporary "compatibility" dtypes, just so that users can upgrade to 3.0, without requiring code changes specifically for strings.

If the plan long term is to move to pd.NA, we would then deprecate these pyarrow_numpy and python_numpy dtypes for 4.0?

simonjayhawkins commented 5 months ago

So this new string dtype (with NumPy semantics) is now the long term solution for pyarrow backed arrays in pandas?

No, it is the current, potentially temporary solution for pyarrow-backed arrays which are enabled by default (so right now this is only the pyarrow-backed string dtype). And the "potentially temporary" part is dependent on a separate discussion about how to move forward to nullable dtypes in general (not specific to strings).

Thanks for clarifying.

This is why I think that maybe we are not yet ready. We make a breaking change that is considered temporary. I'm not really in favor of this but could have lived with it as a transition if we were making PyArrow a required dependency in 3.0. (IIRC it was already introduced without a PDEP and without cc the core team on the discussions until after the work had been done and some parts already merged).

I can also see that without PyArrow as a required dependency, continuing with this new string data type gives some significant performance improvements that are equivalent to those that were initially used to partially justify having PyArrow as a required dependency.

I'm not sure if others would agree, whether a retrospective PDEP for this dtype change would be beneficial so that the change has more eyes on it. Without a PDEP, and without having PyArrow as a required dependency, maybe the new string type could remain behind the future flag for now?

jorisvandenbossche commented 5 months ago

where pyarrow_numpy and python_numpy are temporary "compatibility" dtypes, just so that users can upgrade to 3.0, without requiring code changes specifically for strings.

Yes (users might still have to do some code changes just because it is no longer object dtype but a string dtype, but users shouldn't need to make code changes related to missing values because of the new string dtype)

If the plan long term is to move to pd.NA, we would then deprecate these pyarrow_numpy and python_numpy dtypes for 4.0?

If we get agreement on that long term plan for pd.NA, then yes (but the same applies to all our default dtypes right now that use NaN/NaT as missing value sentinel, also all those other dtypes will need a transition, so this is not specific to those new string dtypes)

This is why I think that maybe we are not yet ready. We make a breaking change that is considered temporary.

The "breaking" aspect (the fact that it will be a "string" dtype and no longer object dtype) is not a temporary change, though. The aspect about NaN might be temporary, but that is exactly done to make this part non-breaking at this moment (current columns with string values as object dtype also use NaN as missing value sentinel)

We are not ready to make a string dtype + NA the default, but IMO we are perfectly ready to just make a string dtype the default (without missing value change).

simonjayhawkins commented 5 months ago

We are not ready to make a string dtype + NA the default, but IMO we are perfectly ready to just make a string dtype the default (without missing value change).

This seems reasonable on the proviso that the fallback https://github.com/pandas-dev/pandas/pull/58451 behaves exactly the same (and this PR is not yet ready, in your own words "There are a bunch more of such changes needed ... to properly handle the case without pyarrow installed.", otherwise we don't address @jbrockmendel original concerns about different behaviors.

We should also probably remove the pyarrow_numpy naming, which I know you were uncomfortable with. This very naming suggests some sort of Frankenstein dtype. as @WillAyd states in https://github.com/pandas-dev/pandas/issues/54792#issuecomment-1948407232 the additional dtype could create more user confusion. So for the user they should probably only see "String" and we keep the pyarrow implementation a hidden detail.

I guess this would make the current breaking change, to a pandas native string dtype and a future breaking change to using pd.NA (either in conjunction or separate from returning pandas nullable types from operations such as str.len) more palatable.

Either way, I really think a PDEP was needed and it maybe not to late to do this? There are significant design decisions that have to date only been discussed between a small group and the more eyes the more likely a better (or the best) solution will be delivered.

simonjayhawkins commented 5 months ago

Myself and @lithomas1 are currently working on finishing a pandas string DType using the new numpy 2.0 variable length string dtype, hopefully in time for pandas 3.0. This would have to be gated behind numpy runtime version check, but also a possible option for users who have numpy 2.0 installed.

The new string dtype (PyArrow backed with NumPy schematics) that is being proposed for 3.0 (https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080683080) was originally incorporated into pandas when it was assumed that PyArrow was going to become a required dependency. The logic for this was it was felt that the consequences of having it use the pandas missing value indicator and returning other pandas nullable types had not been discussed properly in the PDEP (https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2090822129)

The idea of a fallback, so that PyArrow was not needed to be a required dependency, had some pushback in the original PDEP discussion, due to both behavioral differences and performance concerns. "Offering it to users is doing them a disservice." was one comment and "Personally, I don't see a point for the string[python] implementation anymore. Non expert users end up with this dtype more often than not and wonder why it's slow." was another.

This issue was originally open with the title "DISC: Don't require PyArrow in pandas 3.0, require NumPy 2.0+, use NumPy's StringDType". Having NumPy 2.0 as a required dependency was dismissed but at the time the fallback option was not on the table, so this option was rejected.

Time has moved on and we are now again proposing a fallback option.

As @jorisvandenbossche mentions, we need an agreement on the long term plan for pd.NA for all data types so that the default pandas types are the pandas nullable types.

There are some comments here and elsewhere regarding the confusion of mixing the NumPy and PyArrow semantics. (The term PyArrow is to some extent interchangeable with pandas nullable types and NumPy is to some extent interchangable with pandas current type system). If we cannot get away from having a fallback option then a native string type for pandas using the current type system (NumPy Semantics) could maybe be best achieved using the new Numpy string dtype.

Finally, https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080683080, does not address the other dtype improvements that we planned to deliver. So personally, I disagree that the proposal honors the gist of the PDEP.

simonjayhawkins commented 5 months ago

Assuming that the fallback option could be removed once PyArrow becomes a required dependency if using the proposed new string dtype or could be removed once the minimum required version of Numpy is 2.0 if we used the new NumPy string dtype for a default pandas native dtype.

In general what do people expect these timefames to be. If we introduce a temporary transitional dtype that somehow remains the default for a longer term.

Also, once we have agreed on moving forward with the pandas nullable types as the defaults, do we expect the pandas current type system to remain for backwards compatibility? Do we then deprecate the pyarrow_numpy string dtype, keep it, or replace with the new Numpy native String dtype.

lithomas1 commented 5 months ago

I guess the next steps here would be to:

A) Formally reject/accept PDEP-10.

I will open a PR to change the status of PDEP-10 to rejected, and we should vote on that. If that vote fails, I guess we'll just have to go with accepting pyarrow as a dependency.

B) Specifically open a PDEP about untangling the string dtypes

What we should cover in this PDEP are to

  1. Fix the naming scheme on e.g. pyarrow_numpy

  2. Propose a way to be able to infer string columns as a string dtype instead of object

    • This is where Joris's proposed temporary fallback of the python_numpy (to be renamed following 1) string dtype comes in
  3. Deprecate string methods/operations on object dtype

  4. We should also cover in this PDEP what to do for both cases where either nullable dtypes (pd.NA) becomes default or not.

    • I think we should cover how to deprecate/remove the python_numpy string dtype, and potentially the pyarrow_numpy dtype here
    • The numpy based StringArray (with pd.NA as its missing value) is also a candidate for removal here
  5. Figure out whether or not to integrate numpy's StringDType

C) Push nullable dtypes to be default in some future version of pandas. This might be a topic fit for PDEP-13.

Did I summarize the conversation here accurately?

jorisvandenbossche commented 5 months ago

Either way, I really think a PDEP was needed and it maybe not to late to do this?

If my attempt to find consensus for a compromise about how to move forward on the short term for 3.0 (https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080683080) doesn't work out, then yes a PDEP will help and I am happy to write one, specifically for what I proposed above.

Personally I would only do the effort of writing it in case of 1) the scope is limited to just what to do for a default string dtype in pandas 3.0, and 2) if we can actually move forward with the implementation in parallel so that if the PDEP gets approved we are better prepared to do a 3.0 release (we have plenty of prior cases where we have had features behind an option flag before it was officially approved to make it the default, and also we actually already have this in released pandas for the pyarrow-backed version of the dtype).

jorisvandenbossche commented 5 months ago

@lithomas1 sorry, our posts crossed. I am personally fine with your top-level items, but as I mentioned in my previous post, I would personally keep a PDEP on the string dtype more limited (deprecating string methods for object dtype can be considered later, what to do when NA becomes the default can be left for the PDEP about NA, and I don't think we need to make a final decision on the usage of np.StringDType for pandas 3.0 (as IMO that also doesn't influence what we should do for pandas 3.0)).

jorisvandenbossche commented 5 months ago

My attempt at writing a PDEP for this -> https://github.com/pandas-dev/pandas/pull/58551 This tries to summarize the (long) history and context, and essentially formalizes my proposal above from a few days ago (https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080683080) about how to move forward on this topic on the short-term for pandas 3.0.

simonjayhawkins commented 5 months ago

Personally I would only do the effort of writing it in case of 1) the scope is limited to just what to do for a default string dtype in pandas 3.0, and 2) if we can actually move forward with the implementation in parallel so that if the PDEP gets approved we are better prepared to do a 3.0 release

Sure. That's a great idea. We do not need to agree on, just give due consideration to the bigger picture and also start reviewing the fallback PR and getting parts merged without any blocking as the final decision is gated behind the PDEP.

Thanks @jorisvandenbossche for putting the PDEP together. I am much more comfortable now, especially since PDEP-1 explicitly mentions Adding a new data type has impact on a variety of places that need to handle the data type. Such wide-ranging impact would require a PDEP.

I did feel that PDEP-10 had not made provision for deviating from the existing String types and also if we do not make PyArrow a required dependency we would need to revoke PDEP-10 and so would not have any formal agreement for implementing a native string type in 3.0.