pandas-dev / pandas


DISC: Consider not requiring PyArrow in 3.0 #57073

Open · MarcoGorelli opened this issue 8 months ago

MarcoGorelli commented 8 months ago

TL;DR: Don't make PyArrow required - instead, set minimum NumPy version to 2.0 and use NumPy's StringDType.

Background

In PDEP-10, it was proposed that PyArrow become a required dependency. Several reasons were given, but the most significant reason was to adopt a proper string data type, as opposed to object. This was voted on and agreed upon, but there have been some important developments since then, so I think it's warranted to reconsider.

StringDType in NumPy

There's a proposal in NumPy to add a StringDType to NumPy itself. This was brought up in the PDEP-10 discussion, but at the time was not considered significant enough to delay the PyArrow requirement because:

  1. NumPy itself might not accept its StringDType proposal.
  2. NumPy's StringDType might not come with the algorithms pandas needs.
  3. pyarrow's strings might still be significantly faster.
  4. because pandas typically supports older NumPy versions (in addition to the latest release), it would be 2+ years until pandas could use NumPy's strings.

Let's tackle these in turn:

  1. I caught up with Nathan Goldbaum (author of the StringDType proposal) today, and he's said that NEP55 will be accepted (although technically still in draft status, it has several supporters and no objectors and so realistically is going to change to "accepted" very soon).

  2. The second concern was the algorithms. Here's an excerpt of the NEP I'd like to draw attention to:

    In addition, we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs [universal functions] that will be newly available in NumPy 2.0.

    So, NEP55 not only provides a NumPy StringDType, but also efficient string algorithms (see the sketch just after this list).

    There's a fork implementing this in pandas, which Nathan has been keeping up to date. Once the NumPy StringDType is merged into NumPy main (likely next week) it'll be much easier for pandas devs to test it out. Note: some parts of the fork don't yet use the ufuncs, but they will soon; it's just a matter of updating things.

    For any ufunc that's missing, Nathan's said that now that the string ufuncs framework exists in NumPy, it's relatively straightforward to add new ones (e.g. for .str.partition). There is real funding behind this work, so it's likely to keep moving quite fast.

  3. Nathan's said he doesn't have timings to hand for this comparison, and is about to go on holiday 🌴. He'll be able to provide timings in 1-2 weeks' time, though.

  4. Personally, I'd be fine with requiring NumPy 2.0 as the minimum NumPy version for pandas, if it means efficient string handling by default without the need for PyArrow. Also, Nathan Goldbaum's fork already implements this for pandas. So, no need to wait 2 years, it should just be a matter of months.
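
For concreteness, here's a minimal sketch of the API in question (assuming NumPy >= 2.0 and the NEP 55 interface as currently drafted; illustrative only, not pandas code):

import numpy as np
from numpy.dtypes import StringDType

# Variable-width UTF-8 strings; na_object gives the dtype a missing-value
# sentinel, which is what a pandas string array would need.
dt = StringDType(na_object=np.nan)
arr = np.array(["pandas", "pyarrow", "numpy"], dtype=dt)

print(np.strings.str_len(arr))    # [6 7 5]
print(np.strings.upper(arr))      # ['PANDAS' 'PYARROW' 'NUMPY']
print(np.strings.find(arr, "a"))  # [ 1  2 -1]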

Feedback

The feedback issue makes for an interesting read: https://github.com/pandas-dev/pandas/issues/54466. Complaints seem to come mostly (as far as I can tell) from other package maintainers who are considering moving away from pandas (e.g. fairlearn).

This one surprised me; I don't think anyone had considered it before. One could argue that it's VirusTotal's issue, but still, I just wanted to bring visibility to it.

Tradeoffs

In the PDEP-10 PR it was mentioned that PyArrow could help reduce some maintenance work (which, despite some funding, still seems to be mostly volunteer-driven). Has this been investigated further? Is it still likely to be the case?

Furthermore, not requiring PyArrow would mean not being able to infer list and struct dtypes by default (at least, not without significant further work).

"No is temporary, yes is forever"

I'm not saying "never require PyArrow". I'm just saying, at this point in time, I don't think the requirement is justified. Of the proposed benefits, the most salient one is strings, and now there's a realistic alternative which doesn't require taking on an extra massive dependency.

I acknowledge that lately I've been more focused on other projects, and so don't want to come across as "I'm telling pandas what to do because I know best!" (I certainly don't).

Circumstances have changed since the PDEP-10 PR and vote, and personally I regret voting the way I did. Does anyone else feel the same?

mroeschke commented 8 months ago

TLDR: I am +1 on not making pyarrow a required dependency in pandas 3.0. I am -1 on making NumPy 2.0 the minimum version and the numpy StringDType the default in pandas 3.0. Keep the status quo in 3.0.

A few thoughts:

  1. numpy's StringDType will still be net new in 2.0. While I expect the new type to be robust and more performant than object, I think that, as with any new feature, it should be opt-in before being made the default, since the scope of edge-case incompatibility is unknown. pyarrow strings have been around since 1.3, and it was only recently decided to make them the default (I understand it's a different type system too).

  2. I have a biased belief that the pyarrow type system, with its nullability and support for more types, would be a net benefit for users, but I understand that the current numpy type system is "sufficient". It would be cool to allow users to use pyarrow types everywhere in pandas by default, but I think making that opt-in is a likely end state for pyarrow + pandas.

WillAyd commented 8 months ago

I think we should still stick with PDEP-10 as is; even if user benefit 1 wasn't as drastic as envisioned, I still think benefits 2 and 3 help immensely.

Generally, the story around the pandas type system is very confusing; I am hopeful that moving towards the Arrow type system solves that over time.

jorisvandenbossche commented 8 months ago

Personally I am in favor of keeping pyarrow optional (although I voted for the PDEP, because I find it more important to have a proper string dtype). But I also agree with Matt that it seems too fast to require numpy >= 2 for pandas (not only because the string dtype is very new, but also because requiring such a new version of numpy, one that many other packages will not yet be compatible with, will be annoying for the ecosystem).

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. Users still get the benefit of a new default, proper string dtype in 3.0 in all cases, but if they also want the performance improvements of the new string dtype, they need to have pyarrow installed. Then it's up to users to make that trade-off (and we can find ways to strongly encourage users to use pyarrow).
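
(The object-backed string dtype already exists today; a minimal illustration of the proposed fallback, runnable with current pandas:)

import pandas as pd

# "string[python]" is the existing object-backed StringDtype
s = pd.array(["foo", None, "bar"], dtype="string[python]")
print(s.dtype)  # string
print(s[1])     # <NA>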

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in https://github.com/pandas-dev/pandas/pull/54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fall back to Python's string methods otherwise (or, if we could vendor some code for this, progressively implement some string methods ourselves).
This of course requires a decent chunk of work in pandas itself, but it has the advantages that it keeps compatibility with the Arrow type system (and zero-copy conversion to/from Arrow), and that it already gives some benefits when pyarrow is not installed (improved memory usage, performance improvements for a subset of methods).
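
To make the "Arrow memory model without pyarrow" idea concrete, here is a rough sketch of the classic Arrow string layout (validity bitmap, int32 offsets, one contiguous UTF-8 buffer) built with plain NumPy. Names and structure are illustrative only, not a proposed pandas API:

import numpy as np

values = ["foo", None, "bar"]

# Validity bitmap: bit i is set iff element i is non-null (LSB-first, as in Arrow).
validity = np.zeros((len(values) + 7) // 8, dtype=np.uint8)
for i, v in enumerate(values):
    if v is not None:
        validity[i // 8] |= 1 << (i % 8)

# Offsets: element i occupies data[offsets[i]:offsets[i + 1]].
offsets = np.zeros(len(values) + 1, dtype=np.int32)
for i, v in enumerate(values):
    offsets[i + 1] = offsets[i] + (len(v.encode("utf-8")) if v is not None else 0)

# One contiguous UTF-8 data buffer.
data = np.frombuffer("".join(v for v in values if v is not None).encode("utf-8"), dtype=np.uint8)

def get(i):
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")

assert [get(i) for i in range(len(values))] == ["foo", None, "bar"]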

lithomas1 commented 8 months ago

TLDR: I am +1 on not making pyarrow a required dependency in pandas 3.0. I am -1 on making NumPy 2.0 the minimum version and the numpy StringDType the default in pandas 3.0. Keep the status quo in 3.0.

+1 on this as well. IMO, it's too early to require numpy 2.0 (since it's pretty hard to adapt to the changes).

cc @pandas-dev/pandas-core

datapythonista commented 8 months ago

+1 on not requiring numpy 2 for pandas 3.

I'm fine with continuing as planned with the PDEP. If we consider the option of another Arrow implementation replacing PyArrow, it feels to me like Arrow-rs is a better option than nanoarrow (at least an option also worth considering). Last time this was discussed it wasn't clear what would happen with the two Rust implementations, but now everybody (except Polars, for now) has settled on Arrow-rs and Arrow2 is discontinued. So, things are stable.

If there is interest, I can research further and work on a prototype.

Dr-Irv commented 8 months ago

I think we should wait for more feedback in #54466 . pandas 2.2 was released only 11 days ago. I say we give it a month, or maybe until the end of February, and then make a decision. The whole point of gaining feedback was to give us a chance to revisit the decision to make pyarrow a required dependency. Seems like our options at this point with pandas 3.0 are:

  1. Require pyarrow as planned from PDEP-10
  2. Require numpy 2.0 and use the numpy implementation for strings.
  3. Postpone to a later date any requirement for pyarrow - make it optional but allow people to get better string performance by opting in.

Given the feedback so far and the arguments that @MarcoGorelli gives above and the other comments, I'm leaning towards (3), but I'd like to see more feedback from the community at large.

lithomas1 commented 8 months ago

IMO, I think we should make a decision by the next dev call (Feb. 7th, I think?).

I'm probably going to release 2.2.1 at most 2 weeks after numpy releases the 2.0 rc (so probably around Feb. 14, assuming the numpy 2.0 rc is released on schedule on Feb. 1), and I think we should decide whether to roll back the warning for 2.2.1, to avoid confusion.

datapythonista commented 8 months ago

I did a quick test of how big a binary using Arrow-rs (Rust) would be. In general only static linking is used in Rust, so just one .so with no dependencies would be needed. A sample library using Arrow-rs with the default components (arrow-json, arrow-ipc...) compiles to a file of around 500kb. In that sense, the Arrow-rs approach would solve the installation and size issues. Of course this is not an option for pandas 3.0, and it requires a non-trivial amount of work.

Something that could make this happen quicker and with less effort is implementing the same API as PyArrow on top of Arrow-rs, for the parts we need. In theory, that would allow us to simply replace PyArrow with the new package and update the imports.

If there is interest in giving this a try, I'd personally change my vote here from requiring PyArrow in pandas 3 to keeping the status quo for now.

simonjayhawkins commented 8 months ago

IMO, I think we should make a decision by the next dev call (Feb. 7th, I think?).

I assume that the decision would be whether we plan to revise the PDEP and then go through the PDEP process again for the revised PDEP?

The PDEP process was created not only so that decisions get sufficient discussion and visibility, but also so that, once agreed, people can work towards the agreed changes/improvements without being vetoed by individual maintainers.

In this case, however, it may be that several maintainers would vote differently now.

Does our process allow us to re-vote on the existing PDEP? (Given that the PDEP did include the provision to collect feedback from the community.)

Does the outcome of any discussions/decisions on this affect whether the next pandas version is 3.0 or 2.3?

attack68 commented 8 months ago

Agree with Simon: this concern was discussed as part of the original PDEP (https://github.com/pandas-dev/pandas/pull/52711#discussion_r1185126720), including some timelines, and the vote was still approved. I somewhat expected some of the pushback from developers of web apps, so I am supportive of this new proposal as well as of my original vote, but it needs to fit in with the established governance, and we should also be cautious of any development that has taken place in H2 '23 in anticipation of the PDEP's implementation. I would expect the approved PDEP to continue to steer development until formally agreed otherwise. I don't see a reason why a new PDEP could not be proposed to alter/amend the previous one, particularly if there already seems to be enough support to warrant one.

WillAyd commented 8 months ago

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fall back to Python's string methods otherwise (or, if we could vendor some code for this, progressively implement some string methods ourselves).

Following @jorisvandenbossche's idea, I wanted to try to implement an ExtensionArray-compatible StringArray using nanoarrow. Some Python idioms like negative indexing aren't yet implemented, and there was a limitation around classmethods I haven't worked around, but otherwise I did get this implemented here:

https://github.com/WillAyd/nanopandas/tree/7e333e25b1b4027e49b9d6ad2465591abf0c9b27

I also implemented some of the optional interface items like unique, fillna and dropna, alongside a few str accessor methods.

Of course benchmarking this would take some effort, but I think most of the algorithms we would need are pretty simple.

simonjayhawkins commented 8 months ago

Personally I am in favor of keeping pyarrow optional (although I voted for the PDEP, because I find it more important to have a proper string dtype). But I also agree with Matt that it seems too fast to require numpy >= 2 for pandas (not only because the string dtype is very new, but also because requiring such a new version of numpy, one that many other packages will not yet be compatible with, will be annoying for the ecosystem).

I too was keen to keep pyarrow optional but voted for the PDEP for the benefits for other dtypes.

From the PDEP... "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object. Additionally, we will infer all dtypes that are listed below as well instead of storing as object."

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. Users still get the benefit of a new default, proper string dtype in 3.0 in all cases, but if they also want the performance improvements of the new string dtype, they need to have pyarrow installed. Then it's up to users to make that trade-off (and we can find ways to strongly encourage users to use pyarrow).

IIRC I also made this point in the original discussion, but there was pushback to having the object-backed StringDtype as the default if pyarrow is not installed, which included not only concerns about performance but also concerns about different behavior depending on whether a dependency was installed. (The timelines for NumPy's StringDType precluded it as an option to address the performance concerns.)

However, I did not push this point once the proposal was expanded to dtypes other than strings.

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fall back to Python's string methods otherwise (or, if we could vendor some code for this, progressively implement some string methods ourselves).

Didn't we also discuss using, say, nanoarrow? (Or am I mixing this up with the discussion on requiring pyarrow for the I/O interface?)

If this wasn't discussed, then a new/further discussion around this option would add value (https://github.com/pandas-dev/pandas/issues/57073#issuecomment-1925417591), especially since @WillAyd is actively working on this.

WillAyd commented 8 months ago

Another advantage of building on top of nanoarrow is that we would have the ability to implement our own algorithms to fit the needs of pandas. Here is a quick benchmark of the nanopandas isna() implementation versus pandas:

In [3]: import nanopandas as nanopd

In [4]: import pandas as pd

In [5]: arr = nanopd.StringArray([None, "foo"] * 1_000_000)

In [6]: ser = pd.Series([None, "foo"] * 1_000_000, dtype="string[pyarrow]")

In [7]: arr.isna().to_pylist() == list(ser.isna())
Out[7]: True

In [8]: %timeit arr.isna()
10.7 µs ± 45 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit ser.isna()
2 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That's about a 200x speedup. Of course it's not a fair comparison, because the pandas arrow extension implementation calls to_numpy(); but in theory we would have more flexibility to avoid that copy to numpy if we take on more management of the lower level.
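
For intuition on why the bitmap layout wins: Arrow packs null-ness for eight elements into a single byte, so isna() reduces to one vectorized bit-unpack rather than a per-element check of Python objects. A sketch, assuming Arrow's LSB-first bit packing and a hypothetical validity buffer:

import numpy as np

n = 2_000_000
# Hypothetical validity bitmap for n elements: one bit each, 1 = valid, 0 = null.
validity = np.random.default_rng(0).integers(0, 256, size=n // 8, dtype=np.uint8)

# isna is just the logical NOT of the unpacked bits.
isna = ~np.unpackbits(validity, bitorder="little").view(bool)[:n]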

jreback commented 8 months ago

going down this path is a tremendous amount of work - effectively replicating pyarrow

this should not be taken lightly - the purpose of having pyarrow as a required dependency is to reduce overall project requirements

WillAyd commented 8 months ago

The added workload is a very valid concern, though it's definitely not on the scale of replicating pyarrow. We should just be using the nanoarrow API, and not even managing memory, since the nanoarrow C++ helpers can do that for us.

datapythonista commented 8 months ago

While I'm surely +1 on replacing PyArrow by a better implementation, I guess the proposal is not to implement the string algorithms in nanoarrow and make this the default for pandas 3.0, right?

So, I think in some months we can have pandas strings based on:

Besides the existing ones based on NumPy objects and PyArrow.

To narrow the discussion, I think we need to decide, somewhat independently:

simonjayhawkins commented 8 months ago

Yes, I too think that full implementations may be too ambitious, and may not even be necessary (performance-wise). I would think these implementations would only be needed as fallbacks if we were to U-turn on the pyarrow requirement, so that we could still move forward with defaulting to Arrow-memory-backed arrays for the dtypes listed in the PDEP for pandas 3.0.

The feedback from users is not unexpected and was discussed (other than the noise regarding warnings)

As @attack68 said, "I would expect the approved PDEP to continue to steer the development until formally agreed otherwise.", i.e. through the PDEP revision procedure.

jorisvandenbossche commented 8 months ago

Good idea to distinguish the shorter and longer term. But on the short term options:

What do we do for pandas 3? I think the only reasonable options are require PyArrow and have pyarrow strings by default, or keep the status quo

No, as I mentioned above, IMO we already have an obvious fallback for pyarrow, which is the object-dtype backed StringDtype. So I think making that the default (if pyarrow is not installed) is another very reasonable option for pandas 3.0.

(personally, for me the most important thing is that I want to see string in the df.dtypes for columns with strings, so that we can stop explaining to users "if you see "object" that's probably a column with strings". How it's exactly implemented under the hood is more of a detail, although I certainly want a more performant implementation as well).
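
As a minimal illustration of that user-visible change (whatever the backing implementation):

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"]})
print(df.dtypes)  # today: name    object
# With a default string dtype (pyarrow- or object-backed) this would show:
#                         name    string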

Simon brought up the point of having two implementations with slight differences in behaviour:

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. ...

IIRC I also made this point in the original discussion, but there was pushback to having the object-backed StringDtype as the default if pyarrow is not installed, which included not only concerns about performance but also concerns about different behavior depending on whether a dependency was installed. (The timelines for NumPy's StringDType precluded it as an option to address the performance concerns.)

And that's certainly a valid downside of this (I seem to remember we have had issues with this in the past with different behaviour when numexpr or bottleneck was installed and "silently" used by default).
I do wonder, however, whether we have an idea of how many behaviour differences there are, from testing and user reports of the arrow-backed StringDtype over the last years (I know of one reported to pyarrow about a different upper case for ß in https://github.com/apache/arrow/issues/34599). I don't know if we have some skips/special cases in our tests because of behaviour differences.

This might also be unavoidable in general, for other data types as well. It seems likely that also for numeric data we will have a numpy-based and pyarrow-based implementation side by side for some time, and also there there will likely be slight differences in behaviour.

simonjayhawkins commented 8 months ago
  • What do we do for pandas 3? I think the only reasonable options are require PyArrow and have pyarrow strings by default, or keep the status quo

Yes, it is all too easy for the discussion to go off on tangents, and this issue was opened with the suggestion of requiring NumPy 2.0+.

It appears there is no support for this at all?

The other question raised was whether anyone would vote differently now. It does appear that several maintainers would. For those who would, it would be interesting to know explicitly what changes to the PDEP they would expect.

Or to keep the status quo, we would somehow need a re-vote on the current PDEP revision.

To be clear, without any changes to the PDEP, I would not change my vote. I do not regret the decision, since it was based on better data types beyond just strings, and discussions around a better string data type do not fully address this.

jorisvandenbossche commented 8 months ago

this issue was opened with the suggestion of requiring NumPy 2.0+. It appears there is no support for this at all?

Unless we want to change the timeline for 3.0 (or delay the introduction of the string dtype to a later pandas release), I think it's not very realistic. To start, this change hasn't even landed yet on numpy main. I think it would also be annoying for pandas to strictly require numpy >= 2.0 that soon (given numpy 2.0 itself is also a breaking release). Further, numpy only implements a subset of string kernels (for now), so even when using numpy for the memory layout, we would still need a fallback to Python objects for quite a few of our string methods. Given the last item, we would also want to keep the option to use PyArrow for strings as well, resulting in this double implementation anyway (with the possible behaviour differences). At that point, I think the easier option is to use the object-dtype fallback instead of a new numpy-2-based fallback.

datapythonista commented 8 months ago

Sorry, I missed that option @jorisvandenbossche. I personally don't like using string[object] by default: it doesn't add value in functionality or performance, and makes users have to learn more cumbersome things. But it's an option, so for pandas 3 we have:

  1. Continue with the agreed PDEP and require PyArrow
  2. "Cancel" the PDED and continue with the object type
  3. Use the string dtype backed by NumPy objects

For the long term we have:

While I don't think we need a full rewrite of PyArrow, I think we need the following things in any Arrow implementation we use, for it to be functional (string operations alone don't seem useful to me, as PyArrow would still be required anyway to have a dataframe with Arrow columns):

I think for the nanoarrow approach we need all this, which does feel like almost rewriting PyArrow from scratch. Also, do we have a reason to think the new implementation will be smaller than Arrow? What do you think @WillAyd? Maybe I'm missing something here.

While Arrow-rs doesn't have Python bindings, Datafusion does. It seems to provide all or most of what we need to fully replace PyArrow. The Python package doesn't have dependencies and requires 43Mb. Quite big, but less than half of PyArrow. The build should be just a standard build; I think that was another issue with PyArrow. I think it's an option worth exploring.

jorisvandenbossche commented 8 months ago

Sorry, I missed that option @jorisvandenbossche. I personally don't like using string[object] by default: it doesn't add value in functionality or performance, and makes users have to learn more cumbersome things.

Don't worry ;) But can you clarify which cumbersome things users would have to learn? For the average user, whether we use a pyarrow string array or a numpy object array under the hood, that's exactly the same and you shouldn't notice that (except for performance differences, of course).
While it indeed doesn't give any performance benefits, IMO it gives a big functionality advantage in simply having a string dtype, compared to the current catch-all object dtype (that's one of the main reasons we added this object-dtype based StringDtype already in pandas 1.0, https://pandas.pydata.org/docs/dev/whatsnew/v1.0.0.html#dedicated-string-data-type). Functionality-wise, there is actually basically no difference between the object- and pyarrow-based StringArray (with the exception of a few corner cases where pyarrow doesn't have an implementation and the pyarrow-backed array still falls back to Python).

datapythonista commented 8 months ago

I was rereading your original comment, and I realize now that your initial proposal is to make the PyArrow string type the default, except when PyArrow is not installed, right? Your last comment sounded like you wanted to always default to the string object type; that's what I find complex to learn (considering what users already know about object...).

String PyArrow type as default and string object as fallback seems like a reasonable trade-off to me.

simonjayhawkins commented 8 months ago

String PyArrow type as default and string object as fallback seems like a reasonable trade-off to me.

Yes. We had this discussion in the original PDEP, starting around https://github.com/pandas-dev/pandas/pull/52711#issuecomment-1618007396, following on from a discussion around the NumPy string type, and yet we still voted (as a team) to require PyArrow.

What I see as new to the discussion is considering using nanoarrow to instead provide some sort of fallback option, not the default.

I see this could potentially address some of the concerns around data types other than strings, e.g. https://github.com/pandas-dev/pandas/pull/52711#issuecomment-1518717765

WillAyd commented 8 months ago

To be clear, at no point was I suggesting we rewrite pyarrow; what I showed is simply an extension array that uses Arrow-native storage. That is a much smaller scope than what some of the discussions here have veered towards.

I don't think any of the arguments raised in this discussion are a surprise, and I still vote to stick with the PDEP. If the PDEP got abandoned but we still wanted Arrow strings without pyarrow, then the extension array I showcased above may be a reasonable fallback, and may even be easier to integrate than a "string[object]" fallback, because at the raw storage level it fits the same mold as a pyarrow string array.

simonjayhawkins commented 8 months ago

I don't think any of the arguments raised in this discussion are a surprise, and I still vote to stick with the PDEP. If the PDEP got abandoned but we still wanted Arrow strings without pyarrow, then the extension array I showcased above may be a reasonable fallback, and may even be easier to integrate than a "string[object]" fallback, because at the raw storage level it fits the same mold as a pyarrow string array.

Thanks @WillAyd for elaborating.

I think if the PDEP were revised to include something like this (not requiring pyarrow, but, when it is not installed, defaulting to an Arrow-memory-backed array on construction with limited functionality, and advising users to install pyarrow), I would perhaps vote differently now.

So I agree that, at this point in time, these alternatives perhaps only need discussion if enough people are strongly against requiring pyarrow as planned.

jbrockmendel commented 8 months ago

I'd like to better understand how the numpy string will differ from a pyarrow string. In particular, will converting between them be zero-copy?

Short term, could/should we update the docs and warning messages to say we will infer a performant string dtype without specifically saying "pyarrow"? If a lighter-weight drop-in does become available, this might make it easier to swap them out.

WillAyd commented 8 months ago

@jbrockmendel the NEP is really well laid out and goes over the memory footprint as well as the comparison to PyArrow:

https://numpy.org/neps/nep-0055-string_dtype.html#memory-layout-examples

Unfortunately it looks to be a different layout, so I think there would have to be a copy. @ngoldbaum may have more insights though.

jorisvandenbossche commented 8 months ago

Indeed, the memory layout is different, and zero-copy conversion won't be possible (for the new Arrow string_view type, a partial zero-copy conversion might be possible for the numpy -> arrow direction, but also not for arrow -> numpy).

Short term, could/should we update the docs and warning messages to say we will infer a performant string dtype without specifically saying "pyarrow"?

At that point, I don't think it would make sense to have a warning (i.e. then we should just remove the warning). Because AFAIK the warning wasn't really introduced to warn about the behaviour change of having a string dtype, but specifically for the change in required dependencies.

ngoldbaum commented 8 months ago

Unfortunately it looks to be a different layout, so I think there would have to be a copy. @ngoldbaum may have more insights though.

This is true. However:

MarcoGorelli commented 8 months ago

It's true that feedback from users is not unexpected and was discussed, but I'm not sure that such negative feedback from maintainers of other projects (e.g. scikit-learn and related) was. For example, this comment came after most of the pandas devs had already voted on the PDEP, and this one was made after 2.2. It's true that they'd already commented on the PDEP issue, but that was a much milder "it would be a pity if pyarrow becomes a mandatory dependency".

I just wanted to make sure people are aware of them.

Regardless of your thoughts on the library they're now recommending people switch to, is "we've made scikit-learn devs mad" a good look for pandas?

Dr-Irv commented 8 months ago

As a reminder, during the discussion on PDEP-10, I wrote the following at https://github.com/pandas-dev/pandas/pull/52711#issuecomment-1620923231 :

MHO, it's better that we get some feedback, rather than none. The wording as proposed doesn't commit us to saying we will not require pyarrow if we get negative feedback - it just says that we will get the feedback, which gives us the possibility of delaying the requirement based on that feedback.

I'd split the current feedback into 2 broad areas:

  1. People who don't want to see pyarrow required.
  2. People who don't like the deprecation message (or specifics about it, like having it start with \n)

I also think that one of the original reasons for this requirement was to reduce the development burden on the pandas team by having only one supported string type. It now seems that we may not be able to avoid supporting more than one.

Given the nature of how this discussion has evolved, I personally would vote to not require pyarrow and we have to live with a strategy of "use pyarrow if available for strings, and use 'something else' if pyarrow is not available." It's not clear to me what the 'something else' should be (Numpy 2.0, current StringDtype, current object implementation, nanoarrow, etc.) but I think the decision to delay the pyarrow requirement can be made independent of whatever fallback mechanism we choose.

And if we do decide to delay requiring pyarrow, we can create a 2.3 without the deprecation warning. One advantage of reversing our decision is that it shows we do listen to the community.

rhshadrach commented 7 months ago

String PyArrow type as default and string object as fallback seems like a reasonable trade-off to me.

If I'm understanding right, then behavior can change based on whether pyarrow is installed even though the code does nothing to invoke pyarrow. E.g.

ser = pd.Series(["1", None, "2"])
print(ser.iloc[1] == ser.iloc[1])

gives False with pyarrow installed, True when not. That seems quite undesirable.
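
The difference comes down to the missing-value scalar each implementation stores; a sketch, assuming NaN semantics for the pyarrow-backed default and None under plain object dtype:

import numpy as np

# pyarrow-backed default (NaN semantics): the missing scalar is float('nan')
print(np.nan == np.nan)  # False

# plain object dtype: the missing value stays None
x = None
print(x == x)            # True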

jbrockmendel commented 7 months ago

gives False with pyarrow installed, True when not. That seems quite undesirable.

Yah, this came up in the PDEP-10 discussion. Different default behavior depending on whether pyarrow is installed is a maintenance nightmare.

eitsupi commented 7 months ago

As commented here, it may be worth noting that if portability to Linux distributions is an issue, the alternative of using arrow-rs instead of C++ libarrow may not be viable: https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1936846576

phofl commented 7 months ago

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have a object-dtype based StringDtype that can be the fallback when pyarrow is not installed. User still get the benefit of a new default, proper string dtype in 3.0 in all cases, but if they also want the performance improvements of the new string dtype, they need to have pyarrow installed. Then it's up to users to make that trade-off (and we can find ways to strongly encourage users to use pyarrow).

This seems to be the best option out of a set of bad options if we don't require arrow

MarcoGorelli commented 6 months ago

I got curious about the impact of pyarrow dtypes on the TPC-H benchmarks (as they test groupby aggs, merge, concat, filtering, ...), and to be fair, pyarrow strings do make a noticeable difference:

[screenshot: TPC-H benchmark results]

I didn't include pandas w/ numpy strings in the comparison, as filtering isn't ready for them yet. Nathan (from numpy) has also said he's working on other projects now, so realistically they won't be production-ready any time soon.

Note that the performance difference above is due almost entirely (but not exclusively) to pyarrow strings (the other dtypes make some difference, but not nearly as noticeable as strings).


I'm most bothered by:

At the same time, the perf benefit to most users is real, and it would be a pity not to deliver it.

If it's not possible to split off a lightweight version of pyarrow strings from pyarrow, then tbh my preference would be to:

Dr-Irv commented 6 months ago

I'm most bothered by :

  • the environmental impact of pandas' 200k PyPI monthly downloads (!) more-than-doubling in size

Did you mean "polars" instead of "pandas" here?

MarcoGorelli commented 6 months ago

No, I meant pandas - it's pandas whose size is going to increase. Polars doesn't depend on PyArrow, and doesn't plan to.

Though given that you ask, this may be a good moment to compare sizes:

Wheel sizes (will more than double!): [image: wheel size comparison]

Package sizes (this will nearly double): [image: installed package size comparison]


There's a real benefit to using pyarrow strings; I'm just hoping there's a solution that can deliver that benefit to most users without alienating libraries whose primary reason to support pandas is the numpy backend.

phofl commented 6 months ago

I think Irv might have meant the number of downloads? I think we have around 200 million downloads instead of 200k?

jorisvandenbossche commented 5 months ago

String PyArrow type as default and string object as fallback seems like a reasonable trade-off to me.

If I'm understanding right, then behavior can change based on whether pyarrow is installed even though the code does nothing to invoke pyarrow. E.g.

ser = pd.Series(["1", None, "2"])
print(ser.iloc[1] == ser.iloc[1])

gives False with pyarrow installed, True when not. That seems quite undesirable.

To clarify, the proposal is to use a variant of the existing object-dtype based StringArray that uses the same missing value semantics as the pyarrow-based one. So the above will give the same result regardless of whether pyarrow is installed.

Just like https://github.com/pandas-dev/pandas/pull/54533 added a variant of the pyarrow-based StringArray to have NaN nullable semantics (see https://github.com/pandas-dev/pandas/issues/54792 for the more general context), we need to do the same for the object-dtype based StringArray. I put up a PR that does this to better illustrate this option: https://github.com/pandas-dev/pandas/pull/58451

jorisvandenbossche commented 5 months ago

Based on the above discussion (several people seem to be OK with a fallback if pyarrow is not installed), and given that we need to make some (at least short term) decision on this to be able to move forward with the 3.0 release, I would like to make the following concrete proposal:

  1. For pandas 3.0, we enable a "string" dtype by default, which uses pyarrow if installed, and otherwise falls back to an in-house, functionally equivalent (but slower) version
  2. For our own version, I think the most realistic option short term is to adapt the existing numpy object-based StringArray to make it functionally equivalent (i.e. use NaN missing value semantics).
  3. We update installation guidelines to clearly encourage users to install pyarrow as the default user experience.

For 1), setting the pd.options.future.infer_string option already gives you the future pyarrow-based string dtype. We can expand that to also work when pyarrow is not installed (and after that, we should start enabling this by default on main).
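
For reference, a hedged example of opting in today (assuming pandas >= 2.1 with pyarrow installed, where this option exists):

import pandas as pd

pd.set_option("future.infer_string", True)
ser = pd.Series(["a", "b", None])
print(ser.dtype)  # a dedicated (pyarrow-backed) string dtype instead of object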

For 2), I opened PR https://github.com/pandas-dev/pandas/pull/58451 that implements this fallback (a new StringDtype(storage="python_numpy") option with the corresponding StringArrayNumpySemantics array). The required code changes are relatively small (although more test changes will be needed) because it mostly reuses the existing StringArray. I think we could still backport this to 2.2.x, so it could get some more user feedback. Longer term, we can still investigate some of the other alternatives mentioned here to improve this fallback (e.g. use nanoarrow, numpy 2, ...), without further delaying 3.0.

For 3), Marco mentioned above that we could use pip install pandas[pyarrow] by default in our installation instructions, to ensure most users get the performance benefit of pyarrow but can still install and use pandas without it if they want. For conda, we could even add pyarrow as a default dependency, and create a pandas-core package without it (similarly to how other projects like dask and geopandas (and also pyarrow itself) provide a minimal installation).

Generally, I think this honors the gist of the PDEP (start using pyarrow for default functionality) and will give users a long awaited proper string dtype for 3.0, while listening to the feedback by not (or at least delaying) making pyarrow a hard dependency.

Thoughts? Feel free to discuss the details in the comments below, but could people also give a thumbs-up on this comment if they are generally on board with such an approach, to get an idea of the room?

attack68 commented 5 months ago

If technically feasible, I think this is a very good compromise for all aspects of the issue. If pyarrow is targeted as a default dependency, I would urge future releases to continue to focus on dependency optimisation, to see whether there are savings that can be made.

ngoldbaum commented 5 months ago

@lithomas1 and I are currently working on finishing a pandas string dtype using the new numpy 2.0 variable-length string dtype, hopefully in time for pandas 3.0. This would have to be gated behind a numpy runtime version check, but it is also a possible option for users who have numpy 2.0 installed.

WillAyd commented 5 months ago

I'm pretty lukewarm on a fallback that uses Python strings; that is functionally a huge step back from Arrow strings (and presumably NumPy 2.0 strings).

jorisvandenbossche commented 5 months ago

I'm pretty lukewarm on a fallback that uses Python strings; that is functionally a huge step back from Arrow strings

Lukewarm is warm enough for me if it allows us to move forward ;) (although to be honest, as a non-native speaker, I might not get the exact subtlety of how to interpret it)

To note, compared to 2.0 there is no step back though: for the many users who have pyarrow, pandas 3.0 will be a big step forward for string handling, and for users without pyarrow it is still a nice step forward in having a proper, strict, dedicated string dtype. (I know you are comparing it to the case of requiring Arrow, but I think the comparison from where users are right now is more important.)

Myself and @lithomas1 are currently working on finishing a pandas string DType using the new numpy 2.0 variable length string dtype, hopefully in time for pandas 3.0

That will be interesting to see! As I mentioned above, I am very interested in further exploring alternatives for the longer term, and we should certainly consider the numpy string dtype there as well. But for the purpose of this decision right now for 3.0, IMO we can't take it much into account (I personally think it is not that likely to be fully ready for 3.0, but even if it is ready on time, we cannot use it as the sole alternative given its numpy version requirement; so if we want to make pyarrow only a soft dependency for the string dtype, we still need the numpy object-dtype based alternative anyway short term). (BTW it might be useful to open a separate issue to discuss if and how we want to integrate with the numpy string dtype, where we can go into more details of that approach and its current state.)

WillAyd commented 5 months ago

Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...

lithomas1 commented 5 months ago

Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...

To be clear, this will probably be the eventual replacement for the object-dtype based numpy string array, since the new numpy string ufuncs should match the semantics of the Python string methods.

So, we'll still end up with 2 string dtypes (eventually).

As long as numpy is a hard dependency, we will probably want some sort of native numpy string dtype (since it wouldn't be ideal to copy numpy strings supplied by a user to object dtype or an Arrow dtype).

simonjayhawkins commented 5 months ago

Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...

I agree.

There seemed to be hesitation/concern about deviating from (revoking parts of) the agreed PDEP when it came to revoking the warning (https://github.com/pandas-dev/pandas/issues/57424#issuecomment-1955174398), and yet IMHO there has been a total deviation from the agreed PDEP with the implementation of a new string dtype with NumPy semantics (#54792).

I think this change was worthy of a PDEP in itself. Surely the point of using PyArrow and our extension dtypes was to improve on (deviate away from) NumPy NaN semantics, towards a more consistent missing value indicator. I fail to see how this new string dtype (and the new fallback) is a long-term benefit, or how it aligns with one of the original goals of PDEP-10, which was claimed to reduce the maintenance burden.

jorisvandenbossche commented 5 months ago

yet IMHO there has been a total deviation from the agreed PDEP with the implementation of a new string dtype with NumPy semantics (#54792).

It is unfortunate that this wasn't properly discussed at the time of the actual PDEP (I don't remember any discussion about it, though I can certainly be misremembering; I didn't check). The PDEP text itself also just says to use "string[pyarrow]" / "the more efficient pyarrow string type" that has been available since 1.2, without mentioning anything about the consequences of this choice.

I understand that others might find this a stretch, but given our complete lack of even mentioning null semantics at the time, personally I interpret the PDEP as using "a string dtype backed by pyarrow". Otherwise this means a silent and implicit change of the default null semantics for our users, a change that would definitely warrant its own PDEP (and one is coming). For lack of specifying this, preserving the default null semantics seems the better default position to me.

I think the discussion in https://github.com/pandas-dev/pandas/issues/54792 already gives some context around why we think it is important to preserve the default dtypes and null semantics for now (and specifically in https://github.com/pandas-dev/pandas/issues/54792#issuecomment-1948520994, where I try to explain that this is meant to make the change less confusing for users)