FEEDBACK: PyArrow as a required dependency and PyArrow backed strings

phofl commented 11 months ago

This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.

The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

If you would like to filter this warning without installing pyarrow at this time, please view this comment: https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166

mynewestgitaccount commented 11 months ago

Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):

Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow using pip from wheels, numpy and pandas requires about 70MB, and including PyArrow requires an additional 120MB. An increase of installation size would have negative implication using pandas in space-constrained development or deployment environments such as AWS Lambda.

I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay.
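For reference, the percentage follows directly from the two figures in the quoted PDEP text. A minimal sketch of the arithmetic (the variable names are only illustrative):

```python
# Figures taken from the PDEP text quoted above.
base_mb = 70     # numpy + pandas installed from wheels
extra_mb = 120   # additional size attributed to PyArrow

total_mb = base_mb + extra_mb
increase_pct = extra_mb / base_mb * 100  # growth relative to the current size

print(f"{base_mb} MB -> {total_mb} MB is a {increase_pct:.0f}% increase")
```

So "170%" is a slight rounding down of roughly 171%.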

For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas.

rebecca-palmer commented 11 months ago

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

mroeschke commented 11 months ago

For that kind of increase, I would expect/want the tradeoff to be major improvements across the board.

Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible.

AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today?

mynewestgitaccount commented 11 months ago

pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively.

The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly.

rebecca-palmer commented 11 months ago

Do you know how these are packaged today?

By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr.

An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work.

rebecca-palmer commented 11 months ago

I do intend to investigate this further at some point - I haven't done so yet because Debian updated numexpr to 2.8.5, breaking pandas (#54449 / #54546), and fixing that is currently more urgent.

jjerphan commented 11 months ago

Hi,

Thanks for welcoming feedback from the community.

While I respect your decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers for two reasons:

installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100MiB to approximately 500 MiB.

Packages size

```
libgoogle-cloud-2.12.0-h840a212_1 : 46106632 bytes
python-3.11.4-hab00c5b_0_cpython : 30679695 bytes
libarrow-12.0.1-h10ac928_8_cpu : 27696900 bytes
ucx-1.14.1-h4a2ce2d_3 : 15692979 bytes
pandas-2.0.3-py311h320fe9a_1 : 14711359 bytes
numpy-1.25.2-py311h64a7726_0 : 8139293 bytes
libgrpc-1.56.2-h3905398_1 : 6331805 bytes
libopenblas-0.3.23-pthreads_h80387f5_0 : 5406072 bytes
aws-sdk-cpp-1.10.57-h85b1a90_19 : 4055495 bytes
pyarrow-12.0.1-py311h39c9aba_8_cpu : 3989550 bytes
libstdcxx-ng-13.1.0-hfd8a6a1_0 : 3847887 bytes
rdma-core-28.9-h59595ed_1 : 3735644 bytes
libthrift-0.18.1-h8fd135c_2 : 3584078 bytes
tk-8.6.12-h27826a3_0 : 3456292 bytes
openssl-3.1.2-hd590300_0 : 2646546 bytes
libprotobuf-4.23.3-hd1fb520_0 : 2506133 bytes
libgfortran5-13.1.0-h15d22d2_0 : 1437388 bytes
pip-23.2.1-pyhd8ed1ab_0 : 1386212 bytes
krb5-1.21.2-h659d440_0 : 1371181 bytes
libabseil-20230125.3-cxx17_h59595ed_0 : 1240376 bytes
orc-1.9.0-h385abfd_1 : 1020883 bytes
ncurses-6.4-hcb278e6_0 : 880967 bytes
pygments-2.16.1-pyhd8ed1ab_0 : 853439 bytes
jedi-0.19.0-pyhd8ed1ab_0 : 844518 bytes
libsqlite-3.42.0-h2797004_0 : 828910 bytes
libgcc-ng-13.1.0-he5830b7_0 : 776294 bytes
ld_impl_linux-64-2.40-h41732ed_0 : 704696 bytes
libnghttp2-1.52.0-h61bc06f_0 : 622366 bytes
ipython-8.14.0-pyh41d4057_0 : 583448 bytes
bzip2-1.0.8-h7f98852_4 : 495686 bytes
setuptools-68.1.2-pyhd8ed1ab_0 : 462324 bytes
zstd-1.5.2-hfc55251_7 : 431126 bytes
libevent-2.1.12-hf998b51_1 : 427426 bytes
libgomp-13.1.0-he5830b7_0 : 419184 bytes
xz-5.2.6-h166bdaf_0 : 418368 bytes
libcurl-8.2.1-hca28451_0 : 372511 bytes
s2n-1.3.48-h06160fa_0 : 369441 bytes
aws-crt-cpp-0.21.0-hb942446_5 : 320415 bytes
readline-8.2-h8228510_1 : 281456 bytes
libssh2-1.11.0-h0841786_0 : 271133 bytes
prompt-toolkit-3.0.39-pyha770c72_0 : 269068 bytes
libbrotlienc-1.0.9-h166bdaf_9 : 265202 bytes
python-dateutil-2.8.2-pyhd8ed1ab_0 : 245987 bytes
re2-2023.03.02-h8c504da_0 : 201211 bytes
aws-c-common-0.9.0-hd590300_0 : 197608 bytes
aws-c-http-0.7.11-h00aa349_4 : 194366 bytes
pytz-2023.3-pyhd8ed1ab_0 : 186506 bytes
aws-c-mqtt-0.9.3-hb447be9_1 : 162493 bytes
aws-c-io-0.13.32-h4a1a131_0 : 154523 bytes
ca-certificates-2023.7.22-hbcca054_0 : 149515 bytes
lz4-c-1.9.4-hcb278e6_0 : 143402 bytes
python-tzdata-2023.3-pyhd8ed1ab_0 : 143131 bytes
libedit-3.1.20191231-he28a2e2_2 : 123878 bytes
keyutils-1.6.1-h166bdaf_0 : 117831 bytes
tzdata-2023c-h71feb2d_0 : 117580 bytes
gflags-2.2.2-he1b5a44_1004 : 116549 bytes
glog-0.6.0-h6f12383_0 : 114321 bytes
c-ares-1.19.1-hd590300_0 : 113362 bytes
libev-4.33-h516909a_1 : 106190 bytes
aws-c-auth-0.7.3-h28f7589_1 : 101677 bytes
libutf8proc-2.8.0-h166bdaf_0 : 101070 bytes
traitlets-5.9.0-pyhd8ed1ab_0 : 98443 bytes
aws-c-s3-0.3.14-hf3aad02_1 : 86553 bytes
libexpat-2.5.0-hcb278e6_1 : 77980 bytes
libbrotlicommon-1.0.9-h166bdaf_9 : 71065 bytes
parso-0.8.3-pyhd8ed1ab_0 : 71048 bytes
libzlib-1.2.13-hd590300_5 : 61588 bytes
libffi-3.4.2-h7f98852_5 : 58292 bytes
wheel-0.41.1-pyhd8ed1ab_0 : 57374 bytes
aws-c-event-stream-0.3.1-h2e3709c_4 : 54050 bytes
aws-c-sdkutils-0.1.12-h4d4d85c_1 : 53123 bytes
aws-c-cal-0.6.1-hc309b26_1 : 50923 bytes
aws-checksums-0.1.17-h4d4d85c_1 : 50001 bytes
pexpect-4.8.0-pyh1a96a4e_2 : 48780 bytes
libnuma-2.0.16-h0b41bf4_1 : 41107 bytes
snappy-1.1.10-h9fff704_0 : 38865 bytes
typing_extensions-4.7.1-pyha770c72_0 : 36321 bytes
libuuid-2.38.1-h0b41bf4_0 : 33601 bytes
libbrotlidec-1.0.9-h166bdaf_9 : 32567 bytes
libnsl-2.0.0-h7f98852_0 : 31236 bytes
wcwidth-0.2.6-pyhd8ed1ab_0 : 29133 bytes
asttokens-2.2.1-pyhd8ed1ab_0 : 27831 bytes
stack_data-0.6.2-pyhd8ed1ab_0 : 26205 bytes
executing-1.2.0-pyhd8ed1ab_0 : 25013 bytes
_openmp_mutex-4.5-2_gnu : 23621 bytes
libgfortran-ng-13.1.0-h69a702a_0 : 23182 bytes
libcrc32c-1.1.2-h9c3ff4c_0 : 20440 bytes
aws-c-compression-0.2.17-h4d4d85c_2 : 19105 bytes
ptyprocess-0.7.0-pyhd3deb0d_0 : 16546 bytes
pure_eval-0.2.2-pyhd8ed1ab_0 : 14551 bytes
libblas-3.9.0-17_linux64_openblas : 14473 bytes
liblapack-3.9.0-17_linux64_openblas : 14408 bytes
libcblas-3.9.0-17_linux64_openblas : 14401 bytes
six-1.16.0-pyh6c4a22f_0 : 14259 bytes
backcall-0.2.0-pyh9f0ad1d_0 : 13705 bytes
matplotlib-inline-0.1.6-pyhd8ed1ab_0 : 12273 bytes
decorator-5.1.1-pyhd8ed1ab_0 : 12072 bytes
backports.functools_lru_cache-1.6.5-pyhd8ed1ab_0 : 11519 bytes
pickleshare-0.7.5-py_1003 : 9332 bytes
prompt_toolkit-3.0.39-hd8ed1ab_0 : 6731 bytes
backports-1.0-pyhd8ed1ab_3 : 5950 bytes
python_abi-3.11-3_cp311 : 5682 bytes
_libgcc_mutex-0.1-conda_forge : 2562 bytes
```
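A listing in this `name : N bytes` shape is easy to total up. The sketch below parses a few entries copied from the listing above and sums them; `parse_sizes` and `ENTRY` are hypothetical names, and the listing itself does not say whether these are download sizes or installed sizes, so the total is only indicative.

```python
import re

# A few entries copied from the listing above, one per line.
listing = """\
libarrow-12.0.1-h10ac928_8_cpu : 27696900 bytes
pandas-2.0.3-py311h320fe9a_1 : 14711359 bytes
numpy-1.25.2-py311h64a7726_0 : 8139293 bytes
pyarrow-12.0.1-py311h39c9aba_8_cpu : 3989550 bytes
"""

# Matches "<package-build-string> : <integer> bytes".
ENTRY = re.compile(r"^(?P<name>\S+) : (?P<size>\d+) bytes$")


def parse_sizes(text):
    """Map each package name to its size in bytes, skipping malformed lines."""
    sizes = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            sizes[match["name"]] = int(match["size"])
    return sizes


sizes = parse_sizes(listing)
total_bytes = sum(sizes.values())
print(f"{len(sizes)} packages, {total_bytes / 2**20:.1f} MiB")
```

Running the same function over the full listing would give the total for the whole environment.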

pyarrow also depends on libarrow, which itself depends on several notable C and C++ libraries. This constrains the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some contexts.

Have you considered those two observations as drawbacks before taking the decision?

lithomas1 commented 11 months ago

Hi,

Thanks for welcoming feedback from the community.

While I respect your decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers for two reasons:

installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100MiB to approximately 500 MiB.

Packages size

pyarrow also depends on libarrow, which itself depends on several notable C and C++ libraries. This constrains the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some contexts.

Have you considered those two observations as drawbacks before taking the decision?

This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193 (for pip only I guess).

While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0 (when pyarrow will actually become required), at least some components will be spun out/made optional/something like that (I heard that the arrow people were talking about this).

(cc @jorisvandenbossche for more info on this)

I'm not an Arrow dev myself, but if it is something that just needs someone to look at it, I'm happy to put some time in to help give Arrow a nudge in the right direction.

Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else?

(IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow, but it's definitely something that can be improved)

jjerphan commented 11 months ago

If libarrow is slimmed down by having non-essential Arrow features be extracted into other libraries which could be optional dependencies, I think most people's concerns would be addressed.

Edit: See https://github.com/conda-forge/arrow-cpp-feedstock/issues/1035

DerThorsten commented 11 months ago

Hi,

Thanks for welcoming feedback from the community. For wasm builds of python / python-packages (i.e. pyodide / emscripten-forge), package size really matters, since these packages have to be downloaded from within the browser. Once a package is too big, usability suffers drastically.

With pyarrow as a required dependency, pandas is less usable from python in the browser.

surfaceowl commented 11 months ago

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

There is another way - use virtual environments in user space instead of system python. The Python Software Foundation recommends users create virtual environments; and Debian/Ubuntu want users to leave the system python untouched to avoid breaking system python.

Perhaps Pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid or at least defer the work of adding pyarrow to APT, as well as the risk of users breaking system python. Also, when I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian, given the release strategy/timing delay.

On the other hand, arrow backend has significant advantages and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources.

A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here.

stonebig commented 11 months ago

I think it's the right path for performance in WASM.

mlkui commented 11 months ago

This is a good idea! But I think there are two other important features, besides strings, that should also be implemented:

Zero-copy for multi-index dataframes. Currently, a multi-index dataframe cannot be converted from an arrow table with zero copy (zero_copy_only=True), which is a BIGGER problem for big dataframes. You can reset_index() the dataframe, convert it to an arrow table, and convert the arrow table back to a dataframe with zero copy, but in the end you must call set_index() on the dataframe to get the multi-index back, and then a copy happens.
Zero-copy for pandas.concat. Arrow table concat can be zero-copy, but when concatenating two zero-copy dataframes (converted from arrow tables), a copy happens even when pandas CoW is turned on. Also, concatenating two arrow tables and then converting the result to a dataframe with zero_copy_only=True is currently not allowed either, because the number of chunks is greater than 1.

phofl commented 11 months ago

@mlkui

Regarding concat: This should already be zero copy:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
df2 = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")

x = pd.concat([df, df2])

This creates a new dataframe that has 2 pyarrow chunks.

Can you open a separate issue if this is not what you are looking for?

mlkui commented 11 months ago

@phofl Thanks for your reply. But your example may be too simple. Please see the following code (tested with pandas 2.0.3 and pyarrow 12.0, and with pandas 2.1.0 and pyarrow 13.0):

import pandas as pd
import pyarrow as pa

# pa.total_allocated_bytes() reports bytes allocated by Arrow's default memory
# pool (not the process RSS), so it stays flat while conversions are zero-copy.

with pa.memory_map("d:\\1.arrow", 'r') as source1, pa.memory_map("d:\\2.arrow", 'r') as source2, pa.memory_map("d:\\3.arrow", 'r') as source3, pa.memory_map("d:\\4.arrow", 'r') as source4:

    c1 = pa.ipc.RecordBatchFileReader(source1).read_all().column("p")
    c2 = pa.ipc.RecordBatchFileReader(source2).read_all().column("v")
    c3 = pa.ipc.RecordBatchFileReader(source3).read_all().column("p")
    c4 = pa.ipc.RecordBatchFileReader(source4).read_all().column("v")
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    s1 = c1.to_pandas(zero_copy_only=True)
    s2 = c2.to_pandas(zero_copy_only=True)
    s3 = c3.to_pandas(zero_copy_only=True)
    s4 = c4.to_pandas(zero_copy_only=True)
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    dfs = {"p": s1, "v": s2}
    df1 = pd.concat(dfs, axis=1, copy=False)   # zero-copy
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    dfs2 = {"p": s3, "v": s4}
    df2 = pd.concat(dfs2, axis=1, copy=False)  # zero-copy
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

    # NOT zero-copy
    result_df = pd.concat([df1, df2], axis=0, copy=False)

with pa.memory_map("z1.arrow", 'r') as source1, pa.memory_map("z2.arrow", 'r') as source2:

    table1 = pa.ipc.RecordBatchFileReader(source1).read_all()
    table2 = pa.ipc.RecordBatchFileReader(source2).read_all()
    combined_table = pa.concat_tables([table1, table2])
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))  # zero-copy

    df1 = table1.to_pandas(zero_copy_only=True)
    df2 = table2.to_pandas(zero_copy_only=True)
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))  # zero-copy

    # Use pandas to concat two zero-copy dataframes,
    # but a copy happens
    result_df = pd.concat([df1, df2], axis=0, copy=False)

    # Try to convert the arrow table to pandas directly.
    # This raises an exception because the number of chunks is 2
    df3 = combined_table.to_pandas(zero_copy_only=True)

    # Combining the chunks into one causes a copy
    combined_table = combined_table.combine_chunks()

0x26res commented 10 months ago

Besides the build size, there is a portability issue with pyarrow.

pyarrow does not provide wheels for as many environments as numpy.

For environments where pyarrow does not provide wheels, pyarrow has to be installed from source which is not simple.

flying-sheep commented 9 months ago

If this happens, would dtype='string' and dtype='string[pyarrow]' be merged into one implementation?

We’re currently thinking about coercing strings in our library, but hesitating because of the unclear future here.

EwoutH commented 9 months ago

pyarrow does not provide wheels for as many environment as numpy.

The fact that they still don’t have Python 3.12 wheels up is worrisome.

h-vetinari commented 9 months ago

The fact that they still don’t have Python 3.12 wheels up is worrisome.

Arrow is a beast to build, and even harder to fit into a wheel properly (so you get fewer features, and things like using the slimmed-down libarrow will be harder to pull off).

Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there.

musicinmybrain commented 8 months ago

Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a libarrow package that provides python3-pyarrow, so I think this shouldn’t be a real problem for us from a packaging perspective.

I’m not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn’t necessarily make the situation worse for us.

ZupoLlask commented 8 months ago

@h-vetinari Almost there? :-)

raulcd commented 8 months ago

@h-vetinari Almost there? :-)

There is still a lot of work to be done on the wheels side. For conda, after the work we did to divide the CPP library, I created this PR, which is currently under discussion. It would provide both a pyarrow-base that only depends on libarrow and libparquet, and a pyarrow that would pull in all the Arrow CPP dependencies. Both have been built with support for everything, so depending on pyarrow-base and libarrow-dataset would allow the use of pyarrow.dataset, etc.

chris-vecchio commented 7 months ago

Thanks for requesting feedback. I'm not well versed in the technicalities, but I strongly prefer not to require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDtype without the added complexity of PyArrow.

phofl commented 7 months ago

@flying-sheep

If this happens, would dtype='string' and dtype='string[pyarrow]' be merged into one implementation?

We’re currently thinking about coercing strings in our library, but hesitating because of the unclear future here.

Sorry for the slow response. dtype="string" will be arrow backed starting from 3.0, or when you activate the infer_string option.

mynewestgitaccount commented 7 months ago

From the PDEP:

Starting in pandas 2.2, pandas raises a FutureWarning when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue.

Is this still planned? It doesn't seem to be occurring in 2.2.0rc0 👀

lithomas1 commented 6 months ago

From the PDEP:

Starting in pandas 2.2, pandas raises a FutureWarning when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue.

Is this still planned? It doesn't seem to be occurring in 2.2.0rc0 👀

I think we are going to add a DeprecationWarning now. (It's not in master yet, but I'm planning on putting in a warning before the actual release of 2.2.)

toni-neurosc commented 6 months ago

Hi, I don't know much about PyArrow overall, but when it comes to saving large dataframes as CSV files, I found that Pandas was super slow and decided to give PyArrow a try instead. The difference in performance was astounding: roughly 7x faster. For a 1GB, all np.float64 dataset:

pandas_df.to_csv(): Time to save: 45.128990650177 seconds.
pyarrow.csv.write_csv(): Time to save: 6.1338911056518555 seconds.

I tried stuff like different chunksizes and index=False, but it did not help.

However, then I tested PyArrow for reading the exact same dataset and it was 2x slower than Pandas:

Time to read CSV (pyarrow): 14.770382642745972 seconds.
Time to read CSV (pandas): 8.440594673156738 seconds.

So, my suggestion I guess would be, to see which tasks are being done more efficiently by PyArrow and incorporate those, and the things that are faster/better in Pandas can stay the same (or maybe PyArrow can incorporate them).
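To put numbers on the two comparisons, here is a quick calculation using the timings reported above. This is only arithmetic on the posted figures, not a new benchmark, and `speedup` is a hypothetical helper name.

```python
# Timings in seconds, copied from the figures posted above.
timings = {
    "write": {"pandas": 45.128990650177, "pyarrow": 6.1338911056518555},
    "read": {"pandas": 8.440594673156738, "pyarrow": 14.770382642745972},
}


def speedup(slower, faster):
    """How many times longer the slower run took than the faster one."""
    return slower / faster


write_speedup = speedup(timings["write"]["pandas"], timings["write"]["pyarrow"])
read_slowdown = speedup(timings["read"]["pyarrow"], timings["read"]["pandas"])

print(f"write: pyarrow is {write_speedup:.1f}x faster than pandas")
print(f"read:  pyarrow is {read_slowdown:.1f}x slower than pandas")
```

On these figures the write path comes out a bit over 7x faster with PyArrow and the read path a bit under 2x slower.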

phofl commented 6 months ago

That's exactly what we intend to do. The csv default engine will stay the same for the time being

toni-neurosc commented 6 months ago

That's exactly what we intend to do. The csv default engine will stay the same for the time being

Thanks for your answer, Patrick. I missed that there is already an open issue to add the pyarrow engine to the to_csv method here, so clearly I'm half a year late to the party. Excuse me for rushing to post. Should I delete my previous post?

mgorny commented 6 months ago

My initial experience with pandas 2.2.0 + pyarrow is that the test suite crashes CPython on assertions. I will report a bug once I get a clear traceback. This will take some time, as I suppose I need to run them without xdist.

mgorny commented 6 months ago

My initial experience with pandas 2.2.0 + pyarrow is that the test suite crashes CPython on assertions. I will report a bug once I get a clear traceback. This will take some time, as I suppose I need to run them without xdist.

I'm sorry but I can't reproduce anymore. I have had apache-arrow built without all the necessary features, and I've fixed that while testing in serial, so my only guess is that the crashes were due to bad error handling when running tests with xdist. I'm sorry for the noise.

willrichmond commented 6 months ago

pyarrow isn't compatible with the most recent versions of numpy (on 1.26)

pyarrow 0.15.0 would require │ ├─ numpy >=1.16,<1.20.0a0 , which conflicts with any installable versions previously reported;

phofl commented 6 months ago

Pyarrow 15 is the newest release, not 0.15

jorenham commented 6 months ago

NumPy is planning to add support for UTF-8 variable-width string DTypes in NEP 55.

Also, if PyArrow is truly going to be a required required dependency in Pandas 3.0, then I don't see the point of the current DeprecationWarning in pandas 2.2.0. All sane package managers install required dependencies automatically, so users don't need to take any action anyway.

jorenham commented 6 months ago

And as for my opinion: I personally find working with Pandas already complicated enough. So I'm afraid that throwing PyArrow into the mix is going to make things worse in that respect.

In other words:

But like has been said before, the potential benefits haven't been made very clear (yet?), so it's hard to give constructive feedback.

jjerphan commented 6 months ago

@phofl: I think it would be valuable that pandas' maintainers provide reasons for having pandas 3 require PyArrow as a dependency.

hagenw commented 6 months ago

Motivation is briefly outlined in PDEP 10.

pyarrow is already integrated in parts of pandas, and it will most likely provide a way for pandas to work well not only with small amounts of data but also with huge data, where it is not the best option at the moment.

milosivanovic commented 6 months ago

Also, if PyArrow is truly going to be a required required dependency in Pandas 3.0, then I don't see the point of the current DeprecationWarning in pandas 2.2.0. All sane package python managers install required dependencies automatically, so users don't need to take any action anyway.

I have the same question - could someone point me to the justification for why the DeprecationWarning was added? Why do users need to manually install pyarrow now, or be told that a new dependency will be required in a release that isn't even out yet?

aman123shampy commented 6 months ago

thanks

jond01 commented 6 months ago

The deprecation warning is ok, but I would like to have a specific pyarrow "extra" of the pandas package, so that I know my version matches pandas' expectations. Currently, three extras install pyarrow: "feather", "parquet", and "all". It would be nice to add a "pyarrow" extra until pandas 3.0 is out, which enables the following:

pip install "pandas[pyarrow]"
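Until such an extra exists, code that wants to know whether pyarrow is present can check without importing it. A minimal sketch using only the standard library; `has_module` is a hypothetical helper name, and the install hint uses the "parquet" extra because that is one of the extras named above as pulling in pyarrow.

```python
from importlib.util import find_spec


def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return find_spec(name) is not None


if has_module("pyarrow"):
    print("pyarrow is available")
else:
    print('pyarrow is missing; one option: pip install "pandas[parquet]"')
```

This only reports whether the module can be found; it does not check that the installed version satisfies pandas' minimum requirement.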

miraculixx commented 6 months ago

Thanks for taking feedback from the community.

PDEP 10 lists the following benefits for making pyarrow a required dependency:

Immediate User Benefit 1: pyarrow strings
Immediate User Benefit 2: Nested Datatypes
Immediate User Benefit 3: Interoperability

From my pov none of these benefit the typical pandas user, unless they already use pyarrow. If they don't, they probably don't need the complexity that pyarrow brings with it (as any package of that magnitude does). In this sense I don't feel the rationale given in the PDEP would find a majority in the wider community.

In my opinion, pyarrow should be kept as an optional extra for those users who may need it. This way everyone benefits, from small to large use cases. If pyarrow is made a required dependency, primarily large use cases benefit, while the majority of use cases incur quite a substantial cost (not least due to requiring more disk space, but also by making it more difficult to install pandas in some environments).

MarcoGorelli commented 6 months ago

Thanks all for comments!

I can't say anything for certain yet, but I'll start by noting that it looks like this may not be a done deal.

On the numpy side: https://github.com/numpy/numpy/pull/25625/files

we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs that will be newly available in NumPy 2.0.

and on today's pandas community call, it was mentioned that

if there's a viable alternative to pyarrow strings, then maybe pyarrow doesn't need to be made required

More updates coming in due course

js345-ai commented 6 months ago

Warning (from warnings module):
File ", line 1
import pandas as pd
DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

BUT I GET THE OUTPUT. I DON'T WANT TO GET THE WARNING MESSAGE. I WANT TO IGNORE THAT WARNING MESSAGE.

adrinjalali commented 6 months ago

You can install pyarrow to silence the warning. In some other places we're thinking of switching to polars since this warning has come up.

MarcoGorelli commented 6 months ago

Alternatively, if you want to just silence the warning for now:

import warnings

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message=r'\nPyarrow will become',
        category=DeprecationWarning,
    )
    import pandas as pd

I wouldn't normally suggest silencing DeprecationWarnings, but given the circumstances this one may be different.

Alternatively, just pin pandas < 2.2 for now
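If the filter is needed in several places, it could be wrapped in a small helper. The sketch below is standard library only: `import_quietly` and `fake_pandas_import` are hypothetical names, and the pandas import is simulated with a warning whose text begins the same way (leading newline included) so the example runs without pandas installed. The `message` argument is a regular expression matched against the start of the warning text, which is why the pattern begins with `\n`.

```python
import warnings


def fake_pandas_import():
    """Hypothetical stand-in for `import pandas` that emits a similar warning."""
    warnings.warn(
        "\nPyarrow will become a required dependency of pandas ...",
        DeprecationWarning,
        stacklevel=2,
    )
    return "pandas"


def import_quietly(importer):
    """Call `importer` with only the Pyarrow DeprecationWarning suppressed."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message=r"\nPyarrow will become",
            category=DeprecationWarning,
        )
        return importer()


# Record warnings to show that the filter is scoped to the helper call.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = import_quietly(fake_pandas_import)
    recorded_while_filtered = len(caught)
    fake_pandas_import()  # outside the helper, the warning is visible again
    recorded_after = len(caught)

print(recorded_while_filtered, recorded_after)
```

In real use the callable would be something like `lambda: importlib.import_module("pandas")`; other DeprecationWarnings are left untouched because the filter matches on the message as well as the category.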

adrinjalali commented 6 months ago

@MarcoGorelli I don't see people writing this much code on top of so many of their files/modules/notebooks to silence the warning. It's very annoying, and it's making CIs fail, where the only solution for those CIs is to add pyarrow to the deps, which itself is huge.

js345-ai commented 6 months ago

You can install pyarrow to silence the warning. In some other places we're thinking of switching to polars since this warning has come up.

how to install?

MarcoGorelli commented 6 months ago

like this: https://arrow.apache.org/docs/python/install.html

MPhuong124019 commented 6 months ago

Data and DataFrame/Untitled.py:4: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

import pandas as pd

adrinjalali commented 6 months ago

FYI, AWS dependencies of pyarrow are another huge issue:

https://github.com/scikit-learn/scikit-learn/pull/28258#issuecomment-1910294722
