pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

Open phofl opened 1 year ago

phofl commented 1 year ago

This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.

The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

If you would like to filter this warning without installing pyarrow at this time, please view this comment: https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166
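For reference, a minimal sketch of such a filter (the message pattern below is an assumption; the linked comment has the authoritative version), which must run before pandas is imported:

import warnings
# Ignore only the pyarrow-related DeprecationWarning emitted at import time.
warnings.filterwarnings(
    "ignore",
    message="(?s).*Pyarrow will become a required dependency",
    category=DeprecationWarning,
)
import pandas as pd  # noqa: E402  (import after the filter is installed)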

admajaus commented 8 months ago

Pyarrow is a HUGE library - it's over 70MB. If it's part of your deployment package to AWS Lambda or any cloud computing service with size restrictions, and you already have numpy, pandas, and a plotting library, this will easily put you over the size limit, even if you move your deployment package into cloud storage (S3, for example). If pyarrow is to become a pandas dependency, you need to carve out of the overall package what is actually needed, rather than making people download the whole, massive library.

mgorny commented 8 months ago

We build arrow and run the test suite successfully on all the mentioned architectures in conda-forge, though admittedly the stack of dependencies is pretty involved (grpc, protobuf, the major cloud SDKs, etc.). Feel free to check out our recipe if you need some inspiration, or open an issue on the feedstock if you have questions.

Does that include 32-bit arches? The errors I'm getting from pyarrow's test suite suggest things may not work at all:

FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_large_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string_with_missing - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_categorical - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_empty_dataframe - OverflowError: Python int too large to convert to C ssize_t

That said, I need to figure out if it's a problem with pyarrow or pandas first, before reporting this.

h-vetinari commented 8 months ago

Does that include 32-bit arches?

Nope, 64bit only, sorry. We already dropped support for 32bit years ago, and it's IMO very badly tested across the ecosystem, so I expect significant bitrot to have set in.

wimglenn commented 8 months ago

This is a strange use of DeprecationWarning. Nothing is being deprecated, and I would expect dependency changes in the next major release anyway. Using a warning for this informational message causes problems in environments where the user chooses to escalate warnings into errors.

It's tangential to the question of growing a pyarrow dependency, but I'm not sure that issuing a warning was the best way to collect user feedback.
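To illustrate the failure mode described above, a minimal sketch (assuming pandas 2.2.0 without pyarrow installed):

import warnings
# A test environment that escalates warnings to errors now fails at import time.
warnings.simplefilter("error", DeprecationWarning)
import pandas as pd  # noqa: E402  -- the informational warning is raised as an error here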

ZupoLlask commented 8 months ago

@admajaus It has been explained above that future mandatory pyarrow dependency will not imply current pyarrow package but the new pyarrow-core (libarrow only) or pyarrow-base (libarrow and libparquet only), that will be published for the first time in a matter of weeks.

https://github.com/conda-forge/arrow-cpp-feedstock/pull/1255

So the 70 MB figure will not apply.

bersbersbers commented 8 months ago

@admajaus It has been explained above that future mandatory pyarrow dependency will not imply current pyarrow package

I think it had not (at least not explicitly), but assuming that is true, thanks for clarifying!

zhizheng1 commented 8 months ago

@admajaus It has been explained above that future mandatory pyarrow dependency will not imply current pyarrow package but the new pyarrow-core (libarrow only) or pyarrow-base (libarrow and libparquet only), that will be published for the first time in a matter of weeks.

conda-forge/arrow-cpp-feedstock#1255

So the 70 MB figure will not apply.

My understanding is this is only for conda; how about PyPI wheels?

ZupoLlask commented 8 months ago

@zhizheng1 Your doubt is absolutely reasonable and I don't have an answer for you. However, this is the way I see it: as the developers working on this change for conda are Arrow developers, it wouldn't make sense for this change not to come to PyPI as well, even if it lands a bit later.

I may be wrong, but as long as the (hard) work is done for conda, it will only be a matter of time (well before the pandas 3.0 release) before the new wheels are available on PyPI.

combiz commented 8 months ago

My preference would be to add the PyArrow dependency via extras_require, with users who need the new functionality installing pandas with, e.g., pip install pandas[full]

bersbersbers commented 8 months ago

@zhizheng1 Your doubt is absolutely reasonable [...] I don't have an answer for you [...] the way I see it [...] it wouldn't make sense [...] I may be wrong

That's a lot of uncertainty regarding your earlier statement of

future mandatory pyarrow dependency will not imply current pyarrow package

ZupoLlask commented 8 months ago

@bersbersbers As far as I know, there's no fixed release month settled for pandas 3.0. From what I see in several repositories, there are several people working every day to bring a libarrow-only pyarrow-core to light.

Apart from that, this is easily one of the most commented issues in this repository. There's no evidence that these concerns won't be addressed properly.

Shall we give some time to let the dust settle a bit? :-)

Yes, I admit that those quotes seem inconsistent, but I see there are PRs that are going to be merged soon in the Arrow repository to enable this sort of split... only for conda? It makes no sense. There have been some comments regarding PyPI, but as that's not what is currently being worked on, I guess people are focusing on conda first, and PyPI will come next.

Drewskie75 commented 8 months ago

This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.

The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)

surprising but nice update coming soon

mroeschke commented 8 months ago

I'm going to try to summarize and respond to the prevailing themes and some questions in this thread as of 2024-02-15:


PyArrow as a required dependency in 3.0

I think the prevailing concerns so far (some of which are mentioned in the Drawbacks section of the proposal) center on installation size, platform support, and the added dependency burden.

While the plan as of now is to still move forward with PyArrow as a required dependency in pandas 3.0, which is tentatively scheduled for release in April 2024, I think the volume of response has spurred serious reconsideration of this decision in https://github.com/pandas-dev/pandas/issues/57073


The (annoying) DeprecationWarning upon importing pandas that probably led you here

The core team is currently voting in https://github.com/pandas-dev/pandas/issues/57424 on whether to remove this warning in pandas 2.2.1, which is scheduled to be released next week (the week of 2024-02-19)


Including a way to install pandas and get pyarrow automatically

At least when installing with pip, yes, we will add an extra so that pip users can use pip install pandas[pyarrow]


I would like, however, to hear whether you plan to switch away from Numpy as one of the core back-ends

@enzbus Numpy will probably never be dropped as a back-end, but like the current proposal, Numpy may not be the default back end for some types (strings, list, dict, decimal, etc.)


Are there 3 distinct arrow string types in pandas?

@wirable23 I would say "flavors" but (unfortunately) yes, due to legacy reasons/efforts to maintain backward compatibility
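For concreteness, a rough sketch of those flavors (assumes pyarrow is installed; exact behaviour varies across pandas versions):

import pandas as pd
import pyarrow as pa
s_object = pd.Series(["a", "b"])                                   # NumPy object dtype
s_string = pd.Series(["a", "b"], dtype="string[pyarrow]")          # StringDtype, pyarrow storage
s_arrow = pd.Series(["a", "b"], dtype=pd.ArrowDtype(pa.string()))  # ArrowDtype string
print(s_object.dtype, s_string.dtype, s_arrow.dtype)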

susmitpy commented 8 months ago

Adding pyarrow as a required dependency will cause the size of the pandas installation to explode. This is crucial for serverless functions such as AWS Lambda functions, GCP Cloud Functions, etc. Not only will it impact loading time, but these platforms also have size limits for the files you can attach. For example, the hard limit for an AWS Lambda layer is 250 MB. From experience, whenever I need to deal with parquet files, I use fastparquet instead of pyarrow due to the huge difference in size.
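A minimal sketch of that fastparquet-only pattern (file names are placeholders):

import pandas as pd
# Pin the parquet engine to fastparquet so pyarrow does not need to be installed.
df = pd.read_parquet("events.parquet", engine="fastparquet")
df.to_parquet("events_out.parquet", engine="fastparquet")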

Sai123-prathyu commented 8 months ago

Thank you, it's working.

dwgillies commented 8 months ago

I too was hoping to use pandas in an embedded AWS Lambda function. If the size explodes, this will be a huge overhead. I am currently using about 0.004% of the pandas library. From the looks of this discussion, my usage will not change, nor will I ever need pyarrow, but I will then be using only 0.0015% of the pandas library and paying dearly for it, probably by abandoning this bloated software.

I have found and verified that the deprecation warning can be suppressed with this : https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166

Does anyone have a procedure for installing pyarrow in cygwin?

Note: straightforward installation does not work.

cygwin$ python3.9 -m pip install pyarrow
...
      -- Generator: Unix Makefiles
      -- Build output directory: /tmp/pip-install-obx0lyoa/pyarrow_5ca48afb32b3451db0badc556c1c74fc/build/temp.cygwin-3.5.0-x86_64-cpython-39/release
      -- Found Python3: /usr/bin/python3.9.exe (found version "3.9.16") found components: Interpreter Development.Module NumPy
      -- Found Python3Alt: /usr/bin/python3.9.exe
      CMake Error at CMakeLists.txt:268 (find_package):
        By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
        asked CMake to find a package configuration file provided by "Arrow", but
        CMake did not find one.

        Could not find a package configuration file provided by "Arrow" with any of
        the following names:

          ArrowConfig.cmake
          arrow-config.cmake

        Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
        "Arrow_DIR" to a directory containing one of the above files.  If "Arrow"
        provides a separate development package or SDK, be sure it has been
        installed.

      -- Configuring incomplete, errors occurred!
      See also "/tmp/pip-install-obx0lyoa/pyarrow_5ca48afb32b3451db0badc556c1c74fc/build/temp.cygwin-3.5.0-x86_64-cpython-39/CMakeFiles/CMakeOutput.log".
      error: command '/usr/bin/cmake' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow, which is required to install pyproject.toml-based projects

yangrudan commented 8 months ago

my code is simple:

"""
Copyright (c) Cookie Yang. All right reserved.
"""
from __future__ import print_function, division
import os
import torch
import pandas as pd
# for easier CSV parsing
from skimage import io, transform
# for image IO and transforms
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
plt.ion()
# interactive mode

when I run the script:

python pic_io_csv.py 

/home/yangrudan/workspace/demo/pytorch_learn/pic_io_csv.py:7: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you

maksym-petrenko commented 8 months ago

What is the minimum version of PyArrow that will work with pandas?

jsolbrig commented 8 months ago

@dwgillies I don't use Cygwin, so I can only help a little with your installation issue. Pyarrow doesn't provide a wheel for your OS and architecture, so pip is trying to build a wheel from source. In order to build from source, pyarrow requires that you have libarrow (the Arrow C++ library) installed. If you install libarrow and then try to pip install again, it might work.

sahilfatima commented 8 months ago

Importing model

Soft-Buddy commented 8 months ago

I don't consider this a good decision; it will mean a huge increase in installation size :(

miraculixx commented 8 months ago

@dwgillies https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1955241211

Does anyone have a procedure for installing pyarrow in cygwin? Note: straightforward installation does not work.

That's a great question - many companies rely on Python + Pandas running in cygwin, mingw (through git-bash) and Msys in their Windows work PCs. It is often the best way to have a useful Python dev env in a corporate environment.

Will Pandas+PyArrow be supported in these environments? If not there is a high risk of lots of outdated installations bc these environments are rather sticky once deployed, and there is no easy way to upgrade to Linux or WSL.

Soft-Buddy commented 8 months ago

@dwgillies #54466 (comment)

Does anyone have a procedure for installing pyarrow in cygwin? Note: straightforward installation does not work.

That's a great question - many companies rely on Python + Pandas running in cygwin, mingw (through git-bash) and Msys in their Windows work PCs. It is often the best way to have a useful Python dev env in a corporate environment.

Will Pandas+PyArrow be supported in these environments? If not there is a high risk of lots of outdated installations bc these environments are rather sticky once deployed, and there is no easy way to upgrade to Linux or WSL.

Our issues kinda match, buddy. I use pandas in my Android app, which ships a cross-compiled copy of Python and pandas built using crossenv. PyArrow's installation doesn't work there either... and triggers some weird errors.

jkmackie commented 8 months ago

My general concern with the mandatory PyArrow dependency is the risk of chasing competing standards and of dependency issues such as bugs.

Kindly recall PDEP 10 lists three key benefits of pyarrow: (1) better pyarrow string memory/speed; (2) nested datatypes; and (3) interoperability.

PDEP Point 1 - Pandas 2.2.0 Performance

To make this less abstract, below are pandas performance stats based on the 1brc challenge of aggregating 1 billion rows of city temperatures as fast as possible.


1brc INPUT - 1 billion rows, 2 columns: city and temperature [image: 1brc_input]

OUTPUT - Temp mean/min/max by city [image: 1brc_output]

Metrics [image: metrics]


Memory

These metrics use the default DataFrame format: city is 'object' and temperature is 'float64'.

Turns out the city column 'object' format hogs :pig: 90% of the 'deep' memory usage ⵜ. This is indeed an issue! The last 10% of memory is temperatures. Downcasting to 'float32' halves memory for the temperature column.

ⵜ Memory Footnote: There's a mismatch between dataframe 'deep' memory usage (69GB) and the PC RAM increase I saw in Task Manager (about 23-24 GB) during pd.read_parquet(). My system memory is 64GB. Hard to believe 2GB memory compression accounts for the discrepancy.
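A toy sketch of the inspection and downcasting described above (column names assumed; the last conversion needs pyarrow):

import numpy as np
import pandas as pd
# Small stand-in for the 1brc frame.
df = pd.DataFrame({
    "city": ["Paris", "Oslo", "Lima", "Pune"] * 250_000,
    "temperature": np.random.uniform(-30, 45, 1_000_000),
})
print(df.memory_usage(deep=True))                        # 'city' (object) dominates
df["temperature"] = df["temperature"].astype("float32")  # halves the numeric column
df["city"] = df["city"].astype("string[pyarrow]")        # optional Arrow-backed strings
print(df.memory_usage(deep=True))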

Speed

Reading from parquet is 2.5 times as fast as reading from csv and takes one-fifth the space (snappy compression). Mean/min/max aggregation time was reasonable at under one minute.

PDEP Point 2 - Nesting

The PDEP 10 nested datatype example saves [{'a': 1, 'b': 2}, {'a': 2, 'b': 99}] to a Series rather than a DataFrame. The pyarrow benefit is saving an unknown nested structure as speed/memory efficient strings.
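A sketch of what that looks like with a pyarrow-backed struct dtype (requires pyarrow; the dtype spelling is an assumption based on the pandas 2.x ArrowDtype API):

import pandas as pd
import pyarrow as pa
data = [{"a": 1, "b": 2}, {"a": 2, "b": 99}]
# The whole dict is kept in one struct-typed column instead of being flattened.
dtype = pd.ArrowDtype(pa.struct([("a", pa.int64()), ("b", pa.int64())]))
ser = pd.Series(data, dtype=dtype)
print(ser.dtype)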

The existing alternative is to use pd.json_normalize() or pd.DataFrame() to load the example into a DataFrame with a column for each key. Foreknowledge of the format is required. Then downcast numeric columns with pd.to_numeric(df[mycol], downcast=<'integer', 'signed', 'unsigned', or 'float'>).
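A minimal sketch of that flattening alternative:

import pandas as pd
records = [{"a": 1, "b": 2}, {"a": 2, "b": 99}]
df = pd.json_normalize(records)   # one column per key: 'a', 'b'
for col in df.columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
print(df.dtypes)                  # both columns downcast to the smallest fitting integer dtype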

PDEP Point 3 - Interoperability

What about potential PyArrow C++ binding issues? Is this straightforward to debug and fix?


TAKEAWAYS

Pandas stock performance is good. :sunglasses: With foreknowledge of the nested format, data can be flattened into a DataFrame (with a column for each key). Numbers are downcast one column at a time.

The standout issue to me is the dtype: object. Why not build a solution in Pandas or NumPy?

hagenw commented 8 months ago

BTW, reading a CSV or parquet file is still faster by a factor of 5 for me when I do the reading with pyarrow and then convert to a pandas.DataFrame, compared to reading directly with pandas using pyarrow as the engine (but yes, using pyarrow as the datatype for strings is then faster than using object).
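A rough sketch of that pattern (file names are placeholders): read with pyarrow directly, then convert.

import pandas as pd
import pyarrow.csv
import pyarrow.parquet
table = pyarrow.csv.read_csv("data.csv")
# table = pyarrow.parquet.read_table("data.parquet")
df_numpy = table.to_pandas()                            # NumPy-backed dtypes
df_arrow = table.to_pandas(types_mapper=pd.ArrowDtype)  # pyarrow-backed dtypes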

jkmackie commented 8 months ago

BTW, reading a CSV or parquet file is still faster by a factor of 5 for me when I do the reading with pyarrow and then convert to a pandas.DataFrame, compared to reading directly with pandas using pyarrow as the engine (but yes, using pyarrow as the datatype for strings is then faster than using object).

@hagenw Would you kindly explain the result below? It looks like parquet uses a lot more peak RAM.

[image: csv_vs_parquet]

Windows users: In general, what explains the large discrepancy between DataFrame memory shown by df.info(memory_usage='deep') and Windows Task Manager (below is a Task Manager memory screenshot)? What is the right 'real-world' memory metric?

[image: task_manager]

hagenw commented 8 months ago

I measured peak memory consumption with memray, but I'm not completely sure if I did it correctly. I have some updated results in the dev branch (https://github.com/audeering/audb/tree/dev/benchmarks), there we see the following

[image: benchmark results]

So it seems to be more equal. The code I used to measure memory consumption is available at https://github.com/audeering/audb/blob/44de33f0fea1f4d003882d674dc696a8f0cfe95d/benchmarks/benchmark-dependencies-save-and-load.py. That uses memray and writes the results to binary files that you need to inspect afterwards to extract the result.
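For reference, a rough sketch of such a measurement using memray's Python API (file names are placeholders; the linked benchmark script is the authoritative version):

import memray
import pandas as pd
# Peak heap usage can be inspected afterwards, e.g. with `memray stats read_parquet.bin`.
with memray.Tracker("read_parquet.bin"):
    df = pd.read_parquet("data.parquet")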

ebuchlin commented 7 months ago

There have been / are some efforts to reduce the size of pandas (#30741), these efforts should not be wasted by a dependency which could perhaps remain optional (although I have no idea whether this is feasible). +120MB multiplied by the number of installs/environments/images/CI runs is not so small. It takes more time to download and install, more network usage, more storage... It's neither green, nor inclusive for situations/people/institutes/countries where resources are not as easily available as where these decisions are taken.

WillAyd commented 5 months ago

@susmitpy @dwgillies @admajaus pinging as the people that I think mentioned lambda in this thread.

AWS already has a tool called "AWS SDK for pandas" which itself requires pyarrow. There might be confusion on how AWS counts size limits (see https://github.com/aws/aws-sdk-pandas/issues/2761) but looks like it is definitely possible to run pandas + pyarrow in lambda.

Does this cover the concern for that platform?

susmitpy commented 5 months ago

@WillAyd

More often than not we need more than one library in an AWS Lambda function. There is a hard limit of 250 MB. With pandas increasing from 70 MB to 190 MB (according to one of the posts above), that leaves only 60 MB for other libraries. Pandas, being so helpful, powerful, and convenient, is always the go-to choice for dealing with data; if it becomes the reason you can only fit one or two other libraries alongside it, that will be a big issue.

cc: @dwgillies @admajaus

WillAyd commented 5 months ago

Have you tried the layer in the link above? It is not going to be a 120 MB increase, because AWS is not building a pyarrow wheel with all of the same options - it looks like they remove Gandiva and Flight support.

susmitpy commented 5 months ago

@WillAyd Just tried it.

179 MB is the layer's size.

WillAyd commented 5 months ago

Very helpful, thanks. And is the size of your current pandas + numpy + botocore + fastparquet image significantly smaller than that?

susmitpy commented 5 months ago

I don't think that's a proper comparison, as AWS Data Wrangler will also have support for reading parquet files, for which I currently resort to fastparquet because of its smaller size.


susmitpy commented 5 months ago

Also, to fetch files from S3 without downloading them first and then loading, s3fs is required - which I guess won't be needed when using the AWS SDK (not sure though).

WillAyd commented 5 months ago

Yeah, ultimately what I'm trying to gauge is how big of a difference it is. I don't have access to any lambda environments, but locally if I install your stack of pandas + numpy + fastparquet + botocore I get the following installation sizes in my site-packages folder:

75M pandas
39M numpy
37M numpy.libs
25M botocore
16M pip
7.9M    fastparquet

Adding up to almost 200 MB just from those packages alone.
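For anyone who wants to reproduce this locally, a rough sketch of tallying per-package sizes in the active environment's site-packages (roughly equivalent to `du -sh *`):

import site
from pathlib import Path
site_packages = Path(site.getsitepackages()[0])
sizes = {
    p.name: sum(f.stat().st_size for f in p.rglob("*") if f.is_file())
    for p in site_packages.iterdir()
    if p.is_dir()
}
for name, nbytes in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{nbytes / 1e6:8.1f} MB  {name}")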

If AWS is already distributing an image with pyarrow that is smaller than this, then I'm unsure about the apprehension toward this proposal on account of lambda environments. Is there a significant use case where users cannot use the already-distributed AWS environment that includes pandas + pyarrow, and if so, why should that hold pandas developers back from requiring pyarrow?

h-vetinari commented 5 months ago

As of a few hours ago, there's a pyarrow-core package on conda-forge (only for the latest v16), which should substantially cut down on the footprint.

The split of the cloud provider bindings out of core hasn't happened yet, but will further reduce the footprint once it happens.

MarcoGorelli commented 5 months ago

I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay.

I think the PDEP text wasn't precise here - pandas and numpy each require about 70MB (in fact, a bit more now, I just checked), so the relevant baseline is roughly 140MB rather than 70MB. On that basis the increase is more like 82% - not 170%. Still quite a lot, I don't mean to minimise it, but a lot less than has been stated here.

It's good to see that on the conda-forge side, things have become smaller. For the PyPI package, however, my understanding is that this is unlikely to happen any time soon

Have you tried the layer in the link above

I just tried this, and indeed, it works - pandas 2.2.2 and pyarrow 14.0.1 are included. I don't think it's as flexible as being able to install whichever versions you want, but it does seem like there is a workable way to use pandas in Lambda

tazzben commented 3 months ago

I would ask the pandas developers to consider the impact of this decision on PyScript/Pyodide. The ability to develop statistical tools that can be deployed as a web app (where it uses the client's CPU and not a server) is a game changer, but it does mean the web browser downloads all the packages the site needs. I'd also note that many packages (e.g., SciPy) require numpy, so the likely result is that both packages will end up being downloaded.

I'd also ask the developers to consider numba (outside the WASM environment). A lot of scientific code is accelerated by numba, which implements parts of numpy (among other things). My point is that it is unlikely this code can simply be replaced with pyarrow code. Again, both will end up being installed.

opresml commented 3 months ago

I think more people will comment on this in the form of backlash when they realize it has been done without their awareness. While we understand the value of PyArrow, it is not an absolute necessity for pandas, as demonstrated by historical performance and adoption. PyArrow is already available for those who need or want it. Pandas should have PyArrow integration, but it should not be a requirement for pandas to function. As a Pyodide/WASM developer, I can attest that payload size is paramount. PyArrow is just too big. Make the PyArrow integration easy, but not mandatory. Think about more than the big-data use case.

sam-s commented 3 months ago

Updating to numpy2 required reinstalling pyarrow. Then I got

Windows fatal exception: code 0xc0000139

Thread 0x00009640 (most recent call first):
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1289 in create_module
  File "<frozen importlib._bootstrap>", line 813 in module_from_spec
  File "<frozen importlib._bootstrap>", line 921 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "${localappdata}\miniconda3\envs\c312\Lib\site-packages\pyarrow\__init__.py", line 65 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "${localappdata}\miniconda3\envs\c312\Lib\site-packages\pandas\compat\pyarrow.py", line 8 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "${localappdata}\miniconda3\envs\c312\Lib\site-packages\pandas\compat\__init__.py", line 27 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "${localappdata}\miniconda3\envs\c312\Lib\site-packages\pandas\__init__.py", line 26 in <module>

uninstalling pyarrow removed 37(!) packages, and also removed the above error.

The point is that an extra dependency (especially such a huge one) increases fragility. I sympathize with the developers' desire to simplify their lives, but, as a user, I see only costs and no benefits in pyarrow.

rohitbewoor-ebmpapst commented 2 months ago

Hi, thank you for asking for feedback on this. All the points already raised about package size with pyarrow, wheels, default packages of Ubuntu, etc. are my concerns as well. Therefore, I propose the following, if it is even possible:

1) Keep the pandas 3.0 rollout without pyarrow. Old code bases continue to import and use it as they did before.
2) Create a totally new package, e.g. pandasarrow. New projects always use this. Old projects switch to importing it if it makes sense.
3) Usually we always "import pandas as pd" and then continue, so a switch to either "import pandasarrow as pd" or "import pandas as pd" would be easy to do.

My two cents.

soulphish commented 2 months ago

Not to beat a dead horse, but....

I use pandas in multiple projects, and each project has a virtual environment. Every new major version of Python gets a virtual environment for testing too. The size of these projects is not huge, but they have all increased massively, and the storage requirement across all those environments adds up quickly.

Just something to keep in mind. I know there is talk of pyarrow being reduced in size too, which would be great. I admit, I have not read the full discussion, so this may have been covered already, and I apologize if it has been.

agriyakhetarpal commented 2 months ago

Hi all – not to segue into the discussion about the increase in bandwidth usage and download sizes since many others have put out their thoughts about that already, but PyArrow in Pyodide has been merged and will be available in the next release: https://github.com/pyodide/pyodide/pull/4950/

Runa7debug commented 4 weeks ago

I got this warning in the lab for Module 2 of Course 3 (Data Science):

:1: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

import pandas as pd  # import library to read data into dataframe

bersbersbers commented 3 weeks ago

It's a bit unfortunate that with pyarrow dependencies, using pandas on Python 3.13 is now effectively blocked by https://github.com/apache/arrow/issues/43519. Making pyarrow required will aggravate such issues in the future.

miraculixx commented 2 weeks ago

Reading this thread, it appears that after more than 12 months of collecting feedback, most comments are not in favor of pyarrow being a dependency, or at least voice some concern. I haven't done a formal analysis, but it appears there are a few common themes:

Concerns

  1. Pyarrow's package size is considered to be very/too large for a mandatory dependency
  2. There is additional and often unwarranted complexity in pyarrow installation (e.g. version conflicts, platform not supported)
  3. Pyarrow's functionality is not needed for all pandas use cases, so having to install it seems unnecessary in those cases

Suggested paths forward

a. Make it easy to use pandas with pyarrow, yet keep it an optional dependency
b. Make it easy to install pyarrow with pandas by reducing its size and installation complexity (e.g. by reducing the dependency to pyarrow-base instead of the full pyarrow)

(I may be biased in summarizing this, anyone feel free to correct this if you find your analysis is different)

Since this is a solicited feedback channel established for the community to share their thoughts regarding PDEP-10, (how) will the decision be reconsidered @phofl? Thank you for all your efforts.

asishm commented 2 weeks ago

Since this is a solicited feedback channel established for the community to share their thoughts regarding PDEP-10, (how) will the decision be reconsidered @phofl? Thank you for all your efforts.

There is an open PDEP under consideration to reject pdep-10. https://github.com/pandas-dev/pandas/pull/58623 If (when?) it gets finalized, it'll get put to a vote.