pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

Open phofl opened 1 year ago

phofl commented 1 year ago

This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.

The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

If you would like to filter this warning without installing pyarrow at this time, please view this comment: https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166

MarcoGorelli commented 7 months ago

More updates coming in due course

As promised: https://github.com/pandas-dev/pandas/issues/57073

lesteve commented 7 months ago

Alternatively, if you want to just silence the warning for now:

It is quite unfortunate that the warning message starts with a newline, which makes it hard to target specifically by message with python -W or PYTHONWARNINGS, unless I missed something. For example, there is still a warning with this command:

python -W 'ignore:\nPyarrow:DeprecationWarning' -c 'import pandas'

I opened https://github.com/pandas-dev/pandas/issues/57082 about it.
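
To illustrate the problem, here is a self-contained sketch (the warning text is made up, not the real pandas message): the message field of `-W` is escaped literally by the stdlib, so the two characters `\n` in the filter can never match a warning message that starts with an actual newline.

```python
import subprocess
import sys

# A warning whose message starts with a real newline, like the pandas one.
code = "import warnings; warnings.warn('\\nPyarrow demo', DeprecationWarning)"

# The message part of -W is passed through re.escape(), so backslash + n
# stay two literal characters and never match an actual newline.
result = subprocess.run(
    [sys.executable, "-W", r"ignore:\nPyarrow", "-c", code],
    capture_output=True,
    text=True,
)
print("warning still shown:", "DeprecationWarning" in result.stderr)
# prints: warning still shown: True
```

By contrast, `warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)` works, because there the message argument is a regex string that can contain a real newline.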

Youjin1985 commented 7 months ago

Please don't show the deprecation warning every time pandas is imported! For example, make it appear only if some specific file does not exist, and have the deprecation message tell the user which file to create to suppress the warning.

AndrewAmmerlaan commented 7 months ago

Note that pyarrow currently does not build with pypy: https://github.com/apache/arrow/issues/19046

I checked just now and indeed found a compilation failure:

FAILED: CMakeFiles/lib.dir/lib.cpp.o
/usr/bin/x86_64-pc-linux-gnu-g++ -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -Dlib_EXPORTS -I/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src -I/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/pyarrow/src -isystem /usr/include/pypy3.10 -isystem /usr/lib/pypy3.10/site-packages/numpy/core/include -Wno-noexcept-type -Wno-self-move  -Wall -fno-semantic-interposition -msse4.2 -march=native -mtune=native -O3 -pipe -frecord-gcc-switches -flto=16 -fdiagnostics-color=always -march=native -mtune=native -O3 -pipe -frecord-gcc-switches -flto=16 -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized -O3 -DNDEBUG -O2 -ftree-vectorize  -std=c++17 -fPIC -Wno-unused-function -Winvalid-pch -include /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/CMakeFiles/lib.dir/cmake_pch.hxx -MD -MT CMakeFiles/lib.dir/lib.cpp.o -MF CMakeFiles/lib.dir/lib.cpp.o.d -o CMakeFiles/lib.dir/lib.cpp.o -c /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp
In file included from /usr/include/pypy3.10/Python.h:55,
from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src/arrow/python/platform.h:27,
from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src/arrow/python/pch.h:24,
from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/CMakeFiles/lib.dir/cmake_pch.hxx:5,
from <command-line>:
/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp: In function ‘PyObject* __pyx_pf_7pyarrow_3lib_17SignalStopHandler_6__exit__(__pyx_obj_7pyarrow_3lib_SignalStopHandler*, PyObject*, PyObject*, PyObject*)’:
/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp:41444:7: error: ‘PyPyErr_SetInterrupt’ was not declared in this scope; did you mean ‘PyErr_SetInterrupt’?
41444 |       PyErr_SetInterrupt();
      |       ^~~~~~~~~~~~~~~~~~

stonebig commented 7 months ago

VirusTotal is not always happy with Pyarrow wheels... example on 15.0 https://www.virustotal.com/gui/file/17d53a9d1b2b5bd7d5e4cd84d018e2a45bc9baaa68f7e6e3ebed45649900ba99

migurski commented 7 months ago

+1 to making it easier to silence the warning. I have no opinion on the pyarrow dependency change but the red warning text in notebook outputs is distracting when they’re meant to be published or shared with colleagues.

MarcoGorelli commented 7 months ago

VirusTotal is not always happy with Pyarrow wheels... example on 15.0 https://www.virustotal.com/gui/file/17d53a9d1b2b5bd7d5e4cd84d018e2a45bc9baaa68f7e6e3ebed45649900ba99

Wasn't aware of that, thanks - is it happy with the current pandas wheels as they are? Is this fixable on the VirusTotal side, and if so, could it be reported to them?

stonebig commented 7 months ago

It's happy with the latest pandas wheels.

glatterf42 commented 7 months ago

Trying to simply install pyarrow to silence the DeprecationWarning causes our tests to fail, e.g.:

FAILED tests/core/test_meta.py::test_run_meta[test_sqlite_mp] - pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (large_string, double)

I'm not entirely sure why this happens and it only does when pandas[feather] is installed, not with pandas itself. So I guess I'll keep the warning until a much-appreciated migration guide clarifies how to address this issue (if pyarrow ends up being required).

phofl commented 7 months ago

@glatterf42 could you copy paste the test content?

glatterf42 commented 7 months ago

Sure :)

There is more than one test, but they all boil down to the same line:

Full traceback of one test:

```python
______________________________________________________ test_run_meta[test_sqlite_mp] _______________________________________________________

test_mp = , request = >

    @all_platforms
    def test_run_meta(test_mp, request):
        test_mp = request.getfixturevalue(test_mp)
        run1 = test_mp.runs.create("Model 1", "Scenario 1")
        run1.set_as_default()
        # set and update different types of meta indicators
>       run1.meta = {"mint": 13, "mfloat": 0.0, "mstr": "foo"}

tests/core/test_meta.py:18:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ixmp4/core/run.py:52: in meta
    self._meta._set(meta)
ixmp4/core/run.py:122: in _set
    self.backend.meta.bulk_upsert(df)
ixmp4/core/decorators.py:15: in wrapper
    return checked_func(*args, **kwargs)
.venv/lib/python3.10/site-packages/pandera/decorators.py:754: in _wrapper
    out = wrapped_(*validated_pos.values(), **validated_kwd)
ixmp4/data/auth/decorators.py:37: in guarded_func
    return func(self, *args, **kwargs)
ixmp4/data/db/meta/repository.py:194: in bulk_upsert
    super().bulk_upsert(type_df)
ixmp4/data/db/base.py:339: in bulk_upsert
    self.bulk_upsert_chunk(df)
ixmp4/data/db/base.py:357: in bulk_upsert_chunk
    cond.append(df[col] != df[updated_col])
.venv/lib/python3.10/site-packages/pandas/core/ops/common.py:76: in new_method
    return method(self, other)
.venv/lib/python3.10/site-packages/pandas/core/arraylike.py:44: in __ne__
    return self._cmp_method(other, operator.ne)
.venv/lib/python3.10/site-packages/pandas/core/series.py:6099: in _cmp_method
    res_values = ops.comparison_op(lvalues, rvalues, op)
.venv/lib/python3.10/site-packages/pandas/core/ops/array_ops.py:330: in comparison_op
    res_values = op(lvalues, rvalues)
.venv/lib/python3.10/site-packages/pandas/core/ops/common.py:76: in new_method
    return method(self, other)
.venv/lib/python3.10/site-packages/pandas/core/arraylike.py:44: in __ne__
    return self._cmp_method(other, operator.ne)
.venv/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py:704: in _cmp_method
    result = pc_func(self._pa_array, self._box_pa(other))
.venv/lib/python3.10/site-packages/pyarrow/compute.py:246: in wrapper
    return func.call(args, None, memory_pool)
pyarrow/_compute.pyx:385: in pyarrow._compute.Function.call
    ???
pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (large_string, double)

pyarrow/error.pxi:91: ArrowNotImplementedError
```
Verbose description: The test is defined [here](https://github.com/iiasa/ixmp4/blob/2b6f5eff52d2a36904264126f619b21438db7df1/tests/core/test_meta.py#L12-L18), with the fixtures coming from [here](https://github.com/iiasa/ixmp4/blob/2b6f5eff52d2a36904264126f619b21438db7df1/tests/utils.py#L26-L34) and [here](https://github.com/iiasa/ixmp4/blob/2b6f5eff52d2a36904264126f619b21438db7df1/tests/conftest.py#L112-L137). The line in question is in [ixmp4/data/db/base.py](https://github.com/iiasa/ixmp4/blob/2b6f5eff52d2a36904264126f619b21438db7df1/ixmp4/data/db/base.py#L344-L357), in the `bulk_upsert_chunk()` function. It combines a `pandas.DataFrame` of existing rows with one of to-be-added rows and then tries to figure out which of the columns was updated. There's a [limited set of columns](https://github.com/iiasa/ixmp4/blob/2b6f5eff52d2a36904264126f619b21438db7df1/ixmp4/data/db/meta/model.py#L30-L36) that may be updated. During the combination process, the to-be-added columns receive a `_y` suffix to be distinguishable. If such an updatable column is found in the combined dataframe, a bool should be added to a list indicating whether it's truly different from the existing one. And precisely this condition check, `df[col] != df[updated_col]`, fails when pyarrow is present.
ItsSatviK13 commented 7 months ago

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you,

I am getting this warning every time I import pandas.

jagoodhand commented 7 months ago

A little late to the party, but wanted to add an objection from me due to the hugely increased installation size from PyArrow.

Primarily, this relates to AWS Lambda. I use Pandas significantly in the AWS Lambda environment, and this would cause headaches. I think it is just about possible to get Pandas and PyArrow into a Lambda package, but that leaves very little room for anything else in there.

I tried to experiment with this recently, and couldn't get it small enough to fit the other things I wanted in the package. I believe the work-around is to use containers with Lambda instead, but that requires a whole shift in deployment methodology for a single package dependency. There would be a further trade-off from the increased start times due to having to load a significantly larger package (or container).

I realise that this environment-specific objection may not have much weight, but my other comment would be:

Pandas is generally one of the first, most approachable ways for new users to start playing around with data and data-science tools; specifically, a tool that can then be scaled towards more advanced usage. My experience has been that installing PyArrow can be a complex process, filled with pitfalls, that can turn what is currently a relatively simple installation into a real headache. I think this change could really harm the approachability of Pandas and put off future users.

I would strongly request that PyArrow remain an optional dependency that advanced users (who by definition would be able to handle any installation requirements), can install and configure if necessary.

alippai commented 7 months ago

Next to pyarrow and numpy, related (recent) literature https://pola.rs/posts/polars-string-type/

putulsaini commented 7 months ago

Whenever I use pandas, this PyArrow warning shows up; I get this problem every time I run pandas in Python. Please help.

Rich5 commented 7 months ago

Sorry if I'm missing this somewhere, but is there a way to silence this warning?

jorisvandenbossche commented 7 months ago

is there a way to silence this warning?

Install pyarrow!

Or if you still want to avoid doing that for now, you can silence the warning with the stdlib warnings.filterwarnings function:

>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas

(unfortunately it currently doesn't work as -W command line argument or pytest config option, see https://github.com/pandas-dev/pandas/issues/57082)

Rich5 commented 7 months ago

Perfect! Thanks @jorisvandenbossche

Ygrik1308 commented 7 months ago

Warning (from warnings module):
  File "C:/Git/Work/Pyton/Pandas_ecel.py", line 1
    import pandas as pd
DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

ZupoLlask commented 7 months ago

@jagoodhand I may have got it wrong but, from my understanding, by the time PyArrow becomes a mandatory dependency of Pandas 3.0.0, that dependency will be a new package that doesn't exist today: essentially libarrow with a minimal wrapper, called pyarrow-minimal. It will be much, much smaller in size and more portable across CPU architectures (which may narrow the gap to NumPy's current availability in that regard), and it will be released with PyArrow 15.

@h-vetinari & devs, please correct me if I'm wrong...

jagoodhand commented 7 months ago

@ZupoLlask - if it addresses the two issues I mentioned:

Then my objections aren't objections any more, but it doesn't sound like this is the case. Would be good to have more detail or confirmation on what this would look like though.

raulcd commented 7 months ago

by the time PyArrow becomes a mandatory dependency of Pandas 3.0.0, that dependency will be a new package that doesn't exist today: essentially libarrow with a minimal wrapper, called pyarrow-minimal. It will be much, much smaller in size and more portable across CPU architectures (which may narrow the gap to NumPy's current availability in that regard), and it will be released with PyArrow 15.

This is not exactly the case. Let me expand a little on what is happening at the moment:

The Arrow team released Arrow and pyarrow 15.0.0 a couple of weeks ago. There is ongoing work and effort from the Arrow community to reduce the footprint of minimal Arrow builds. At the moment there is an open PR on the conda feedstock for Arrow, which I am working on, to provide several different installation options for pyarrow. Based on review and design discussions, it seems there will be pyarrow-core, pyarrow and pyarrow-all, with different subsets of features and sizes.

There is no change to the currently supported CPU architectures, but if your system is not supported, you can always open an issue or a feature request on the Arrow repository.

We still have to plan and do the work for published wheels on PyPI, which requires planning and contributors actively working on it. Some related issues: https://github.com/apache/arrow/issues/24688

amol- commented 7 months ago

We still have to plan and do the work for published wheels on PyPI, which requires planning and contributors actively working on it. Some related issues: apache/arrow#24688

For the purpose of being able to package PyArrow in smaller wheels, I created https://github.com/amol-/consolidatewheels, but it needs some real-world testing. https://github.com/amol-/wheeldeps was created as an example; the more testing we can get, the faster we will be able to split the pyarrow wheels.

CotaZ commented 7 months ago

Well, if some people are worried about RAM: 16 GB is the optimum for doing solid work, but everyone gauges their own scope with their client.

max-radin commented 7 months ago

This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. Raising deprecation warnings (especially in the main __init__.py) adds a lot of noise to downstream projects. It also creates a development burden for packages whose CI treats warnings as errors (see for example https://github.com/bokeh/bokeh/issues/13656 and https://github.com/zapatacomputing/orquestra-cirq/pull/53). Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.

mynewestgitaccount commented 7 months ago

This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. [...] Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.

Even by your own logic, including a warning was the right choice. The inclusion of PyArrow will come with a major change to the pandas public API: "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object." (quoted from the PDEP)

However, I think a FutureWarning, as originally was proposed in the PDEP, would have made more sense than the DeprecationWarning that was implemented.

Regardless, if the deprecation warning creates issues for you, you can just install PyArrow to make it go away. If installing PyArrow would create issues for you, that's what this issue is for. Considering the change can cause CI failures, the warning preemptively causing CI failures seems like the lesser of two bad options.


Bias disclosure: I'm impacted negatively by the upcoming change.

MohamedElashri commented 7 months ago

I want to add that there are some Linux distributions on extended support or LTS, like CentOS 7 and Ubuntu 18.04 LTS, on which it would be very hard to install pyarrow, as it doesn't get packaged for them.

Alexey174 commented 7 months ago

The first thought that arose was to replace pandas with another, similar tool.

BMaxV commented 7 months ago

I dislike the process here, and I don't mean the dep warning.

I appreciate the message and asking for feedback, but it went out to everyone and that will include people like me who have no idea what's going on. It is generally your business how you run your project (Thank you for your work and software), but if you do want feedback and if you do want to be inclusive, please think about how you are onboarding to this issue.

Generally, complexity is bad and changing things is bad, because there is the risk of new errors. So you are starting from a negative score in my book, and this whole thing would require a significant gain, not just a neutral trade-off between increased size and some performance.

(I think there is a general blindness in this respect from package maintainers, because you are working with this every day and you think some increase in complexity is acceptable for [reasons] and this continues for decades and then you have a bloated mess.)

Does it have to be done this way? Can't you create a new package that uses the advantages of both packages and overrides the original function? Then, if people want to, they can use both, and it leaves the original thing untouched. Maybe put a note in the docs pointing to the optimization.

asishm-wk commented 7 months ago
  • Surely you have discussed doing this and there are pros and cons for this. Please link to that discussion.

The discussion is linked in the PDEP itself - https://github.com/pandas-dev/pandas/pull/52711

jfaccioni-asimov commented 7 months ago

I know this isn't super relevant to the discussion, but I want to throw this out here anyway. Sometimes, even a harmless change like displaying a DeprecationWarning can have undesired repercussions.

I teach Python courses for programming beginners, and since the 2.2.0 release I've received many questions and messages from students confused by the warning. They are left wondering if they installed pandas correctly, if they need to install something called "arrow", or whether they can continue the course at all.

Yes, I know the students should eventually get used to warning messages, and this discussion is definitely relevant to the Data Science community. But realistically, 99% of the people to ever import pandas as pd will never come remotely close to it.

As stated previously, if pyarrow ever becomes a dependency of pandas (disregarding whether that's a good or a bad thing), the vast majority of users shouldn't even notice any difference. Everything should "just work" when they type pip install pandas. As a result, I find the decision to display a DeprecationWarning to the entire user base upon importing pandas unfortunate.

ZupoLlask commented 7 months ago

Well, I think all these contributions for the discussion end up being useful for the community as a whole.

Maybe developers may consider another approach regarding communication of deprecation:

There is no perfect solution to deal with the current situation, but I'm positive PyArrow will bring very good benefits for Pandas in the future! 🙂

wshanks commented 7 months ago

I want to follow up on https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1906654486 from above about a pyarrow extra. The message just says that you need to have "Pyarrow". It would be better if it suggested installing pandas[feather] (or pandas[pyarrow], if feather does not just mean pyarrow). Adding transitive dependencies to a project's dependency list should be avoided if possible, yet from the warning message it seems the suggested solution is to add pyarrow to your dependency list.

Also, since the warning directs users to this issue, it would be nice if the issue description were edited to include suggestions on how to avoid it -- both whether to add pyarrow to your dependencies or use pandas[feather] and also the filterwarnings solution.
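
For a downstream project, the extras-based suggestion above would look roughly like this in pyproject.toml (a sketch; `feather` is the pandas extra that maps to pyarrow in pandas 2.2's packaging metadata, while a dedicated `pyarrow` extra is only being proposed):

```toml
[project]
dependencies = [
    # Pull in pyarrow through pandas' own extra rather than listing the
    # transitive dependency directly.
    "pandas[feather]>=2.2",
]
```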

jamesbraza commented 7 months ago

I agree @wshanks, I opened https://github.com/pandas-dev/pandas/pull/57284 to introduce that extra. If people like it, I can add a docs entry for Pandas 2.2.1

gipert commented 7 months ago

This change is making a mess in CI jobs. Suppressing the warning as suggested in https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166 is not a viable solution and I could not even find a robust way to code "exclude Pandas versions >=2.2 AND < 3" as a requirement specifier in pyproject.toml.

max-radin commented 7 months ago

This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. [...] Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.

Even by your own logic, including a warning was the right choice. The inclusion of PyArrow will come with a major change to the pandas public API: "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object." (quoted from the PDEP)

However, I think a FutureWarning, as originally was proposed in the PDEP, would have made more sense than the DeprecationWarning that was implemented.

Regardless, if the deprecation warning creates issues for you, you can just install PyArrow to make it go away. If installing PyArrow would create issues for you, that's what this issue is for. Considering the change can cause CI failures, the warning preemptively causing CI failures seems like the lesser of two bad options.

Bias disclosure: I'm impacted negatively by the upcoming change.

I agree that including a warning for string type inference makes sense. However I'm not sure that the main __init__.py is the best place for this warning because it creates noise for projects that do not depend on string type inference and therefore may not be affected by the change.

Also I understand that the warning can be suppressed by installing PyArrow. The point is that any approach to suppressing the warning requires a certain amount of knowledge and effort. I'm thinking for example of the questions that @jfaccioni-asimov gets from confused students.

hagenw commented 7 months ago

When switching to pyarrow for the string dtype, it would be good if some of the existing performance issues with the string dtype were addressed beforehand. Currently (pandas 2.2.0), string[pyarrow] is the slowest option for some tasks:

import pandas as pd

points = 1000000
data = [f"data-{n}" for n in range(points)]
for dtype in ["object", "string", "string[pyarrow]"]:
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    %timeit df.loc['index-2000']  # IPython magic

which returns

object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
MarcSkovMadsen commented 7 months ago

I'm a contributor to Panel by HoloViz.

Pandas is used extensively in the HoloViz ecosystem. It's a hard requirement of Panel.

Usage in pyodide and pyscript has really benefitted us a lot. It has made our docs interactive and enabled our users to share live Python GUI applications in the browser without having to fund and manage a server.

As far as I can see Pyarrow does not work with pyodide. I.e. Pandas would no longer work in Pyodide? I.e. Panel would no longer work in Pyodide?


Thinking outside of HoloViz Panel, I believe that making Pandas unusable in Pyodide, or increasing the download time, risks nullifying all the gains of Python in the browser with Pyodide and PyScript.

Thanks for asking for feedback. Thanks for Pandas.

lesteve commented 7 months ago

There is ongoing work on PyArrow support in Pyodide, for example see https://github.com/pyodide/pyodide/issues/2933. If I try to use my crystal ball, my guess is that the pandas developers have this in mind. Also, even if pandas 3.0 comes out requiring PyArrow while PyArrow support is still not there in Pyodide, you will always be able to use older pandas versions in Pyodide, so unless you need a pandas 3.0 feature you will be fine.

MarcSkovMadsen commented 7 months ago

Thx @lesteve .

These issues are not limited to Panel. They will limit the entire PyData ecosystem that uses Pyodide to make its docs interactive without spending huge amounts on servers. They will also limit Streamlit (Stlite), Gradio (Gradiolite), JupyterLite, PyScript, etc. running in the browser, which is where the next 10 million Python users are expected to come from.

wirable23 commented 7 months ago

Are there 3 distinct arrow string types in pandas?

  1. "string[pyarrow_numpy]"
  2. "string[pyarrow]"
  3. pd.ArrowDtype(pa.string())

Is the default going to be string[pyarrow_numpy]? What are the differences between the 3 string datatypes, and when should one be used over the others? Do they all perform the same because they use the same Arrow memory layout and compute kernels?

mjugl commented 7 months ago

is there a way to silence this warning?

You can do it with the stdlib warnings.filterwarnings function:

>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas

(unfortunately it currently doesn't work as -W command line argument or pytest config option, see #57082)

If you're using pytest and the warnings are polluting your CI pipelines, you can ignore this warning by editing your pytest.ini like so:

[pytest]
filterwarnings =
    ignore:\nPyarrow:DeprecationWarning

See pytest docs on controlling warnings.

kanhaiya0318 commented 7 months ago

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

dwivedi281 commented 6 months ago

is there a way to silence this warning?

You can do it with the stdlib warnings.filterwarnings function:

>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas

(unfortunately it currently doesn't work as -W command line argument or pytest config option, see #57082)

I want to add that there are some Linux distributions on extended support or LTS, like CentOS 7 and Ubuntu 18.04 LTS, on which it would be very hard to install pyarrow, as it doesn't get packaged for them.

mgorny commented 6 months ago

FYI, I've added the pyarrow dep on 2024-01-20 to the Gentoo ebuild and requested testing on the architectures we support. So far it's looking grim — no success on ARM, AArch64, PowerPC, x86. I feel like I'm now being made responsible for fixing Arrow, which doesn't seem to be very portable in itself.

h-vetinari commented 6 months ago

Arrow, which doesn't seem to be very portable in itself.

We build Arrow and run the test suite successfully on all the mentioned architectures in conda-forge, though admittedly the stack of dependencies is pretty involved (grpc, protobuf, the major cloud SDKs, etc.). Feel free to check out our recipe if you need some inspiration, or open an issue on the feedstock if you have questions.

enzbus commented 6 months ago

Dear maintainers and core devs,

thank you for making Pandas available to the community. Since you ask for feedback, here's my humble opinion.

As a longtime user and developer of open-source libraries that depend on Pandas, I mostly deal with (possibly large) DataFrames of homogeneous dtype (np.float64), and I treat them (for the most part) as wrappers around the corresponding 2-dimensional NumPy arrays. The reason I use Pandas DataFrames as opposed to plain NumPy arrays is that I find Pandas' indexing capabilities to be its "killer" feature; it's much safer from my point of view to keep track of indexing in Pandas rather than NumPy, especially when considering datetime indexes or multi-indexes. The same applies to Series and 1-dimensional NumPy arrays.

I have no objections to using Arrow as back-end to store string, object dtypes, or in general non-homogeneous dtype Dataframes.

I would like, however, to hear whether you plan to switch away from NumPy as one of the core back-ends (in my use cases, the most important one). This is relevant for various reasons, including memory management. It would be great to know whether, in the future, one will have to worry that manipulating large 2-dimensional NumPy arrays of floats by casting them as DataFrames will involve a conversion into Arrow and back to NumPy (if I then want them back as such). That would be very problematic, since it involves a whole new layer of complexity.

Thanks, Enzo