phofl opened 1 year ago
More updates coming in due course
As promised: https://github.com/pandas-dev/pandas/issues/57073
Alternatively, if you want to just silence the warning for now:
It is quite unfortunate that the warning message starts with a newline, which makes it hard to target specifically by message with `python -W` or `PYTHONWARNINGS`, unless I missed something. For example, there is still a warning with this command:

```shell
python -W 'ignore:\nPyarrow:DeprecationWarning' -c 'import pandas'
```
I opened https://github.com/pandas-dev/pandas/issues/57082 about it.
Please remove the deprecation warning that appears every time pandas is imported! For example, make it appear only if some specific file does not exist, and have the deprecation message tell the user which file to create to suppress the warning.
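For illustration, the commenter's proposal could look something like this minimal sketch (the sentinel filename and function name are hypothetical, not part of pandas):

```python
import warnings
from pathlib import Path

# Hypothetical opt-out file; the name is an assumption for illustration only.
SENTINEL = Path.home() / ".pandas_no_pyarrow_warning"

def warn_unless_opted_out(sentinel: Path = SENTINEL) -> bool:
    """Warn about the upcoming pyarrow requirement unless the sentinel exists.

    Returns True if a warning was emitted, False if the user opted out.
    """
    if sentinel.exists():
        return False
    warnings.warn(
        "Pyarrow will become a required dependency of pandas 3.0. "
        f"Create {sentinel} to suppress this warning.",
        DeprecationWarning,
        stacklevel=2,
    )
    return True
```

Users who wanted the warning gone would then create the file once, e.g. `touch ~/.pandas_no_pyarrow_warning`.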
Note that pyarrow currently does not build with pypy: https://github.com/apache/arrow/issues/19046
I checked just now and indeed found a compilation failure:
```
FAILED: CMakeFiles/lib.dir/lib.cpp.o
/usr/bin/x86_64-pc-linux-gnu-g++ -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -Dlib_EXPORTS -I/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src -I/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/pyarrow/src -isystem /usr/include/pypy3.10 -isystem /usr/lib/pypy3.10/site-packages/numpy/core/include -Wno-noexcept-type -Wno-self-move -Wall -fno-semantic-interposition -msse4.2 -march=native -mtune=native -O3 -pipe -frecord-gcc-switches -flto=16 -fdiagnostics-color=always -march=native -mtune=native -O3 -pipe -frecord-gcc-switches -flto=16 -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized -O3 -DNDEBUG -O2 -ftree-vectorize -std=c++17 -fPIC -Wno-unused-function -Winvalid-pch -include /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/CMakeFiles/lib.dir/cmake_pch.hxx -MD -MT CMakeFiles/lib.dir/lib.cpp.o -MF CMakeFiles/lib.dir/lib.cpp.o.d -o CMakeFiles/lib.dir/lib.cpp.o -c /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp
In file included from /usr/include/pypy3.10/Python.h:55,
                 from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src/arrow/python/platform.h:27,
                 from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src/arrow/python/pch.h:24,
                 from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/CMakeFiles/lib.dir/cmake_pch.hxx:5,
                 from <command-line>:
/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp: In function ‘PyObject* __pyx_pf_7pyarrow_3lib_17SignalStopHandler_6__exit__(__pyx_obj_7pyarrow_3lib_SignalStopHandler*, PyObject*, PyObject*, PyObject*)’:
/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp:41444:7: error: ‘PyPyErr_SetInterrupt’ was not declared in this scope; did you mean ‘PyErr_SetInterrupt’?
41444 |       PyErr_SetInterrupt();
      |       ^~~~~~~~~~~~~~~~~~
```
VirusTotal is not always happy with Pyarrow wheels... example on 15.0 https://www.virustotal.com/gui/file/17d53a9d1b2b5bd7d5e4cd84d018e2a45bc9baaa68f7e6e3ebed45649900ba99
+1 to making it easier to silence the warning. I have no opinion on the `pyarrow` dependency change, but the red warning text in notebook outputs is distracting when they're meant to be published or shared with colleagues.
VirusTotal is not always happy with Pyarrow wheels... example on 15.0 https://www.virustotal.com/gui/file/17d53a9d1b2b5bd7d5e4cd84d018e2a45bc9baaa68f7e6e3ebed45649900ba99
Wasn't aware of that, thanks - is it happy with the current pandas wheels as they are? Is this fixable on the VirusTotal side, and if so, could it be reported to them?
It's happy with latest pandas wheels
Trying to simply install pyarrow to silence the DeprecationWarning causes our tests to fail, e.g.:
```
FAILED tests/core/test_meta.py::test_run_meta[test_sqlite_mp] - pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (large_string, double)
```

I'm not entirely sure why this happens, and it only does when `pandas[feather]` is installed, not with pandas by itself. So I guess I'll keep the warning until a much-appreciated migration guide clarifies how to address this issue (if pyarrow ends up being required).
@glatterf42 could you copy paste the test content?
Sure :)
There is more than one test, but they all boil down to the same line:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you,
I am getting this error after trying to import pandas
A little late to the party, but wanted to add an objection from me due to the hugely increased installation size from PyArrow.
Primarily, this relates to AWS Lambda. I use Pandas significantly in the AWS Lambda environment, and this would cause headaches. I think it is just about possible to get Pandas and PyArrow into a Lambda package, but it means there is very little room for anything else in there.

I tried to experiment with this recently, and couldn't get it small enough to leave room for the other packages I needed. I believe the work-around is to use containers with Lambda instead, but this requires a whole shift in deployment methodology for a single package dependency. There would be a further trade-off from the increased start times due to having to load a significantly larger package (or container).
I realise that this environment-specific objection may not have much weight, but my other comment would be:
Pandas is generally one of the first, most approachable ways for new users to start playing around with data and data-science tools -- specifically, a tool that can then be scaled towards more advanced usage. My experience has been that installing PyArrow can be a complex process, filled with pitfalls, that can turn what is currently a relatively simple installation into a real headache. I think this change could really harm the approachability of Pandas and put off future users.
I would strongly request that PyArrow remain an optional dependency that advanced users (who by definition would be able to handle any installation requirements), can install and configure if necessary.
Related recent literature on pyarrow, numpy, and string types: https://pola.rs/posts/polars-string-type/
Whenever I use pandas, this PyArrow warning shows up, and I run into this problem every time I run pandas in Python. Please help.
Sorry if I'm missing this somewhere, but is there a way to silence this warning?
is there a way to silence this warning?
Install pyarrow!
Or if you still want to avoid doing that for now, you can silence the warning with `warnings.filterwarnings` from the stdlib `warnings` module:

```python
>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas
```

(unfortunately it currently doesn't work as a `-W` command line argument or pytest config option, see https://github.com/pandas-dev/pandas/issues/57082)
Perfect! Thanks @jorisvandenbossche
```
Warning (from warnings module):
  File "C:/Git/Work/Pyton/Pandas_ecel.py", line 1
    import pandas as pd
DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
```
@jagoodhand I may have got it wrong but, from my understanding, by the time PyArrow becomes a mandatory dependency of Pandas 3.0.0, that dependency will be a new package that doesn't exist today, built basically around `libarrow` and called `pyarrow-minimal`. It will be much, much smaller (in size) and more portable (in terms of CPU architectures, which may narrow the gap to NumPy's current availability in that regard), and will be released with PyArrow 15.
@h-vetinari & devs, please correct me if I'm wrong...
@ZupoLlask - if it addresses the two issues I mentioned:
Then my objections aren't objections any more, but it doesn't sound like this is the case. Would be good to have more detail or confirmation on what this would look like though.
by the time PyArrow becomes a mandatory dependency of Pandas 3.0.0, that dependency will be a new package that doesn't exist today, built basically around `libarrow` and called `pyarrow-minimal`, that will be much, much smaller (in size) and more portable (in terms of CPU architectures, which may narrow the gap to NumPy's current availability in that regard) and will be released with PyArrow 15.
This is not exactly the case. Let me expand a little on what is happening at the moment:
The Arrow team did release Arrow and `pyarrow` 15.0.0 a couple of weeks ago. There is some ongoing work and effort from the Arrow community on reducing the footprint of minimal builds of Arrow. At the moment there is an open PR on the conda feedstock for Arrow, which I am working on, to enable several different installations of `pyarrow`. Based on review and design discussions, it seems there will be `pyarrow-core`, `pyarrow`, and `pyarrow-all`, with different subsets of features and sizes.
There is no change to the currently supported CPU architectures, but if your system is not supported, you can always open an issue or a feature request on the Arrow repository.
We still have to plan and do the work for publishing wheels on PyPI, but this requires planning and contributors willing to actively work on it. A related issue: https://github.com/apache/arrow/issues/24688
For the purpose of packaging PyArrow in smaller wheels, I created https://github.com/amol-/consolidatewheels, but it would require some real-world testing. https://github.com/amol-/wheeldeps was created as an example, but the more testing we can get, the faster we will be able to split the pyarrow wheels.
Well, if some of you are worried about RAM, 16 GB is optimal for solid work, but everyone gauges their own scope with their client.
This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. Raising deprecation warnings (especially in the main `__init__.py`) adds a lot of noise to downstream projects. It also creates a development burden for packages whose CI treats warnings as errors (see for example https://github.com/bokeh/bokeh/issues/13656 and https://github.com/zapatacomputing/orquestra-cirq/pull/53). Ideally, deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.
This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. [...] Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.
Even by your own logic, including a warning was the right choice. The inclusion of PyArrow will come with a major change to the pandas public API: "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object." (quoted from the PDEP)
However, I think a FutureWarning, as originally was proposed in the PDEP, would have made more sense than the DeprecationWarning that was implemented.
Regardless, if the deprecation warning creates issues for you, you can just install PyArrow to make it go away. If installing PyArrow would create issues for you, that's what this issue is for. Considering the change can cause CI failures, the warning preemptively causing CI failures seems like the lesser of two bad options.
Bias disclosure: I'm impacted negatively by the upcoming change.
I want to add that there are some Linux distributions on extended support or LTS, such as CentOS 7 and Ubuntu 18.04 LTS, where it would be very hard to install `pyarrow` because it doesn't get packaged for them.
My first thought was to replace pandas with another similar tool.
I dislike the process here, and I don't mean the dep warning.
I appreciate the message and asking for feedback, but it went out to everyone and that will include people like me who have no idea what's going on. It is generally your business how you run your project (Thank you for your work and software), but if you do want feedback and if you do want to be inclusive, please think about how you are onboarding to this issue.
Generally, complexity is bad and changing things is bad, because there is the risk of new errors. So you are starting at a negative score in my book, and this whole thing would require a significant gain, not just a neutral tradeoff between increased size and some performance.
(I think there is a general blindness in this respect from package maintainers, because you are working with this every day and you think some increase in complexity is acceptable for [reasons] and this continues for decades and then you have a bloated mess.)
Does it have to be done this way, can't you create a new package that uses the advantages of both packages and overrides the original function? Then if people want to they can use both and it leaves the original thing untouched. Maybe put a note into the docs pointing to the optimization.
- Surely you have discussed doing this and there are pros and cons for this. Please link to that discussion.
The discussion is linked in the PDEP itself - https://github.com/pandas-dev/pandas/pull/52711
I know this isn't super relevant to the discussion, but I want to throw this out here anyway. Sometimes, even a harmless change like displaying a `DeprecationWarning` can have undesired repercussions.
I teach Python courses for programming beginners, and since the 2.2.0 release I've received many questions and messages from students confused by the warning. They are left wondering if they installed pandas correctly, if they need to install something called "arrow", or whether they can continue the course at all.
Yes, I know the students should eventually get used to warning messages, and this discussion is definitely relevant to the Data Science community. But realistically, 99% of the people to ever `import pandas as pd` will never come remotely close to it.

As stated previously, if `pyarrow` ever becomes a dependency of `pandas` (disregarding whether that's a good or a bad thing), the vast majority of users shouldn't even notice any difference. Everything should "just work" when they type `pip install pandas`. As a result, I find the decision to display a `DeprecationWarning` to the entire user base upon importing pandas unfortunate.
Well, I think all these contributions for the discussion end up being useful for the community as a whole.
Maybe the developers could consider another approach to communicating the deprecation:
There is no perfect solution to deal with the current situation, but I'm positive PyArrow will bring very good benefits for Pandas in the future! 🙂
I want to follow up on https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1906654486 from above about a `pyarrow` extra. The message just says that you need to have "Pyarrow". It would be better if it suggested installing `pandas[feather]` (or `pandas[pyarrow]` if feather does not just mean `pyarrow`). Adding transitive dependencies to a project's dependency list should be avoided if possible. From the warning message, it seems that the suggested solution is to add `pyarrow` to your dependency list.

Also, since the warning directs users to this issue, it would be nice if the issue description were edited to include suggestions on how to avoid it -- both whether to add `pyarrow` to your dependencies or use `pandas[feather]`, and also the `filterwarnings` solution.
I agree @wshanks, I opened https://github.com/pandas-dev/pandas/pull/57284 to introduce that extra. If people like it, I can add a docs entry for Pandas 2.2.1
This change is making a mess in CI jobs. Suppressing the warning as suggested in https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166 is not a viable solution, and I could not even find a robust way to code "exclude Pandas versions >=2.2 AND < 3" as a requirement specifier in `pyproject.toml`.
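To illustrate why this is awkward: PEP 440 specifiers are combined with AND only, so "(<2.2) or (>=3)" cannot be expressed directly. The closest approximation excludes each warning-emitting release series by hand (a hypothetical sketch, and indeed not robust, since every future 2.x minor that still warns would need its own exclusion added as it is released):

```toml
[project]
dependencies = [
    # Excludes the 2.2 series wholesale; a hypothetical 2.3 series that
    # still emits the warning would need "!=2.3.*" added manually.
    "pandas>=1.5,!=2.2.*",
]
```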
This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. [...] Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.
Even by your own logic, including a warning was the right choice. The inclusion of PyArrow will come with a major change to the pandas public API: "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object." (quoted from the PDEP)
However, I think a FutureWarning, as originally was proposed in the PDEP, would have made more sense than the DeprecationWarning that was implemented.
Regardless, if the deprecation warning creates issues for you, you can just install PyArrow to make it go away. If installing PyArrow would create issues for you, that's what this issue is for. Considering the change can cause CI failures, the warning preemptively causing CI failures seems like the lesser of two bad options.
Bias disclosure: I'm impacted negatively by the upcoming change.
I agree that including a warning for string type inference makes sense. However, I'm not sure that the main `__init__.py` is the best place for this warning, because it creates noise for projects that do not depend on string type inference and therefore may not be affected by the change.
Also I understand that the warning can be suppressed by installing PyArrow. The point is that any approach to suppressing the warning requires a certain amount of knowledge and effort. I'm thinking for example of the questions that @jfaccioni-asimov gets from confused students.
When switching to `pyarrow` for the string dtype, it would be good if some of the existing performance issues with the string dtype were addressed beforehand. Currently (`pandas` 2.2.0), `string[pyarrow]` is the slowest option for some tasks:
```python
import pandas as pd
import timeit

points = 1000000
data = [f"data-{n}" for n in range(points)]
for dtype in ["object", "string", "string[pyarrow]"]:
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    %timeit df.loc['index-2000']
```
which returns

```
object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
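(For anyone re-running this outside IPython, where the `%timeit` magic isn't available, a rough plain-Python equivalent using the stdlib `timeit` module might look like the sketch below. The smaller `points` value is mine, to keep it quick, and `string[pyarrow]` is skipped unless pyarrow is installed.)

```python
import timeit

import pandas as pd

points = 100_000  # smaller than the original benchmark, to keep this quick

dtypes = ["object", "string"]
try:
    import pyarrow  # noqa: F401
    dtypes.append("string[pyarrow]")
except ImportError:
    pass

data = [f"data-{n}" for n in range(points)]
for dtype in dtypes:
    # Build a string index and look up a single label, as in the original.
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    seconds = timeit.timeit(lambda: df.loc["index-2000"], number=1000)
    print(f"{dtype}: {seconds / 1000 * 1e6:.1f} µs per lookup")
```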
I'm a contributor to Panel by HoloViz.
Pandas is used extensively in the HoloViz ecosystem. It's a hard requirement of Panel.
Usage in pyodide and pyscript has really benefitted us a lot. It has made our docs interactive and enabled our users to share live Python GUI applications in the browser without having to fund and manage a server.
As far as I can see, Pyarrow does not work with Pyodide. Does that mean Pandas would no longer work in Pyodide, and therefore that Panel would no longer work in Pyodide either?

Thinking beyond HoloViz Panel, I believe that making Pandas unusable in Pyodide, or increasing its download time, risks undoing all the gains Python in the browser has made with Pyodide and PyScript.
Thanks for asking for feedback. Thanks for Pandas.
There is ongoing work on Pyarrow support in Pyodide, for example see https://github.com/pyodide/pyodide/issues/2933. If I try to use my crystal ball, my guess is that the pandas developers have this in mind. Also, even if pandas 3.0 comes out requiring Pyarrow while Pyarrow support is still not there in Pyodide, you will always be able to use older pandas versions in Pyodide, so unless you need a pandas 3.0 feature, you will be fine.
Thx @lesteve .
These issues are not limited to Panel. They will limit the entire PyData ecosystem that uses Pyodide to make its docs interactive without spending huge amounts on servers. They will also limit Streamlit (Stlite), Gradio (Gradiolite), JupyterLite, PyScript, etc. running in the browser, which is where the next 10 million Python users are expected to come from.
Are there three distinct Arrow string types in pandas? Is the default going to be `string[pyarrow_numpy]`? What are the differences between the three string datatypes, and when should one be used over the others? Do they all perform the same because they use the same Arrow memory layout and compute kernels?
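For context, the three variants can be constructed explicitly. This sketch reflects my understanding of pandas 2.1+/2.2 behavior; the `.dtype.storage` attribute and the `string[pyarrow_numpy]` spelling are stated from that understanding, not from this thread:

```python
import pandas as pd

# "string" with Python (object-array) storage works without pyarrow installed.
s_py = pd.Series(["a", "b", None], dtype="string[python]")
print(s_py.dtype.storage)  # prints the storage backend name

# The Arrow-backed variants need pyarrow, so guard the import.
try:
    import pyarrow  # noqa: F401

    # Arrow storage with pandas NA semantics (missing values are pd.NA).
    s_arrow = pd.Series(["a", "b", None], dtype="string[pyarrow]")
    # Arrow storage with NumPy semantics (missing values are np.nan and
    # comparisons return plain NumPy bool arrays); added in pandas 2.1.
    s_arrow_np = pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]")
    print(s_arrow.dtype.storage, s_arrow_np.dtype.storage)
except ImportError:
    pass
```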
is there a way to silence this warning?

You can do it with `warnings.filterwarnings` from the stdlib `warnings` module:

```python
>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas
```

(unfortunately it currently doesn't work as a `-W` command line argument or pytest config option, see #57082)
If you're using pytest and the warnings are polluting your CI pipelines, you can ignore this warning by editing your pytest.ini like so:

```ini
[pytest]
filterwarnings =
    ignore:\nPyarrow:DeprecationWarning
```
Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)
I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).
I want to add that there are some versions of Linux distributions which are either extended support or LTS that would be very hard to install pyarrow as they don't get packaged like CentOS 7 and ubuntu 18.04 LTS
FYI, I've added pyarrow dep on 2024-01-20 to the Gentoo ebuild and requested testing on the architectures we support. So far it's looking grim — no success on ARM, AArch64, PowerPC, X86. I feel like I'm now being made responsible for fixing Arrow, that doesn't seem to be very portable in itself.
Arrow, that doesn't seem to be very portable in itself.
We build arrow and run the test suite successfully on all the mentioned architectures in conda-forge, though admittedly the stack of dependencies is pretty involved (grpc, protobuf, the major cloud SDKs, etc.). Feel free to check out our recipe if you need some inspiration, or open an issue on the feedstock if you have questions.
Dear maintainers and core devs,
thank you for making Pandas available to the community. Since you ask for feedback, here's my humble opinion.
As a longtime user and developer of open-source libraries which depend on Pandas, I mostly deal with (possibly) large DataFrames with homogeneous dtype (`np.float64`), and I treat them (for the most part) as wrappers around the corresponding 2-dimensional NumPy arrays. The reason I use Pandas DataFrames as opposed to plain NumPy arrays is that I find Pandas indexing capabilities to be its "killer" feature: it's much safer, from my point of view, to keep track of indexing in Pandas rather than NumPy, especially when considering datetime indexes or multi-indexes. The same applies to Series and 1-dimensional NumPy arrays.
I have no objections to using Arrow as back-end to store string, object dtypes, or in general non-homogeneous dtype Dataframes.
I would like, however, to hear whether you plan to switch away from Numpy as one of the core back-ends (in my usecases, the most important one). This is relevant for various reasons, including memory management. It would be great to know if in the future one will have to worry that manipulating large 2-dimensional Numpy Arrays of floats by casting them as Dataframes will involve a conversion into Arrow, and back to Numpy (if then I want them back as such). That would be very problematic, since it involves a whole new layer of complexity.
Thanks, Enzo
This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.
The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
If you would like to filter this warning without installing pyarrow at this time, please view this comment: https://github.com/pandas-dev/pandas/issues/54466#issuecomment-1919988166