pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `Series.clip` does not work with scalar numpy arrays. #59053

Open randolf-scholz opened 1 week ago

randolf-scholz commented 1 week ago

Reproducible Example

import numpy as np
import pandas as pd
pd.Series([-1,2,3]).clip(lower=np.array(0))

Results in TypeError: len() of unsized object.

Issue Description

The following line tries to compute len(other), but 0-dimensional (scalar) arrays do not support len().

https://github.com/pandas-dev/pandas/blob/c46fb76afaf98153b9eef97fc9bbe9077229e7cd/pandas/core/series.py#L5892-L5894

If we remove these two lines, the above example produces the expected result, and still errors as expected if e.g. a list of incorrect size is passed.

Expected Behavior

Scalar arrays should be treated like scalars.
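
Until that is the case, one possible workaround is to unwrap the 0-dimensional array before passing it to clip. The sketch below assumes a helper named unwrap_zerodim, which is hypothetical and not part of pandas; it relies only on ndarray.ndim and ndarray.item().

```python
import numpy as np
import pandas as pd


def unwrap_zerodim(value):
    """Return a plain scalar for 0-dimensional arrays; pass anything else through.

    Hypothetical helper for illustration only, not part of pandas.
    """
    if isinstance(value, np.ndarray) and value.ndim == 0:
        return value.item()
    return value


s = pd.Series([-1, 2, 3])

# Unwrapping first means clip only ever sees an ordinary Python scalar.
clipped = s.clip(lower=unwrap_zerodim(np.array(0)))
print(clipped.tolist())  # [0, 2, 3]
```

List-like bounds pass through unchanged, so the existing length check in clip still applies to them.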

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-41-generic
Version : #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 2.0.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.1.0
pip : 24.0
Cython : None
pytest : 8.2.2
hypothesis : 6.103.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : 3.9.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

rhshadrach commented 1 week ago

Thanks for the report. This does indeed appear to me to be an issue, but I wonder whether it is widespread throughout pandas and what the ramifications of trying to fix it systematically would be. E.g.

import numpy as np
from pandas._libs import lib

print(lib.is_scalar(np.array(0)))
# False

Further investigations are welcome!

jbrockmendel commented 1 week ago

Lib.itemfromzerodim

Edit [rhshadrach]: lib.item_from_zerodim
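
A minimal sketch of what that helper does, assuming the private module pandas._libs.lib keeps exposing item_from_zerodim (private APIs may change between releases): it unwraps a 0-dimensional array into a NumPy scalar, which is_scalar then accepts.

```python
import numpy as np
from pandas._libs import lib  # private module; not a stable public API

zero_dim = np.array(0)

# item_from_zerodim unwraps 0-d arrays and returns other inputs unchanged.
unwrapped = lib.item_from_zerodim(zero_dim)

print(lib.is_scalar(zero_dim))   # False
print(lib.is_scalar(unwrapped))  # True
```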

randolf-scholz commented 1 week ago

I think there are two ways to handle it:

  1. Consider only objects that are scalars.
  2. Consider objects that can be interpreted as scalars.

Regarding the latter, any element of a 1-dimensional vector space can be considered a scalar, since in this case the vector space and its base field are isomorphic. To this end, numpy and many other libraries offer the .item() method, which returns a scalar if the array contains exactly one element (although it does not currently seem to be part of the Python Array API standard).

pandas._libs.lib.is_scalar seems to be in line here with numpy.isscalar, which also returns False for np.array(0), since technically this is a 0-dimensional array and hence not a scalar.
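
The distinction can be checked directly in numpy. The sketch below uses only public numpy calls; the variable names are illustrative. It shows that .item() accepts any array with exactly one element, regardless of its number of dimensions, and rejects larger arrays.

```python
import numpy as np

zero_dim = np.array(0)      # shape (), ndim 0
one_elem = np.array([[5]])  # shape (1, 1), ndim 2, but only one element
two_elem = np.array([1, 2])

print(np.isscalar(zero_dim))          # False
print(zero_dim.ndim, zero_dim.shape)  # 0 ()

# .item() converts any single-element array to a Python scalar.
print(zero_dim.item(), one_elem.item())  # 0 5

# Arrays with more than one element are rejected.
try:
    two_elem.item()
    item_raised = False
except ValueError:
    item_raised = True
print(item_raised)  # True
```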

If (1) is preferred by the maintainers, this issue can probably be closed. However, numpy.clip does support passing 0-dimensional arrays, and so does Series.where, which can be used to implement Series.clip:

import numpy as np
import pandas as pd
s = pd.Series([-1,2,3])
s_clipped = s.where(s>np.array(0), np.array(0))
pd.testing.assert_series_equal(s_clipped, s.clip(lower=0))  # ✅

Whether one wants to go with option ① or ② is probably just a matter of taste/design, but using this choice consistently throughout the API seems desirable.

rhshadrach commented 1 week ago

> but using this choice consistently throughout the API seems desirable.

Right - I'm not sure how well this is supported throughout pandas. You mentioned clip, but I think there are a number of other methods that take scalars like this. It seems to me the next steps are to determine which methods support this, and from that we can find a reasonable way to achieve consistency.
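
One way to start such a survey is a small probe that calls a handful of Series methods with a 0-dimensional array and records whether each call succeeds. This is only a sketch: the function name probe_zerodim_support and the choice of methods are assumptions made for illustration, and the outcomes depend on the installed pandas and numpy versions, so none are stated here.

```python
import numpy as np
import pandas as pd


def probe_zerodim_support():
    """Call a few Series methods with a 0-d array and record each outcome.

    Hypothetical helper; the method list is illustrative, not exhaustive.
    """
    s = pd.Series([-1.0, 2.0, np.nan])
    zero = np.array(0)
    calls = {
        "clip": lambda: s.clip(lower=zero),
        "where": lambda: s.where(s > zero, zero),
        "mask": lambda: s.mask(s < zero, zero),
        "fillna": lambda: s.fillna(zero),
        "add": lambda: s.add(zero),
    }
    results = {}
    for name, call in calls.items():
        try:
            call()
            results[name] = "ok"
        except Exception as exc:  # record the failure instead of stopping
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


results = probe_zerodim_support()
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Extending the calls dictionary to more methods, and to DataFrame, would give a first picture of where 0-dimensional arrays are accepted today.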