pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

BUG: clip does not accept numpy's arguments #35321

Open vfilimonov opened 4 years ago

vfilimonov commented 4 years ago

Code Sample, a copy-pastable example

x = pd.Series(np.random.randn(100))
x.clip(lower=0)         # Works
x.clip(lower=0, min=0)  # TypeError: clip() got an unexpected keyword argument 'min'

Problem description

Documentation of clip (dataframe, series) states the method signature:

Series.clip(self: ~ FrameOrSeries, lower=None, upper=None, axis=None, inplace: bool = False, *args, **kwargs)

and explains about **kwargs:

*args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.

However min or max arguments of numpy are not accepted and raised TypeError.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.1.final.0 python-bits : 64 OS : Darwin OS-release : 18.7.0 machine : x86_64 processor : i386 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.0.5 numpy : 1.18.2 pytz : 2019.3 dateutil : 2.8.1 pip : 20.1.1 setuptools : 42.0.1.post20191125 Cython : None pytest : 5.3.0 hypothesis : None sphinx : 2.2.0 blosc : None feather : None xlsxwriter : None lxml.etree : 4.4.1 html5lib : None pymysql : 0.9.3 psycopg2 : 2.8.5 (dt dec pq3 ext lo64) jinja2 : 2.10.1 IPython : 7.13.0 pandas_datareader: None bs4 : 4.6.3 bottleneck : None fastparquet : 0.3.2 gcsfs : None lxml.etree : 4.4.1 matplotlib : 3.1.3 numexpr : 2.7.0 odfpy : None openpyxl : None pandas_gbq : None pyarrow : 0.17.1 pytables : None pytest : 5.3.0 pyxlsb : None s3fs : 0.4.2 scipy : 1.5.1 sqlalchemy : 1.3.13 tables : 3.4.4 tabulate : 0.8.7 xarray : 0.15.1 xlrd : 1.1.0 xlwt : None xlsxwriter : None numba : 0.46.0
jreback commented 4 years ago

and how exactly does numpy accept min as an arg?

vfilimonov commented 4 years ago

Hello, @jreback

numpy's clip operates with min and max instead of lower and upper:

x = np.random.randn(100)
x.clip(min=0)

Perhaps pandas' **kwargs are not meant for this purpose - but it is not clear from the docs, what exactly does "compatibility with numpy" means.

The way how I stumbled across it - is trying to have some function to behave identically for pandas' and numpy's input. I ended up having the following ugly workaround:

def clip(x, lower=None, upper=None):
    """ Unified interface for pandas' and numpy's clip """
    try:
        return x.clip(lower=lower, upper=upper)
    except ValueError:
        if lower is None and upper is None:
            return x
        return x.clip(min=lower, max=upper)
jreback commented 4 years ago

show me numpy docs i couldn't see this parameter

vfilimonov commented 4 years ago

It's here: numpy.ndarray.clip

But eventually numpy itself is not consistent and numpy.clip operates with a_min and a_max instead.

jreback commented 4 years ago

right that's what i saw a_max/a_min

so i think of you use that it would work

vfilimonov commented 4 years ago

It looks like a_min/a_max are also rejected by pandas:

x = pd.Series(np.random.randn(100))
x.clip(a_min=0)  # TypeError: clip() got an unexpected keyword argument 'a_min'
x.clip(a_max=0)  # TypeError: clip() got an unexpected keyword argument 'a_max'

If fixing this and allowing pandas to accept numpy's argument in clip I would vote to keep the min/max arguments of numpy.ndarray.clip() rather than a_min and a_max of numpy.clip(). This way it will keep the polymorphic behavior as both of pandas.Series/DataFrame and numpy.ndarray could then accept the same arguments:

x_np = np.random.randn(100)
x_pd = pd.Series(x_np)

x_np.clip(min=0)  
x_pd.clip(min=0)

Another option is indeed to remove the **kwargs from the docs and do not support such compatibility, since the parameters are ambiguous for numpy.

simonjayhawkins commented 4 years ago

If fixing this and allowing pandas to accept numpy's argument in clip I would vote to keep the min/max arguments of numpy.ndarray.clip() rather than a_min and a_max of numpy.clip().

it looks like that when the np.clip function dispatches to Series.clip, the arguments are passed as positional arguments and hence a_min/a_max do not get passed on to Series.clip

np.clip(ser, a_min=1, a_max=8)
> c:\users\simon\pandas\pandas\core\generic.py(7316)clip()
-> inplace = validate_bool_kwarg(inplace, "inplace")
(Pdb) a
self = 0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
lower = 1
upper = 8
axis = None
inplace = False
args = ()
kwargs = {'out': None}

and hence the following works as expected.

>>> pd.__version__
'1.1.0rc0'
>>>
>>> arr = np.arange(10)
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>>
>>> ser = pd.Series(arr)
>>> ser
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
>>>
>>> np.clip(ser, a_min=1, a_max=8)
0    1
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    8
dtype: int32
>>>
>>> np.clip(ser, 1, 8)
0    1
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    8
dtype: int32
>>>
>>> np.clip(a_max=8, a_min=1, a=ser)
0    1
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    8
dtype: int32
>>>

w.r.t the min/max keywords of ndarray.clip, unlike np.clip, these arguments are optional, however at least one of max or min must be given.

I think that there could be an argument for deprecating the current keywords in Series.clip, upper/lower in favour of min/max.

we have a deprecate_kwarg decorator specifically for this purpose.

further investigation and PRs welcome.

simonjayhawkins commented 4 years ago

The way how I stumbled across it - is trying to have some function to behave identically for pandas' and numpy's input. I ended up having the following ugly workaround:

does passing the arguments as positional arguments work? ndarray.clip and Series.clip have the first two arguments in the same order.

or use np.clip.