tidypyverse / tidypandas

A grammar of data manipulation for pandas inspired by tidyverse
https://tidypyverse.github.io/tidypandas/
MIT License

[bug] mutate provides not intuitive error messages #20

Closed talegari closed 2 years ago

talegari commented 2 years ago

With version v0.2.1:

>>> import numpy as np
>>> from palmerpenguins import load_penguins
>>> from tidypandas import tidyframe
>>> pen = tidyframe(load_penguins())
>>> pen
# A tidy dataframe: 344 X 8
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  ...
  <object>   <object>       <float64>      <float64>          <float64>    <float64>  ...
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0  ...
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  ...
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  ...
3   Adelie  Torgersen             NaN            NaN                NaN          NaN  ...
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  ...
5   Adelie  Torgersen            39.3           20.6              190.0       3650.0  ...
6   Adelie  Torgersen            38.9           17.8              181.0       3625.0  ...
7   Adelie  Torgersen            39.2           19.6              195.0       4675.0  ...
8   Adelie  Torgersen            34.1           18.1              193.0       3475.0  ...
9   Adelie  Torgersen            42.0           20.2              190.0       4250.0     
#... with 334 more rows, and 2 more columns: sex <object>, year <int64>
  1. These three are valid and equivalent:
    pen.mutate({'ldm': ("np.maximum", ['bill_depth_mm', 'bill_length_mm'])})
    pen.mutate({'ldm': (lambda x, y: np.maximum(x, y), ['bill_depth_mm', 'bill_length_mm'])})
    pen.mutate({'ldm': lambda x: np.maximum(x['bill_depth_mm'], x['bill_length_mm'])})

Now,

>>> pen.mutate({'ldm': (np.maximum, ['bill_depth_mm', 'bill_length_mm'])})
TypeError: unsupported callable

is not a valid call, as the function np.maximum does not take arguments x and y. But this should produce a clearer error message.

  2. For the grouped case, we get a different error, again not intuitive:
    >>> pen.mutate({'ldm': (np.maximum, ['bill_depth_mm', 'bill_length_mm'])}, by = 'species')
    AssertionError: arg 'column_names' should contain valid column namesThese column(s) do not exist: ['HuPVeoasPkPuWIkSMQhM']

Solution (draft): When the number of args is >= 2, check whether they are named appropriately. If not, raise a meaningful error right away. @grahitr what do you think?

grahitr commented 2 years ago

@talegari On debugging, I realized that the real cause of the error is the _is_kwargable check in mutate, which in turn calls inspect.getfullargspec. NumPy functions and Python builtins written directly in C raise the same exception when passed to inspect.getfullargspec. https://stackoverflow.com/questions/27769462/using-inspect-getargfullspec-to-find-out-about-functions-not-working

np.maximum works on two arrays of equal size; the names of the two arguments in the function signature are not the source of the error.
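
To illustrate the diagnosis, here is a minimal sketch of how a signature check could turn the bare "unsupported callable" into an actionable message. The helper name positional_arg_names and the wording of the message are hypothetical; this is not the tidypandas implementation of _is_kwargable, only one way the inspect.getfullargspec call could be guarded.

```python
import inspect


def positional_arg_names(func):
    """Return the positional argument names of func.

    Hypothetical helper, not part of tidypandas. If inspect.getfullargspec
    cannot introspect the callable, re-raise a TypeError that tells the
    user what to do instead of the bare "unsupported callable".
    """
    try:
        return inspect.getfullargspec(func).args
    except TypeError as exc:
        name = getattr(func, "__name__", repr(func))
        raise TypeError(
            f"cannot inspect the signature of {name!r}; "
            f"wrap it in a lambda or def, e.g. lambda x, y: {name}(x, y)"
        ) from exc


# A plain Python lambda is introspectable, so the names come back directly.
print(positional_arg_names(lambda x, y: max(x, y)))  # ['x', 'y']
```

Per the diagnosis above, a callable such as np.maximum would take the except branch and surface the longer message, which points the user toward the lambda forms already listed as valid in the issue. Whether a given C-implemented callable is introspectable can depend on the Python and NumPy versions in use, so the sketch catches the failure rather than assuming it.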