kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0
890 stars 103 forks

Can not resolve column names that are also functions in the environment #65

Closed holgerbrandl closed 6 years ago

holgerbrandl commented 6 years ago

Consider the following example:

diamonds >> mutate(rank=min_rank(X.carat)) >> filter_by(X.rank <10)

This fails with

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 142, in __rrshift__
    result = self.function(other_copy)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 149, in <lambda>
    return pipe(lambda x: self.function(x, *args, **kwargs))
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 329, in __call__
    return self.function(*args, **kwargs)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 282, in __call__
    return self.function(df, *args, **kwargs)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/subset.py", line 62, in mask
    if arg.dtype != bool:
AttributeError: 'NotImplementedType' object has no attribute 'dtype'

but seems legit to me.

sharpe5 commented 6 years ago

Can you reply with a complete reproducible example?

In other words, a snippet of code I can cut'n'paste into Python to test.

holgerbrandl commented 6 years ago

Isn't that what I did? The only thing I've skipped is the from dfply import * preamble, which I took for granted in here.

sharpe5 commented 6 years ago

I think "rank" is a keyword. Try this:

Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32

from dfply import *
diamonds >> mutate(my_rank=min_rank(X.carat)) >> mask(X.my_rank < 10)

My version of dfply doesn't support filter_by(...), so I've used mask instead which is exactly equivalent.

p.s. I might be wrong, but it's usually better to include a complete reproducible example, including imports and Python version. Sometimes it's the simple things that can throw a spanner in the works.

holgerbrandl commented 6 years ago

I think it's rather a member function of pandas.DataFrame. But when symbols are being resolved internally by dfply, I'd expect variables to have precedence.
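The name collision can be seen with pandas alone. The snippet below is a minimal sketch with a made-up three-row frame (the data is an assumption, not the diamonds dataset). It shows that attribute access returns the DataFrame.rank method while bracket access returns the column, and that comparing the method via __lt__ yields NotImplemented, which is consistent with the 'NotImplementedType' in the traceback above. Whether dfply reaches that value in exactly this way is an assumption.

```python
import pandas as pd

# Made-up data for illustration; only the column name "rank" matters here.
df = pd.DataFrame({"carat": [0.23, 0.21, 0.29], "rank": [2.0, 1.0, 3.0]})

by_attribute = df.rank   # the bound method DataFrame.rank, not the column
by_key = df["rank"]      # the column, as a Series

print(callable(by_attribute))      # True
print(type(by_key).__name__)       # Series

# A bound method defines no ordering, so __lt__ returns NotImplemented.
method_compare = by_attribute.__lt__(10)
print(method_compare is NotImplemented)  # True

# The column compares element-wise and gives a usable boolean mask.
mask_values = (by_key < 10).tolist()
print(mask_values)                 # [True, True, True]
```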

I'll try to submit the next ticket in a more reproducible way.

sharpe5 commented 6 years ago

Glad it's working. All the best!

sharpe5 commented 6 years ago

Could you please close this issue? Thanks!

holgerbrandl commented 6 years ago

But the problem is not solved at all?! It also affects dozens of other names which happen to be used by pandas. rank was just an example. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html for a complete listing.

Of course, @kieferk, if you think it's not worth fixing or too hard, feel free to close it.

sharpe5 commented 6 years ago

Perhaps have a more meaningful error message?

Something like "Cannot use 'rank' as a variable name as this is a reserved word.".

holgerbrandl commented 6 years ago

This, or giving column names priority over pandas functions when resolving X.foo. The latter seems more correct to me, but I haven't used dfply much yet.
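As a rough illustration of what "column names first" could mean, here is a sketch written against plain pandas. The helper resolve_column_first is hypothetical and is not part of dfply or pandas; it only shows the lookup order being proposed.

```python
import pandas as pd


def resolve_column_first(df: pd.DataFrame, name: str):
    """Hypothetical helper: prefer a column called `name` over a DataFrame attribute."""
    if name in df.columns:
        return df[name]        # a column wins, even if a method has the same name
    return getattr(df, name)   # otherwise fall back to normal attribute lookup


# Made-up data for illustration.
df = pd.DataFrame({"carat": [0.23, 0.21], "rank": [2.0, 1.0]})

col = resolve_column_first(df, "rank")     # the "rank" column, not DataFrame.rank
shape = resolve_column_first(df, "shape")  # no such column, so the attribute
print(type(col).__name__, shape)
```

The trade-off is that a column could then silently shadow a pandas method that other code expects to call through the same lookup.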

kieferk commented 6 years ago

I'm open to fixing this if possible, but it's tricky. The X symbol is just a generic instance of the Intention class, and as such is at some point evaluated against a "context" object. If the context passed is a pandas DataFrame, which is typically the case, it will apply the function to that DataFrame. The function in this case would be the __getattr__ call for foo (or rank, or whatever it may be).
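The deferred evaluation described here can be sketched in pure Python. This is a toy model under stated assumptions, not dfply's actual code: the class bodies, the evaluate method, and ToyFrame are all invented for illustration. It shows why X.rank resolves to a method once a context is supplied, while X['rank'] resolves to the column.

```python
class Intention:
    """Toy deferred expression: wraps a function to apply to a context later."""

    def __init__(self, function=lambda ctx: ctx):
        self.function = function

    def evaluate(self, context):
        return self.function(context)

    def __getattr__(self, name):
        # Only called for names not found normally; defers the lookup.
        return Intention(lambda ctx: getattr(self.function(ctx), name))

    def __getitem__(self, key):
        return Intention(lambda ctx: self.function(ctx)[key])

    def __lt__(self, other):
        return Intention(lambda ctx: self.function(ctx).__lt__(other))


class ToyFrame:
    """Stand-in for a DataFrame: columns via [] and, as a fallback, via attributes."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, key):
        return self._columns[key]

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so real methods win over columns.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

    def rank(self):
        return "a method, not a column"


X = Intention()
frame = ToyFrame({"carat": 0.23, "rank": 3})

carat = X.carat.evaluate(frame)                # 0.23, falls through to the column
via_attr = (X.rank < 10).evaluate(frame)       # NotImplemented: compares the method
via_key = (X["rank"] < 10).evaluate(frame)     # True: compares the column value
print(carat, via_attr, via_key)
```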

The ugly way to deal with this would be to do a check on the context object before it's sent to the function and have special logic in place to "override" the pandas behavior. To be honest I'm not really keen on doing that. Pandas would expect you to access your variable by string name in the case that it duplicates a built-in function, and so I'd advise you to do the same. For example:

from dfply import *

diamonds >> head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

diamonds >> select(X['cut']) >> head()
       cut
0    Ideal
1  Premium
2     Good
3  Premium
4     Good

In your case of course you would have 'rank' instead of 'cut'.

sharpe5 commented 6 years ago

Perhaps just give a meaningful error in this case?

holgerbrandl commented 6 years ago

@kieferk thanks for the details. I did not know about the X['rank'] way of accessing the columns, which is a reasonable/readable way of doing it. I initially thought that it would not be possible to use names such as rank for columns at all.

Thanks both of you for your help.