Closed holgerbrandl closed 6 years ago
Can you reply with a complete reproducible example?
In other words, a snippet of code I can cut'n'paste into Python to test.
On Tue, Aug 28, 2018 at 4:11 PM Holger Brandl notifications@github.com wrote:
Consider the following example:
diamonds >> mutate(rank=min_rank(X.carat)) >> filter_by(X.rank <10)
This fails with
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 142, in __rrshift__
    result = self.function(other_copy)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 149, in
    return pipe(lambda x: self.function(x, *args, **kwargs))
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 329, in __call__
    return self.function(*args, **kwargs)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 282, in __call__
    return self.function(df, *args, **kwargs)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/subset.py", line 62, in mask
    if arg.dtype != bool:
AttributeError: 'NotImplementedType' object has no attribute 'dtype'
but seems legit to me.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kieferk/dfply/issues/65, or mute the thread https://github.com/notifications/unsubscribe-auth/ABOypNefRA-YNbi7opG_GaheojtvzZQuks5uVV1ugaJpZM4WP0cP .
Isn't that what I did? The only thing I've skipped is the from dfply import * preamble, which I took for granted here.
I think "rank" is a keyword. Try this:
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
from dfply import *
diamonds >> mutate(my_rank=min_rank(X.carat)) >> mask(X.my_rank < 10)
My version of dfply doesn't support filter_by(...), so I've used mask instead which is exactly equivalent.
p.s. I might be wrong, but it's usually better to include a complete reproducible example, including imports and Python version. Sometimes it's the simple things that can throw a spanner in the works.
I think it's rather a member function of pandas.DataFrame. But when symbols are being resolved internally by dfply, I'd expect variables to have precedence.
I'll try to submit the next ticket in a more reproducible way.
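The name clash can be reproduced with pandas alone. The sketch below uses a small made-up frame (not the diamonds data) to show that attribute access on a column called rank returns the DataFrame.rank method, while item access returns the column. Calling the comparison dunder directly on that method yields NotImplemented, which would be consistent with the 'NotImplementedType' in the traceback, although whether dfply takes exactly this path internally is an assumption.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"carat": [0.23, 0.21, 0.29], "rank": [2, 1, 3]})

carat_attr = df.carat      # no clash: attribute access returns the column
rank_attr = df.rank        # clash: this is the bound DataFrame.rank method
rank_item = df["rank"]     # item access always returns the column

# Calling the comparison dunder directly on a bound method returns the
# NotImplemented singleton instead of a boolean Series.
direct_comparison = rank_attr.__lt__(10)

# The same comparison on the column works element-wise.
column_comparison = rank_item < 10
```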
Glad it's working. All the best!
Could you please close this issue? Thanks!
But the problem is not solved at all?! It also affects dozens of other names which happen to be used by pandas. rank was just an example. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html for a complete listing.
Of course, @kieferk, if you think it's not worth fixing or too hard, feel free to close it.
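To see which names are affected on a given pandas version, one can test candidate column names against the DataFrame class. This is a minimal sketch; colliding_names is a hypothetical helper written for this illustration and is not part of pandas or dfply.

```python
import pandas as pd


def colliding_names(names, frame_type=pd.DataFrame):
    """Return the names that already exist as attributes of frame_type.

    Hypothetical helper for illustration only.
    """
    return [name for name in names if hasattr(frame_type, name)]


# A few candidate column names; the first and last are free, the rest
# are DataFrame methods.
candidates = ["carat", "rank", "count", "mean", "my_rank"]
clashes = colliding_names(candidates)
print(clashes)
```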
Perhaps have a more meaningful error message?
Something like "Cannot use 'rank' as a variable name as this is a reserved word.".
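A check along these lines could look roughly like the sketch below. It is only an illustration of the idea: check_column_name is a hypothetical function, dfply does not provide it as far as this thread shows, and the message wording is adapted from the suggestion above.

```python
import pandas as pd


def check_column_name(name, frame_type=pd.DataFrame):
    """Raise a descriptive error if name clashes with a frame_type attribute.

    Hypothetical helper for illustration; not part of dfply or pandas.
    """
    if hasattr(frame_type, name):
        raise ValueError(
            f"Cannot use {name!r} as a variable name: it is already "
            f"an attribute of {frame_type.__name__}."
        )
    return name


try:
    check_column_name("rank")
except ValueError as error:
    message = str(error)
    print(message)
```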
This, or giving column names priority over pandas functions when resolving X.foo. The latter seems more correct to me, but I haven't used dfply much yet.
I'm open to fixing this if possible, but it's tricky. The X symbol is just a generic instance of the Intention class, and as such is at some point evaluated against a "context" object. If the context passed is a pandas DataFrame, which is typically the case, it will apply the function to that DataFrame. The function in this case would be the __getattr__ call for foo (or rank, or whatever it may be).
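The mechanism described above can be sketched with a toy deferred-evaluation class. Everything here is hypothetical and heavily simplified: Deferred stands in for the idea behind Intention and ToyFrame stands in for a DataFrame, and neither reflects dfply's or pandas' actual code. The point is only that a deferred attribute lookup hands the decision to the context object, where an existing method wins over a column of the same name.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: columns live in a dict and are
    reachable as attributes only when no real attribute has that name."""

    def __init__(self, columns):
        self.columns = columns

    def rank(self):
        # A regular method that happens to share its name with a column.
        return "ToyFrame.rank() was called"

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        try:
            return self.columns[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, key):
        return self.columns[key]


class Deferred:
    """Toy deferred symbol: records operations and replays them on a context."""

    def __init__(self, function=None):
        # With no function given, evaluating returns the context itself.
        self.function = function if function is not None else (lambda context: context)

    def evaluate(self, context):
        return self.function(context)

    def __getattr__(self, name):
        # Defer the attribute lookup until a context is supplied.
        return Deferred(lambda context: getattr(self.function(context), name))

    def __getitem__(self, key):
        # Defer the item lookup until a context is supplied.
        return Deferred(lambda context: self.function(context)[key])


X = Deferred()
frame = ToyFrame({"carat": [0.23, 0.21], "rank": [2, 1]})

by_attribute = X.rank.evaluate(frame)  # resolves to the bound method
by_key = X["rank"].evaluate(frame)     # resolves to the column
carat = X.carat.evaluate(frame)        # no clash, so the column is found
```

In this toy version, giving columns priority would mean inspecting the context before the getattr call, which is the special-casing the next paragraph argues against.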
The ugly way to deal with this would be to do a check on the context object before it's sent to the function and have special logic in place to "override" the pandas behavior. To be honest I'm not really keen on doing that. Pandas would expect you to access your variable by string name in the case that it duplicates a built-in function, and so I'd advise you to do the same. For example:
from dfply import *
diamonds >> head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
diamonds >> select(X['cut']) >> head()
       cut
0    Ideal
1  Premium
2     Good
3  Premium
4     Good
In your case of course you would have 'rank' instead of 'cut'.
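For comparison, here is the same workaround in plain pandas, applied to the ranking case from the original report. This is a sketch on a small made-up frame rather than the diamonds data, and it uses Series.rank(method="min") as an assumed pandas counterpart to min_rank. The column named rank is always accessed by string, so the DataFrame.rank method never gets in the way.

```python
import pandas as pd

# Made-up carat values for this sketch.
df = pd.DataFrame({"carat": [0.31, 0.23, 0.21, 0.23]})

# Create a column literally named "rank"; ties share the lowest rank.
ranked = df.assign(rank=df["carat"].rank(method="min"))

# Filter by string access instead of attribute access.
top = ranked[ranked["rank"] < 3]
print(top)
```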
Perhaps just give a meaningful error in this case?
@kieferk thanks for the details. I did not know about the X['rank'] way of accessing the columns, which is a reasonable and readable way of doing it. I initially thought that it would not be possible to use names such as rank for columns at all.
Thanks both of you for your help.