kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

Error when groupby #15

Closed ThoDuyNguyen closed 6 years ago

ThoDuyNguyen commented 7 years ago

Below is my test data:

In[43]: payment >> head(5)
Out[43]: 

                     date      user_name game_id                channel  \
11165 2016-08-24 06:36:28  000000000000o  myfish     FB_IS_MA_AG2535_GP   
0     2016-08-02 10:14:31       00000025  myfish            google-play   
8     2016-08-02 13:18:19       00000027  myfish  Fanpage_Dailypost_APK   
10921 2016-08-23 19:48:21       00000030  myfish                 in_app   
11980 2016-08-25 11:25:29       00000030  myfish                 in_app   

        money  
11165  3000.0  
0      1000.0  
8      3000.0  
10921  3000.0  
11980  3000.0  

When I try to groupby:

In[45]: payment >> head(5) >> groupby(X.user_name)
Traceback (most recent call last):
  File "C:\Program Files\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-45-cf2fbef85582>", line 1, in <module>
    payment >> head(5) >> groupby(X.user_name)
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 45, in __rrshift__
    result = self.function(other_copy)
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 52, in <lambda>
    return pipe(lambda x: self.function(x, *args, **kwargs))
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 179, in __call__
    evaluation = self.call_action(args, kwargs)
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 285, in call_action
    return symbolic.to_callable(symbolic_function)(self.df)
  File "C:\Program Files\Anaconda2\lib\site-packages\pandas_ply\symbolic.py", line 204, in <lambda>
    return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
  File "C:\Program Files\Anaconda2\lib\site-packages\pandas_ply\symbolic.py", line 142, in _eval
    result = evaled_func(*evaled_args, **evaled_kwargs)
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 357, in wrapped
    return f(*flat_args, **kwargs)
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 456, in wrapped
    for arg in args[1:]]
  File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 437, in _col_ind_to_label
    raise Exception("Label not of type str or int.")
Exception: Label not of type str or int.

My data type:

In[46]: payment.dtypes
Out[46]: 

date         datetime64[ns]
user_name            object
game_id              object
channel              object
money               float64
dtype: object

The data was read from a database using sqlalchemy, and user_name is stored as varchar.

I ran the same command on the diamonds data and it works there, so I could not figure out why it fails on mine.

How could I fix the problem?

Kind regards.

bleearmstrong commented 7 years ago

What is the type of user_name? I see that its dtype is object, but if you get the type of a single entry (e.g. type(payment.user_name[0])), it might be more specific.

Also, what is the type of the index(es)?

ThoDuyNguyen commented 7 years ago

In[4]: type(payment.user_name[0])
Out[4]: 
str

It worked on the same dataset using the original pandas syntax:

g = payment.sort_values(["user_name", "game_id", "date"]).groupby(["user_name", "game_id"])
payment["paid_time_all"] = g["date"].rank(method="first")

bleearmstrong commented 7 years ago

What are the types of the indexes? (Something like payment.columns should give the metadata about the columns.)

Also, can you show the complete setup that got you to this state, e.g. reading in the data and any manipulation you did before the groupby?
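
One way to answer the question about label types is to look at the Python type of each column label rather than at the column dtypes. A minimal sketch, assuming the frame is called payment as above; label_types is a hypothetical helper, not part of dfply or pandas:

```python
def label_types(labels):
    """Map each column label to the name of its Python type.

    Hypothetical debugging helper; accepts any iterable of labels,
    for example payment.columns.
    """
    return {label: type(label).__name__ for label in labels}


# With the real data this would be called as label_types(payment.columns).
# A plain list stands in for the column index here.
example = label_types(["date", "user_name", 0])
print(example)
```

On Python 2, a label that prints exactly like a str can still report unicode here, which payment.dtypes would not reveal because it describes the values, not the labels.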

ThoDuyNguyen commented 7 years ago

I will post a reproducible code snippet, including a sample from the database, soon.

ThoDuyNguyen commented 7 years ago

I found out that, with my dataset, passing the column name as a quoted string ("user_name") instead of X.user_name works around the problem. For example, this fails:

first_time_play >> select(X.user_name)
Traceback (most recent call last):

  File "<ipython-input-25-7b28bc213d5b>", line 1, in <module>
    first_time_play >> select(X.user_name)

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 45, in __rrshift__
    result = self.function(other_copy)

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 52, in <lambda>
    return pipe(lambda x: self.function(x, *args, **kwargs))

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 179, in __call__
    evaluation = self.call_action(args, kwargs)

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 285, in call_action
    return symbolic.to_callable(symbolic_function)(self.df)

  File "//anaconda/lib/python2.7/site-packages/pandas_ply/symbolic.py", line 204, in <lambda>
    return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))

  File "//anaconda/lib/python2.7/site-packages/pandas_ply/symbolic.py", line 142, in _eval
    result = evaled_func(*evaled_args, **evaled_kwargs)

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 357, in wrapped
    return f(*flat_args, **kwargs)

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 478, in wrapped
    for arg in args[1:]]

  File "//anaconda/lib/python2.7/site-packages/dfply/base.py", line 411, in _col_ind_to_position
    raise Exception("Column indexer not of type str or int.")

Exception: Column indexer not of type str or int.

But this one worked:

first_time_play >> select("user_name")
Out[26]: 
            user_name
0            00000025
1            00000025
2        NyeinChanThu
3            00001150
4            00001373
5            00001371
6            00000449
7            00000027

kieferk commented 7 years ago

Hey - sorry I've been very busy at work and haven't checked this till now.

This is definitely odd. My initial guess would be that the label type is being checked and the name is unicode, so the check fails because I don't have it test for unicode there. But you printed the type and it seems to be str.

The problem looks like it's happening in the _col_ind_to_label function.
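
The guess above can be illustrated with a small stand-in. This is not dfply's actual _col_ind_to_label; strict_label is a hypothetical function that only mirrors the error message from the traceback, and on Python 3 bytes plays the role that unicode played next to str on Python 2:

```python
def strict_label(label):
    """Accept only str or int labels, mimicking the error in the traceback.

    Hypothetical stand-in, not the library's implementation.
    """
    if isinstance(label, (str, int)):
        return label
    raise Exception("Label not of type str or int.")


print(strict_label("user_name"))  # a str label passes the check

try:
    strict_label(b"user_name")    # a different string type is rejected
except Exception as err:
    message = str(err)
    print(message)
```

If the column labels of the frame were of the second kind, X.user_name would presumably resolve to such a label and fail the check, while a quoted "user_name" typed at the prompt is a plain str. That would be consistent with the workaround reported in the thread, but it remains an unconfirmed assumption.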

I will do some debugging of this as soon as I can, but it may be a few days. In the meantime, could you try this on the feature/collapsed-selection branch? A lot of the internal code has changed in that branch, which I am hoping to make the next generation of this package. I'm interested to see if it is an issue in that one too.

ThoDuyNguyen commented 7 years ago

I will try that branch.

Kind regards

kieferk commented 6 years ago

Closing this as the issue is for an old version of the package. If this is happening in the new v0.3.x package let me know.