kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0
889 stars 103 forks source link

Unable to limit the output from head after group_by #48

Open GMAravind opened 6 years ago

GMAravind commented 6 years ago

Head operator works well when the dataset was initially loaded.

df >> head(5) - returned exactly 5 rows

I applied a group_by on the dataframe and saved it to the same df = df >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())

While printing the head of the updated dataframe df >> head(5) - prints the entire dataframe instead of just printing the first five rows.

P.S : A big shout to the amazing work which went behind this package, using this saved me a lot of time. Thanks!

kieferk commented 6 years ago

Hmm yes that does appear to be a bug. I will have to poke around and see what's going on there.

If you want just the first 5 rows (as opposed to the first 5 per group), you can use the ungroup command and it should work:

df >> ungroup() >> head(5)
kieferk commented 6 years ago

Wait nevermind I was wrong, it does produce the intended behavior. Right now if you group_by and then issue a subsetting command like head or tail, it will be applied to the groups then re-merged. See this with the "cut" field of diamonds data below:

In [5]: d >> head(2)
Out[5]: 
    carat        cut color clarity  depth  table  price     x     y     z
8    0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
91   0.86       Fair     E     SI2   55.1   69.0   2757  6.45  6.33  3.52
2    0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
4    0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
0    0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
11   0.23      Ideal     J     VS1   62.8   56.0    340  3.93  3.90  2.46
1    0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
3    0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
5    0.24  Very Good     J    VVS2   62.8   57.0    336  3.94  3.96  2.48
6    0.24  Very Good     I    VVS1   62.3   57.0    336  3.95  3.98  2.47

Not sure why you're getting the entire dataset output with head(5). Are there only <= 5 rows per team name in your dataset?

GMAravind commented 6 years ago

The original data has more than 5 records per team. image

Once grouped by the team name, the rolled up dataset would have only one record per team name. df_grouped = df >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())

df_grouped >> head(5) I would like to print the first 5 records of the df_grouped which actually prints the entire df_grouped data frame. I checked again if df_grouped >> head(5) is printing the entire data frame and it does.

kieferk commented 6 years ago

Interesting. Are you using the most recent version of the package? If you clone the repo and reinstall does it still happen?

Also does df_grouped >> ungroup() >> head(5) solve your problem?

kieferk commented 6 years ago

OK I found the dataset you're working with in this notebook: https://github.com/vineettanna/Package-dfply-Exploration/blob/master/dfply_exploration.ipynb

Funny enough while answering this question I found a bug in if_else. Go figure... Anyway, as far as I can tell it is behaving normally, it just seems like it is printing the whole dataframe.

So after loading it in I can set up the same grouped dataframe:

baseball = pd.read_csv("baseball.csv",names = ["Player_Name","Team_Name","Bats","Hits"])
df_grouped = baseball >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())

If you don't ungroup before using head(5) it is going to output the first 5 rows per group. There are quite a few groups so it may seem like the whole dataset, but in fact it is not. For example, look at the shape of the original dataframe:

baseball.shape
> (144, 4)

And then the shape of the head(5) command applied to the grouped dataframe:

grouped_head = df_grouped >> head(5)
grouped_head.shape
> (125, 5)

The original has 144 observations but grouped_head only has 125. This is because it is performing the expected behavior: taking at most the first 5 observations per group.

See here:

baseball >> (group_by(X.Team_Name) >> 
    summarize(player_count = n_distinct(X.Player_Name)) >>
    mutate(cap_count = if_else(X.player_count > 5, 5, X.player_count)) >> 
    ungroup() >> 
    summarize(all_count_sum=np.sum(X.player_count), cap_count_sum=np.sum(X.cap_count)))

   all_count_sum  cap_count_sum
0            144            125