Open GMAravind opened 6 years ago
Hmm yes that does appear to be a bug. I will have to poke around and see what's going on there.
If you want just the first 5 rows (as opposed to the first 5 per group), you can use the ungroup command and it should work:
df >> ungroup() >> head(5)
Wait nevermind I was wrong, it does produce the intended behavior. Right now if you group_by
and then issue a subsetting command like head
or tail
, it will be applied to the groups then re-merged. See this with the "cut" field of diamonds data below:
In [5]: d >> head(2)
Out[5]:
carat cut color clarity depth table price x y z
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
91 0.86 Fair E SI2 55.1 69.0 2757 6.45 6.33 3.52
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
Not sure why you're getting the entire dataset output with head(5)
. Are there only <= 5 rows per team name in your dataset?
The original data has more than 5 records per team.
Once grouped by the team name, the rolled up dataset would have only one record per team name. df_grouped = df >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())
df_grouped >> head(5) I would like to print the first 5 records of the df_grouped which actually prints the entire df_grouped data frame. I checked again if df_grouped >> head(5) is printing the entire data frame and it does.
Interesting. Are you using the most recent version of the package? If you clone the repo and reinstall does it still happen?
Also does df_grouped >> ungroup() >> head(5)
solve your problem?
OK I found the dataset you're working with in this notebook: https://github.com/vineettanna/Package-dfply-Exploration/blob/master/dfply_exploration.ipynb
Funny enough while answering this question I found a bug in if_else
. Go figure... Anyway, as far as I can tell it is behaving normally, it just seems like it is printing the whole dataframe.
So after loading it in I can set up the same grouped dataframe:
baseball = pd.read_csv("baseball.csv",names = ["Player_Name","Team_Name","Bats","Hits"])
df_grouped = baseball >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())
If you don't ungroup before using head(5)
it is going to output the first 5 rows per group. There are quite a few groups so it may seem like the whole dataset, but in fact it is not. For example, look at the shape of the original dataframe:
baseball.shape
> (144, 4)
And then the shape of the head(5)
command applied to the grouped dataframe:
grouped_head = df_grouped >> head(5)
grouped_head.shape
> (125, 5)
The original has 144 observations but grouped_head
only has 125. This is because it is performing the expected behavior: taking at most the first 5 observations per group.
See here:
baseball >> (group_by(X.Team_Name) >>
summarize(player_count = n_distinct(X.Player_Name)) >>
mutate(cap_count = if_else(X.player_count > 5, 5, X.player_count)) >>
ungroup() >>
summarize(all_count_sum=np.sum(X.player_count), cap_count_sum=np.sum(X.cap_count)))
all_count_sum cap_count_sum
0 144 125
Head operator works well when the dataset was initially loaded.
df >> head(5) - returned exactly 5 rows
I applied a group_by on the dataframe and saved it to the same df = df >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())
While printing the head of the updated dataframe df >> head(5) - prints the entire dataframe instead of just printing the first five rows.
P.S : A big shout to the amazing work which went behind this package, using this saved me a lot of time. Thanks!