tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

separate doesn't work on first grouped column when more than one column in group_by #177

Closed benmarwick closed 8 years ago

benmarwick commented 8 years ago

In a data frame where two column names are passed to group_by, the separate function fails to find the first column name (but does find the second one).

Here's an example...

df <- data.frame(the_num = 1:30,
        the_chr1 = rep(sapply(1:10, function(i) paste(sample(c(letters,LETTERS),3),collapse="")),3),
        the_chr2 = sapply(1:30, function(i) paste(sample(c(letters,LETTERS),3),collapse="")))
head(df)
  the_num the_chr1 the_chr2
1       1      Fiq      LMH
2       2      Ozf      hdv
3       3      NVK      ROc
4       4      IRe      HpE
5       5      Aeq      Rrd
6       6      vaU      Qkt

If we pipe together a few functions and then try to separate the contents of the first column passed to group_by (the_chr1 in this example), here's what happens:

library(dplyr); library(tidyr)
df %>% 
  group_by(the_chr1, the_chr2) %>% 
  summarize(mean_i = mean(the_num)) %>% 
  separate(the_chr1, c('first_bit', 'second_bit'), sep = 1)

The result is an unexpected Error: unknown column 'the_chr1'

However, if we try to separate the second column in the group_by (here it's the_chr2), it works fine:

df %>% 
  group_by(the_chr1, the_chr2) %>% 
  summarize(mean_i = mean(the_num)) %>% 
  separate(the_chr2, c('first_bit', 'second_bit'), sep = 1)
Source: local data frame [30 x 4]
Groups: the_chr1 [10]

   the_chr1 first_bit second_bit mean_i
     (fctr)     (chr)      (chr)  (dbl)
1       Aeq         h         uX     15
2       Aeq         R         rd      5
3       Aeq         W         GJ     25
4       Fiq         F         OU     11
5       Fiq         L         MH      1
6       Fiq         y         IE     21
7       FlV         G         da     19
8       FlV         i         pU     29
9       FlV         l         Yn      9
10      hPy         A         MN      7
..      ...       ...        ...    ...

Of course, it works fine if we group_by and separate on the same single column:

df %>% 
  group_by(the_chr2) %>% 
  summarize(mean_i = mean(the_num)) %>% 
  separate(the_chr2, c('first_bit', 'second_bit'), sep = 1)
Source: local data frame [30 x 3]

   first_bit second_bit mean_i
       (chr)      (chr)  (dbl)
1          A         MN      7
2          C         ur     18
3          e         rc     24
4          F         OU     11
5          G         da     19
6          h         dv      2
7          H         pE      4
8          h         uX     15
9          h         Wv     28
10         I         JP     27
..       ...        ...    ...

So there seems to be a problem with how separate handles data frames with multiple grouping variables.
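
A possible workaround, sketched under an assumption rather than a confirmed diagnosis: the "Groups: the_chr1 [10]" line in the output above shows that the summarized result is still grouped by the_chr1, so the error may be tied to that leftover grouping. If so, calling ungroup() before separate() should sidestep it. The set.seed() call and stringsAsFactors = FALSE are additions for reproducibility and are not part of the original example.

```r
library(dplyr)
library(tidyr)

# Added for reproducibility; the original example does not set a seed.
set.seed(1)

# Same shape as the example data, with strings kept as character
# (stringsAsFactors = FALSE is an addition, not in the original).
df <- data.frame(
  the_num  = 1:30,
  the_chr1 = rep(sapply(1:10, function(i) paste(sample(c(letters, LETTERS), 3), collapse = "")), 3),
  the_chr2 = sapply(1:30, function(i) paste(sample(c(letters, LETTERS), 3), collapse = "")),
  stringsAsFactors = FALSE
)

summarised <- df %>%
  group_by(the_chr1, the_chr2) %>%
  summarize(mean_i = mean(the_num))

# summarize() drops only the last grouping variable by default,
# so the result is still grouped by the_chr1.
remaining_groups <- group_vars(summarised)
print(remaining_groups)

# Assumed workaround: remove the leftover grouping, then split the column.
res <- summarised %>%
  ungroup() %>%
  separate(the_chr1, c("first_bit", "second_bit"), sep = 1)

head(res)
```

Whether the leftover grouping is the actual cause is not established here; the sketch only avoids calling separate() on a column that is still a grouping variable.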

benmarwick commented 8 years ago

Thanks!