tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/
Other
1.38k stars 417 forks source link

Spreading NA keys makes NA column names #68

Closed gregmacfarlane closed 8 years ago

gregmacfarlane commented 9 years ago

Trying to create a cross tabulation using tally() combined with spread() works nicely, unless there are missing values in the key columns.

mtcars %>% 
  group_by(cyl, gear) %>%
  tally() %>%
  spread(gear, n, fill = 0)
Source: local data frame [3 x 4]

  cyl  3 4 5
1   4  1 8 2
2   6  2 4 1
3   8 12 0 2

mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  group_by(cyl, gear) %>%
  tally()
Source: local data frame [8 x 3]
Groups: cyl

  cyl gear  n
1   4    3  1
2   4    4  8
3   4   NA  2
4   6    3  2
5   6    4  4
6   6   NA  1
7  NA    3 12
8  NA   NA  2

# DITTO # %>%
  spread(gear, n)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

It seems like a good solution would be to either treat NA values as another key or to drop them from the resulting table, controlled by an option switch. Like useNA = "always" in the base table() function, but with better syntax.

zross commented 9 years ago

Thanks for adding this issue. I second it. At a minimum a warning "There cannot be NA values in the key variable" would be useful. As always, thanks for this very useful package Hadley.

hadley commented 9 years ago

Simpler example:

data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
  tbl_df() %>%
  spread(x, z)

Note that it's actually a dplyr problem since this works:

data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
  spread(x, z)

Should fix by turning NA column names into something valid <NA> ?

zross commented 9 years ago

I think that makes sense. Or an argument that allows the user to decide along the lines of useNA=c(TRUE, FALSE) and if TRUE then turn NA to something valid.

jennybc commented 9 years ago

I just tripped on this same thing. Before I found this issue, I had made my own example, so will include here anyway. As for what to name the columns, it seems like an option to prepend the key name would be useful here and more generally.

jtxt <- "
name, grp,    sex, age
 joe,   1,   male,  30
 jan,   1,     NA,  43
 jen,   1, female,  44
 pat,   2,     NA,  24
 pam,   2, female,  63
 peg,   2, female,  50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)

Cross-tabulating and spreading with (sex, grp) works:

> jdat %>%
+ count(sex, grp) %>%
+ spread(grp, n, fill = 0)
Source: local data frame [3 x 3]

     sex 1 2
1 female 1 2
2   male 1 0
3     NA 1 1

But going the other way round, (grp, sex), does not:

> jdat %>%
+ count(grp, sex) %>%
+ spread(sex, n, fill = 0)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

To illustrate my proposal re column names, here's possible output:

grp sex_female sex_male sex_NA
  1          1        1      1
  2          2        0      1

The prepending would fix up the value of NA as a column name and would also help when the values are cryptic, e.g. naked integers like cyl and gear from mtcars.

gregmacfarlane commented 9 years ago

@jennybc I like the idea of prepending. It will resolve the problem where kable(spread()) needs some in-text explanation as to what it is showing.

grp          1        2      3
  1          1        1      1
  2          2        0      1
hadley commented 9 years ago

@jennybc Seems like most of the time you'll want the key name prepended if the value is non-character, but most of the time you won't want it prepended if it is a character. So maybe something like include_key = !is.character(value) ?

jennybc commented 9 years ago

Yes, @hadley, I think you're right about when prepending the key is most desirable. That, together with the ability to have a variable/column named <NA>, would represent a nice gain in functionality for spread().

hadley commented 8 years ago

This no longer fails for me, presumably because dplyr does a better job of handling columns names that are missing.

I'm a bit leery of automatically changing the column names, because traditionally that's been done in a separate step (which is long but makes things more explicit, and sticks to the principle of each function doing one thing).

hadley commented 8 years ago

Implemented with sep argument. If anyone strongly dislikes this API, please speak up ASAP!

data_frame(x = c(1, NA), y = 1:2) %>% 
  spread(x, y) %>% 
  names()
#> [1] "1"    "<NA>"

data_frame(x = c(1, NA), y = 1:2) %>% 
  spread(x, y, sep = "_") %>% 
  names()
#> [1] "x_1"  "x_NA"
zross commented 8 years ago

A very useful addition!

ghost commented 6 years ago

I think the naming of columns would be more intuitive if the default name when sep is specified was "" to edit the above example:

jtxt <- "
name, grp,    sex, age
 joe,   1,   male,  30
 jan,   1,     NA,  43
 jen,   1, female,  44
 joe,   2,     NA,  24
 jan,   2, female,  63
 jen,   2, female,  50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)

If we spread this:

spread(jdat,grp,sex,sep = "_")
 name age  grp_1  grp_2
1  jan  43   <NA>   <NA>
2  jan  63   <NA> female
3  jen  44 female   <NA>
4  jen  50   <NA> female
5  joe  24   <NA>   <NA>
6  joe  30   male   <NA>

grp_1 is not a very descriptive title where as sex_1 would be more helpful.

It would also help make spread() more useful if you could spread more than one variable. In this example the age variable has been separated from group and it not longer makes sense. Alternatively, if there was an option ie keep_key=TRUE which would allow you to keep the grp column intact then you could spread the age column separately.