Closed gregmacfarlane closed 8 years ago
Thanks for adding this issue. I second it. At a minimum a warning "There cannot be NA values in the key variable" would be useful. As always, thanks for this very useful package Hadley.
Simpler example:
data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
tbl_df() %>%
spread(x, z)
Note that it's actually a dplyr problem since this works:
data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
spread(x, z)
Should fix by turning NA column names into something valid <NA>
?
I think that makes sense. Or an argument that allows the user to decide along the lines of useNA=c(TRUE, FALSE) and if TRUE then turn NA to something valid.
I just tripped on this same thing. Before I found this issue, I had made my own example, so will include here anyway. As for what to name the columns, it seems like an option to prepend the key name would be useful here and more generally.
jtxt <- "
name, grp, sex, age
joe, 1, male, 30
jan, 1, NA, 43
jen, 1, female, 44
pat, 2, NA, 24
pam, 2, female, 63
peg, 2, female, 50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)
Cross-tabulating and spreading with (sex, grp)
works:
> jdat %>%
+ count(sex, grp) %>%
+ spread(grp, n, fill = 0)
Source: local data frame [3 x 3]
sex 1 2
1 female 1 2
2 male 1 0
3 NA 1 1
But going the other way round, (grp, sex)
, does not:
> jdat %>%
+ count(grp, sex) %>%
+ spread(sex, n, fill = 0)
Error in if (any(names2(x) == "")) { :
missing value where TRUE/FALSE needed
To illustrate my proposal re column names, here's possible output:
grp sex_female sex_male sex_NA
1 1 1 1
2 2 0 1
The prepending would fix up the value of NA
as a column name and would also help when the values are cryptic, e.g. naked integers like cyl
and gear
from mtcars
.
@jennybc I like the idea of prepending. It will resolve the problem where kable(spread())
needs some in-text explanation as to what it is showing.
grp 1 2 3
1 1 1 1
2 2 0 1
@jennybc Seems like most of the time you'll want the key name prepended if the value is non-character, but most of the time you won't want it prepended if it is a character. So maybe something like include_key = !is.character(value)
?
Yes, @hadley, I think you're right about when prepending the key is most desirable. That, together with the ability to have a variable/column named <NA>
, would represent a nice gain in functionality for spread()
.
This no longer fails for me, presumably because dplyr does a better job of handling columns names that are missing.
I'm a bit leery of automatically changing the column names, because traditionally that's been done in a separate step (which is long but makes things more explicit, and sticks to the principle of each function doing one thing).
Implemented with sep
argument. If anyone strongly dislikes this API, please speak up ASAP!
data_frame(x = c(1, NA), y = 1:2) %>%
spread(x, y) %>%
names()
#> [1] "1" "<NA>"
data_frame(x = c(1, NA), y = 1:2) %>%
spread(x, y, sep = "_") %>%
names()
#> [1] "x_1" "x_NA"
A very useful addition!
I think the naming of columns would be more intuitive if the default name when sep
is specified was "
jtxt <- "
name, grp, sex, age
joe, 1, male, 30
jan, 1, NA, 43
jen, 1, female, 44
joe, 2, NA, 24
jan, 2, female, 63
jen, 2, female, 50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)
If we spread this:
spread(jdat,grp,sex,sep = "_")
name age grp_1 grp_2
1 jan 43 <NA> <NA>
2 jan 63 <NA> female
3 jen 44 female <NA>
4 jen 50 <NA> female
5 joe 24 <NA> <NA>
6 joe 30 male <NA>
grp_1
is not a very descriptive title where as sex_1
would be more helpful.
It would also help make spread()
more useful if you could spread more than one variable. In this example the age variable has been separated from group and it not longer makes sense. Alternatively, if there was an option ie keep_key=TRUE
which would allow you to keep the grp column intact then you could spread the age column separately.
Trying to create a cross tabulation using
tally()
combined withspread()
works nicely, unless there are missing values in the key columns.It seems like a good solution would be to either treat
NA
values as another key or to drop them from the resulting table, controlled by an option switch. LikeuseNA = "always"
in the basetable()
function, but with better syntax.