tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.74k stars 2.12k forks source link

Equivalent of SQL CASE #631

Closed edwindj closed 8 years ago

edwindj commented 9 years ago

In data cleaning data often contains values that one would like to impute/replace.

For example:

In a survey for smokers, one would not ask a person his or her cigarette consumption when he/she has indicated not to be a smoker. So technically the number of cigarretes is not measured, but it can be deduced to be zero.

smoke <- read.csv(textConnection(
"name,is_smoker,cigarrettes,cigares
 John,TRUE,15, 0
 Adam,FALSE,NA,NA
 Phil,TRUE,0,1
"
))

#standard R way
smoke$cigarrettes[!smoke$is_smoker & is.na(smoke$cigarrettes)] <- 0

How does one do this elegantly in dplyr?

My 2 cents:

#dplyr option 1: split the data set, fix one part and combine the two
smoker_true <- smoke %>% filter(!is_smoker, is.na(smoke$cigarettes))
# complement
smoker_false <- smoke %>% filter(is_smoker | !is.na(smoke$cigarettes))

smoke2 <- rbind( smoker_true %>% mutate(cigarrettes=0)
               , smoker_false
)

which I don't find elegant dplyr code.

So try a little harder

# dplyr option 2: replace the whole variable and do the filtering in the assignment
smoke2 <- 
  smoke %>%
  mutate(cigarrettes=ifelse(!is_smoker & is.na(cigarrettes), 0, cigarrettes))

It works, but this is not efficient/workable for the database backends and I find it unelegant to do the filtering in the assignment.

I suggest a new method: mutate_where that allows for changing values of existing variables of a subset of a tbl. So something like

smoke2 <-
   smoke %>% mutate_where(.where=!is_smoker&is.na(cigarrettes), cigarettes=0)

Or another option (in analogy with group and grouped_<tbl>) to change filter or introduce filtered_<tbl>/add_filter: setting a filter lets one work on a subset of the data, but one can remove_filter.

smoke2 <-
   smoke %>% 
   add_filter(is_smoker, is.na(cigarrettes) %>%
   mutate(cigarrettes=0) %>%
   remove_filter()

Any better ideas or suggestions?

hadley commented 9 years ago

I really prefer the vector based solution, but it would be much simpler if ifelse had a branch for NA:

smoke2 <- 
  smoke %>%
  mutate(cigarrettes = ifelse(!is_smoker, 0, cigarrettes, 0)
)

The problem is that I don't see how any other approach could work with databases - you can not assume that you have write access to the underlying table.

edwindj commented 9 years ago

Agreed :-)

While I think it is possible to implement it in DBs using temporary tables and multiple SQL statements, I think it is wise not to do so: it will complicate the translate_sql code and furthermore it is rather DB specific how TEMP tables work.

hadley commented 9 years ago

I'd like to approach this problem by figuring out what the 90% of most common in-place mutates do, and then figuring out how to make elegant vectorised replacement functions.

hadley commented 9 years ago

I'm thinking of adapting something like derivedFactor() from mosaic: https://github.com/rpruim/mosaic/blob/master/R/derivedFactor.R - not quite sure how it should look for this case yet.

@rpruim any thoughts?

rpruim commented 9 years ago

I've been thinking about this today and have an idea to float. We could use formulas of the form

expression ~ condition     # use expression when condition is TRUE
expression ~ .             # use expression when no other conditions are TRUE (i.e., default)

to specify the desired variable/vector. This is modeled on mathematical definitions by cases, so below I'll call the function cases(). The smoking example would turn into

smoke %>% mutate( 
  cigs2 = cases( 
    0 ~ ! is_smoker & is.na(cigarrettes), 
    cigarettes ~ .)
  )

And fairly complicated definitions by cases could be constructed similarly.

As with derivedFactor(), we could have options to take the first true condition, the last true condition, or to require that exactly one condition be true. This could subsume derivedFactor if there were an argument that declared the result to be a factor and used expression as labels.

Using formulas makes it easy to have arguments with natural names (rather than prepending with . or .. or _ to avoid colliding with names in a data frame). The formulas could all be part of ... and named arguments could control the overall behavior.

cases <- function( ..., match=c("unique", "first", "last"), factor=FALSE, ordered=FALSE) {
}

What do you think?

kevinykuo commented 9 years ago

Haven't given any thought on the viability for implementation, but here's an idea, inspired by @rpruim and pattern matching:

smoke %>% mutate( 
  cigs2 = cigarettes %>% match_with(
     NA ~ 0,
     . ~ cigarettes)
  )
rpruim commented 9 years ago

Hmm. Note sure you really want this. Rather, I think its use case would be too restricted.

In your case, you only work with one variable, but in general both the conditions and the expressions might refer to multiple variables. Remember that %>% is just an abbreviation, so you are really proposing

cigs 2 = match_with(cigarettes, NA ~ 0, . ~ cigarettes)

What I'm proposing would work on vectors (which could be evaluated in a data frame or elsewhere) and would make it simple to create either expressions or conditions from multiple vectors. mutate() would take care of evaluating them in the object that is its first argument.

I also think I prefer my original order for the formulas. The expression is more like a response and the condition more like an explanatory variable, so it seems to fit the model of other uses of formulas better.

hadley commented 9 years ago

I think I'd prefer having the result on the RHS:

cases( 
    !is_smoker & is.na(cigarettes) ~ 0,
    TRUE ~ cigarettes)
  )

Using TRUE as a catch all for the default seems reasonable. Or you could use a one-sided formula: ~ cigarettes). I like the use of the formula to do explicit quoting. It also means that you'll be able to program withcases()` because it wouldn't need to use NSE.

Taking an example from `derivedFactor():

cases(
  i1 == 0 ~ "abstinent",
  (i1>0 & i1<=1 & i2<=3 & sex=='female') | (i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate"
  ((i1>1 | i2>3) & sex=='female') | ((i1>2 | i2>4) & sex=='male') = "highrisk"
)

# Or by moving our the "or"s
cases(
  i1 == 0 ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
  (i1>1 | i2>3) & sex=='female') ~ "highrisk"
  (i1>2 | i2>4) & sex=='male') ~ "highrisk"
)

# You could also nest cases to make the male/female distinction clearer
cases(
  i1 == 0 ~ "abstinent",
  sex == 'female' ~ cases(
    i1>0 & i1<=1 & i2<=3 ~ "moderate",
    i1>1 | i2>3  ~ "highrisk"
  ),
  sex == 'male' ~ cases(
    i1>0 & i1<=2 & i2<=4 ~ "moderate",
    i1>2 | i2>4 ~ "highrisk"
  )
)

# Or with a default
cases(
  i1 == 0 ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
  TRUE ~ "highrisk"
)
matthieugomez commented 9 years ago

Creating a new variable (as in hadley's last post) is conceptually quite different from modifying some value of a variable (as in the OP's post) - for instance, default behaviors for what happens with non matched rows should differ (resp NA vs original value). There could two different verbs instead of cases depending on the situation, similarly to Stata recode vs replace

hadley commented 9 years ago

@matthieugomez yes, agreed. In SQL there are two forms of CASE: one works on the values of a single variable, the other takes a list of conditionals (analogous to the case function proposed here).

But the challenge is NA: the usual semantics for equality won't work here. (And there's a general problem for floating point values).

rpruim commented 9 years ago

One advantage of the expression ~ condition syntax (besides mirroring the way cases are presented notationally in books) is that often the expressions are simpler and more similar to each other, so the code looks more elegant. Also, knowing what the expression is often makes it clearer why the condition is defined the way it is. But the expressions get a bit buried in the condition ~ expression order.

cases(
  "abstitnent" ~ i1 == 0,
  "moderate"   ~ i1>0 & i1<=1 & i2<=3 & sex=='female'),
  "moderate    ~ i1>0 & i1<=2 & i2<=4 & sex=='male'),
  "highrisk"   ~ (i1>1 | i2>3) & sex=='female'),
  "highrisk"   ~ (i1>2 | i2>4) & sex=='male')
)

Whichever way the formulas are used, I like the nesting idea.

hadley commented 9 years ago

@rpruim That is true, but I think it makes nesting less natural, and you can achieve the nice alignment with lots of whitespace

cases(
  i1 == 0                              ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female' ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male'   ~ "moderate",
  (i1>1 | i2>3) & sex=='female'        ~ "highrisk"
  (i1>2 | i2>4) & sex=='male'          ~ "highrisk"
)

@matthieugomez I don't see the advantage of that over the filter based approach. Sure it save a little memory, but at the expense of making the API non-orthogonal. The goal of dplyr isn't to squeeze the last bit of performance out of R, it's to have provide a elegant and efficient API for data manipulation.

rpruim commented 9 years ago

Hmm... The nesting argument is probably reason enough to prefer condition ~ expression because in that case, the condition becomes simpler and the expression becomes more involved.

hadley commented 9 years ago

To make the semantics clear, I think each vector referenced in cases() would need to be either a scalar, or of the same length. Doing recycling in this scenario is going to be quite complex.

I'm also reasonably certain this will be straightforward to translate to SQL.

hadley commented 9 years ago

Here's a start on an implementation: https://gist.github.com/hadley/2751ba61d1c7f4eaacab

Currently it requires you pass the data frame as the first argument, but I think I can work around that. It's not very efficient, but it's a starting point.

dholstius commented 9 years ago

For what it's worth: here's a patch verb that I'm starting to use in my own work:

https://gist.github.com/holstius/cbc4ec40057fbc2d9f4b

In my workflows, I like to separate "patching" from "recoding" or "labeling". Patching is the real-world consequence of deadlines + busy upstream providers — it's not supposed to be there, but it is.

hadley commented 8 years ago

With the new if_else and coalesce functions, you could now write the original motivating example as

mutate(smoke, cigarettes = if_else(coalesce(is_smoker, FALSE), cigarettes, 0L))

And note that this discussions is really about a sql CASE equivalent for R, not a vectorised switch. The discussion for vectorised switch is now taking place in #1710.

I've also deleted a number of comments in this thread that are no longer pertinent.

hadley commented 8 years ago

And now you can do:

mutate(smoke, cigarettes = if_else(is_smoker, cigarettes, 0L, 0L))
hadley commented 8 years ago

I'm pretty happy with this implementation, but I'd love to get everyone's feedback on the name and how it works.

lionel- commented 8 years ago

These new functions are awesome!

case_when()

It could be named match_cases() or match_patterns().

I find the tilde a bit hard to parse visually with a long list of cases. Maybe allow the := operator? The only thing to do would be to export :=` <- `~.

match_cases(
  x %% 35 == 0 := "fizz buzz",
  x %% 5 == 0 := "fizz",
  x %% 7 == 0 := "buzz",
  TRUE := as.character(x)
)
recode()

Would it be too unsafe to coerce the default value automatically?

recode(x, 5, 4, 3, default = x)
#> Error: `default` has type 'integer' not 'double'

recode(x, "a", "b", "c", default = x)
#> Error: `default` has type 'integer' not 'character'

recode(x, "a", "b", "c", default = NA)
#> Error: `default` has type 'logical' not 'character'

against the more verbose:

recode(x, 5, 4, 3, default = as.double(x))
recode(x, "a", "b", "c", default = as.character(x))
recode(x, "a", "b", "c", default = NA_character_)
hadley commented 8 years ago

I think I prefer ~ for now - you can always add more spaces around it to make it more obvious.

I'm also in an explicit mood at the moment, so I'd prefer to require the default to be correctly typed. Also the logic starts to get a bit hairy if there are no arguments (just the default).

lionel- commented 8 years ago

I think recode() will be very hard to use with numeric data. It would be great to be able to do:

recode(x, `1` = 5, `2` = 4)

This would ideally work with both integers and integer-like doubles (as often integer data is stored in a double).

gergness commented 8 years ago

Agreed, looks great!

What about using the formula notation from case_when for recode? This would help with lionel's point about numeric data, and also could prevent name conflicts for x, default and missing.

x <- c("x", "y", "z")
recode(x, x = "a", y = "b", z = "c")
#>Error: All replacements must be named
kevinushey commented 8 years ago

I also don't like using named arguments for recode() as the names are forced to be / become symbols, whereas ~ or := preserve the LHS type. And recode(x,1= 2) is pretty ugly.

hadley commented 8 years ago

@kevinushey yeah, but I don't see that being a particularly common use case. And I think using ~ means that we'll need to do some work to make case_when() work with the hybrid evaluator.

krlmlr commented 8 years ago

In 4920c8b5f, I need to use .$ to access columns when using case_when() in mutate() -- is this intended?

> mtcars %>% tbl_df %>%  mutate(size = case_when(cyl > 6 ~ "big", TRUE ~ "small"))
Error: object 'cyl' not found
> mtcars %>% tbl_df %>%  transmute(size = case_when(cyl > 6 ~ "big", TRUE ~ "small"))
Error: object 'cyl' not found
> mtcars %>% tbl_df %>%  transmute(size = case_when(.$cyl > 6 ~ "big", TRUE ~ "small"))
Source: local data frame [32 x 1]

    size
   (chr)
1  small
2  small
3  small
4  small
5    big
6  small
7    big
8  small
9  small
10 small
..   ...

(Is this meant by "some work with the hybrid evaluator"?)

krlmlr commented 8 years ago

Re formula notation for recode(): It's also a question of API consistency.

hadley commented 8 years ago

@krlmlr see #1719

I think the point of if_else, recode and case_when is not to have exactly equivalent interfaces but to solve a series of increasingly more complicated/general situations that each require progressively more syntax.

krlmlr commented 8 years ago

I agree to "more syntax", but to me this doesn't necessary imply that each syntax should be different.

A different implementation of case_when(), that is also consistent with if_else() and perhaps doesn't suffer from #1719, would be

case_when <- function(condition, result, ...) {
  if (missing(condition)) NULL
  else
    if_else(condition, result, case_when(...))
}

(replace the if_else() call with its body).

sfirke commented 8 years ago

re: @krlmlr at https://github.com/hadley/dplyr/issues/631#issuecomment-197224743, is .$ indeed the intended syntax? I ask because I was going to suggest adding an example of case_when with mutate to the documentation. Use with mutate will be common and some users might need a pointer toward .$. since in their current nested ifelse calls they don't need it.

Example:

mtcars %>% mutate(
  efficiency = case_when(
    .$mpg < 22 ~ "low",
    .$mpg <= 26  ~ "moderate",
    TRUE ~ "high"
  )
)
hadley commented 8 years ago

@sfirke no - the goals is for case_when() to work as is inside mutate().

bergsmat commented 8 years ago

Loving case_when, but encountering results I don't fully understand (dplyr 3074c).

x <- expand.grid(y=5,z = c(T,F))
x <-  group_by(x,z)
mutate(          x, y = case_when(   z ~ n()         )) # (no error)
mutate(          x, y = case_when(   z ~ n(),!z ~ 0L )) # (no error)
mutate( ungroup(x), y = case_when(   z ~ n(),!z ~ 0L )) # object 'z' not found
mutate(          x, y = case_when(   z ~ 1L          )) # object 'z' not found
mutate(          x, y = case_when( .$z ~ 1L          )) # object 'z' not found
hadley commented 8 years ago

Yes, it doesn't yet work inside mutate.

mvabl-steve commented 6 years ago

I've been teaching myself R from scratch, basically by undertaking something, then reading posts like these, and trial-and-error based on that. Sometimes I hit a wall and reach out.

I've hit a wall. I have dplyr 0.7 installed. I have a tibble with a column - call it contract_key - I added by applying mutate(coalesce()) to three other columns in the tibble. Here's sample data, which I'll of course refer to later as df:

product,supplier,vendor,quantity,gross,contract_key 168145850127,person1,org1,7,8.14E-04,person1 655393265191,person5,org2,7,0.041861099,person5 655393265191,person3,org3,5,0.019268984,person3 168145850127,person4,org2,4,0.023245343,org2 928893912852,person5,org2,2,0.002285355,person5 NA,person6,org1,3,NA,person6 168145850127,person7,org1,1,NA,person7 350468621217,person8,org2,2,0.004212372,person8 977939797847,person9,org1,1,0.001832018,person9 168145850127,person10,org3,6,0.043907375,168145850127 168145850127,person5,org2,3,0.020960582,168145850127 168145850127,person5,org2,1,0.006923272,person5 350468621217,person11,org1,2,0.002730578,350468621217 350468621217,person5,org2,1,0.005505893,person5 977939797847,person11,org1,18,0.023260808,org1 350468621217,person12,org2,1,0.003525162,org2 168145850127,person13,org4,1,0.001932924,person13 655393265191,person5,org2,7,0.016847417,person5 NA,person14,org5,2,NA,NA 928893912852,person5,org2,2,4.22E-04,person5 168145850127,person5,org2,1,5.82E-04,person5 168145850127,person5,org2,1,0.007305813,168145850127 866586597461,person5,org2,8,0.051104768,866586597461 977939797847,person5,org2,6,0.007783279,person5 866586597461,person5,org2,4,0.0225,866586597461

Here's how I generated contract_key:

df <- df %>% mutate(contract_key = coalesce(product,supplier,vendor))

I now want to add another column that categorizes the contents of contract_key based on which of the three columns provided the content (through coalesce()). So if contract_key ="person5", for example, the new column, contract_level, would be "supplier". And contract_key="org2" would map to contract_level = "vendor", etc.

Essentially, I'll be using contract_level as a join variable to another tibble.

I'm stumped. I've tried if_else, and I see that I shouldn't bother trying case_when (because it's inside mutate()).

It's probably basic R syntax that I just don't know. But if someone supplies the answer, I will backtrace until I figure out what you did. (And I'll have learned a new lesson in R!)

Thanks!

mvabl-steve commented 6 years ago

thanks @friep. i didn't realize the unwritten rules for SO v git. (Seriously, I didn't. Sincere thanks.)

I'll try SO, and all its attendant downvoting and judging! :)