r-lib / rlang

Low-level API for programming with R
https://rlang.r-lib.org

Can we use sym() with a vector of strings? #321

Closed md0u80c9 closed 6 years ago

md0u80c9 commented 6 years ago

Hi,

I've been trying to solve a problem using tidyeval and found that sym can be used with either a string or a list of strings, but weirdly not with a vector of strings (or, more precisely, a named vector of strings). I think this would be a useful addition, using the same syntax as sym().

It may help to describe the problem I've been trying to solve, both to explain why I'm using vectors of strings at all and in case there is a much better way of solving my problem that makes the need for sym with a vector moot.

I'm working on a tibble of approximately 300,000 rows and 300 columns looking at stroke care (each row is a stroke patient admission). We have an initially large dataset for which we then produce a series of calculated fields (i.e. summations or other calculations on the raw data). We then undertake a load of aggregated summary calculations to produce Key Indicators which are used in reports, with the data grouped by a team and a time period (e.g. quarterly or monthly, depending on the report). Each Key Indicator is measured in two ways: one 'Patient Centred - PC' result and one 'Team Centred - TC' result. A patient's stroke care can be delivered by multiple different teams, so the team measures record only the care delivered by that organisation, whereas the patient measures record care delivered to any patient who comes in contact with a team, regardless of whether the team delivered it or not. So, PC and TC fields are paired - we usually want either a summary of just the TC fields, or of both TC and PC fields for other reports.

Some of the calculations are complex and need to be right, so to make unit testing easier I've structured the code so that sets of indicators are grouped into domains. I've then used the quos command to construct a list of functions for each domain, which we then join together to carry out one giant mutate (for calculated fields) or summarise (for the aggregated values) in one go.

So for one of the domains, here are the function definitions, and the variable names set as strings (bear in mind there will be another 8 or 9 files with further function definitions - many more complex than these):

d1ki_aggr_value_names <- c(
  kiclockstarttobrainimagingmins = "KIClockStartToBrainImagingMins",
  kibrainimagingwithin1hr = "KIBrainImagingWithin1hr",
  kibrainimagingwithin12hrs = "KIBrainImagingWithin12hrs"
)

d1ki_aggr_value_functions <- rlang::quos(
  KIMedianBrainImagingTime = median(
    !!d1ki_aggr_value_names["kiclockstarttobrainimagingmins"], na.rm = TRUE),
  KIPCScannedIn1Hr = nonNApercentage(
    !!d1ki_aggr_value_names["kibrainimagingwithin1hr"]),
  KIPCScannedIn12Hrs = nonNApercentage(
    !!d1ki_aggr_value_names["kibrainimagingwithin12hrs"])
)

We join d1ki_aggr_value_functions to similar lists of functions, and perform a dplyr::summarise operation on the source data. For the unit tests, you create a minimal test set and perform just the one set of functions on the test set. So far so good.

Obviously names is starting life as a vector of named strings here. Each of the strings starting with 'KI' is actually the suffix of a pair of column names: TCKIClockStartToBrainImagingMins and PCKIClockStartToBrainImagingMins for example. My cunning plan was that we could append the prefix to the strings using paste0, then turn all the strings into symbols, merge the functions quos lists together for all the domains, and execute the summary for one or both of the pair of results. The purpose of having the vector of strings is hopefully therefore obvious from the example.

Of course then sym doesn't work on vectors so we'd have to loop through each one individually.

Firstly, is there a reason sym doesn't work on vectors of strings (or could it be something which could be implemented)? Secondly - are there any better suggestions for the implementation above?

david-jankoski commented 6 years ago

just as an idea, not sure if you have tried this - but you could maybe look into dplyr::summarise_at() which has

md0u80c9 commented 6 years ago

Thanks David. I did consider this - the problem is that we aren't applying the same function to every column, so I think you'd end up needing to call summarise_at for each of the column pairs. My aim was to get the summary done with as few dplyr calls as possible (to avoid copying the tibble), whilst maintaining a way to easily unit test the functions in smaller, reproducible chunks (hence the abstraction of the summary functions from the summarising action).

I did realise the somewhat obvious after posting that I can of course make a list of strings and refer to them with d1ki_aggr_value_names[["kibrainimagingwithin12hrs"]]. It does however seem syntactically clumsy to have a list of strings rather than a vector of strings.

lionel- commented 6 years ago

I think this question would be better posted on stackoverflow or community.rstudio.com. Also I would suggest presenting the problem as succinctly as possible. You can use syms() on character vectors, if that helps.
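As a rough sketch of how syms() could fit the prefix plan described earlier in the thread: the toy tibble `admissions`, the helper `make_d1_summaries()`, and the single indicator shown are hypothetical stand-ins, not the poster's real data or code. It assumes rlang and dplyr are installed.

```r
library(rlang)
library(dplyr)

# Hypothetical toy data standing in for the real admissions tibble.
admissions <- tibble(
  Team = c("A", "A", "B"),
  TCKIClockStartToBrainImagingMins = c(30, 50, 70),
  PCKIClockStartToBrainImagingMins = c(35, 55, 75)
)

ki_names <- c(kiclockstarttobrainimagingmins = "KIClockStartToBrainImagingMins")

# Hypothetical helper: build the summary expressions for one prefix.
make_d1_summaries <- function(prefix) {
  # paste0() drops names, so restore them after converting to symbols.
  cols <- syms(paste0(prefix, ki_names))
  names(cols) <- names(ki_names)

  quos(
    !!paste0(prefix, "KIMedianBrainImagingTime") :=
      median(!!cols[["kiclockstarttobrainimagingmins"]], na.rm = TRUE)
  )
}

# Splice both sets of expressions into a single summarise() call.
out <- admissions %>%
  group_by(Team) %>%
  summarise(!!!make_d1_summaries("TC"), !!!make_d1_summaries("PC"))
```

Because each domain's expressions are built by a plain function, a unit test can call that function on a minimal data set without touching the other domains.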

to avoid copying the tibble

There is no overhead on tibble copying. Columns will only be copied individually when at least one element changes.

md0u80c9 commented 6 years ago

Thanks lionel-.

Sorry - my query was more about the action and return values of sym rather than the problem itself (which was described just to demonstrate the use case and to make sure it wasn't due to an error in the way I was using sym first). Apologies for not compressing the example down more.

If you play around with a function like paste0 which behaves in the way I would have expected, you get the following actions:

test <- c('foo', 'bar')
paste0('Hello ',test)
#> (returns a character vector: 'Hello foo', 'Hello bar')

sym(test)
#> Error: Only strings can be converted to symbols (ie. sym singular can't be used on vector).

You can use syms(test). But then you'll get a list, not a vector:

[[1]]
foo

[[2]]
bar

Edit: The above was with rlang 0.1.4 just in case the very latest work not on CRAN differs.

lionel- commented 6 years ago

I am not sure what you expect sym() should return on character vectors? (I have edited your post with markdown code fences for readability)

md0u80c9 commented 6 years ago

Shouldn't it be able to return a vector of syms rather than a list?

lionel- commented 6 years ago

There are no atomic vectors of symbols in R ;)
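A small check of this point, as a sketch assuming only that rlang is installed (the variable names are illustrative): a symbol is a single language object, so the only way to hold several of them is a list.

```r
library(rlang)

one <- sym("foo")               # a single symbol
many <- syms(c("foo", "bar"))   # a list holding two symbols

typeof(one)        # "symbol"
is.atomic(one)     # FALSE: symbols are not atomic vectors
typeof(many)       # "list"

# Even base c() falls back to a list when combining symbols.
combined <- c(one, sym("bar"))
typeof(combined)   # "list"
```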

md0u80c9 commented 6 years ago

Aha! OK - that explains it! Thanks very much!

lionel- commented 6 years ago

You can still use !!! with the list of symbols, e.g. select(mtcars, !!!syms(c("cyl", "am"))).
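Spelled out as a runnable sketch (assuming dplyr and rlang are installed; `cols` and `picked` are illustrative names):

```r
library(dplyr)
library(rlang)

# syms() turns each string into a symbol and returns them as a list.
cols <- syms(c("cyl", "am"))

# !!! splices the list so select() sees select(mtcars, cyl, am).
picked <- select(mtcars, !!!cols)
names(picked)
```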

md0u80c9 commented 6 years ago

Yeah I've actually done that in the code last night (after posting) and that does work. It just appeared odd in the syntax as I was mentally expecting to still be using a vector and suddenly had a list.

lcmercado commented 6 years ago

First time I use syms and !!!. I was also struggling to find out why sym did not work for a vector of strings. Thanks!

iago-pssjd commented 5 years ago

I get an error related to comments here. I have no problem if I use the instruction

> anscombe %>% 
   mutate(sum_x = rowSums(map_dfc(
     1:4,
     ~ anscombe[[paste0("x",.)]]>10
   ), na.rm = T))
   x1 x2 x3 x4    y1   y2    y3    y4 sum_x
1  10 10 10  8  8.04 9.14  7.46  6.58     0
2   8  8  8  8  6.95 8.14  6.77  5.76     0
3  13 13 13  8  7.58 8.74 12.74  7.71     3
4   9  9  9  8  8.81 8.77  7.11  8.84     0
5  11 11 11  8  8.33 9.26  7.81  8.47     3
6  14 14 14  8  9.96 8.10  8.84  7.04     3
7   6  6  6  8  7.24 6.13  6.08  5.25     0
8   4  4  4 19  4.26 3.10  5.39 12.50     1
9  12 12 12  8 10.84 9.13  8.15  5.56     3
10  7  7  7  8  4.82 7.26  6.42  7.91     0
11  5  5  5  8  5.68 4.74  5.73  6.89     0

But, if I try to avoid mentioning the data frame anscombe inside map_dfc with rlang::sym, I get

> anscombe %>% 
   mutate(sum_x = rowSums(map_dfc(
     1:4,
     ~ !!sym(paste0("x",.))>10
   ), na.rm = T))
Error: Only strings can be converted to symbols
Call `rlang::last_error()` to see a backtrace

I understand that paste0("x",.) is just a string, not a vector of more than 1 element. I also tried paste0("x",., collapse=""), but the result is even worse.

Why does this happen? Could you help me avoid this issue?

Thank you very much!

lionel- commented 5 years ago
  1. !! is executed by mutate(). At which point, . is a data frame.

  2. The data frame gets converted to a character vector by paste0(). Try running paste("foo", mtcars) to see an example of such a conversion.

  3. sym() gets called with the character vector.

We've been contemplating not unquoting beyond calls to function and ~ to avoid this sort of unexpected evaluation timing. Just opened an issue in https://github.com/r-lib/rlang/issues/845 to keep track.
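Steps 2 and 3 can be reproduced outside of mutate(). This is only a sketch: it passes anscombe directly where the pipeline would have supplied `.`, and it deliberately does not check the exact error message, since the wording may differ between rlang versions.

```r
# paste0() coerces a data frame column by column, so the result has one
# string per column rather than a single string.
pasted <- paste0("x", anscombe)
length(pasted)   # 8, one element per column of anscombe

# sym() accepts only a single string, so the length-8 vector is rejected.
result <- tryCatch(
  rlang::sym(pasted),
  error = function(e) "error"
)
result
```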

lionel- commented 5 years ago

To avoid this, just create the function outside of mutate() and pass it to map_dfc().

lionel- commented 5 years ago

To avoid this, just create the function outside of mutate() and pass it to map_dfc().

oops, now that I read your code again, I see this won't work. Something along the lines of:

anscombe %>%
   mutate(sum_x = rowSums(map_dfc(1:4, ~ anscombe[[paste0("x", .)]] > 10), na.rm = T))

Note that I refer to anscombe here because the subscripts of .data[[ are evaluated exactly at the same time as !!. We did this to solve difficult issues in other cases, but it does make it trickier to use with purrr-in-mutate patterns because of the issues of evaluation timing.

In general, I'd suggest reviewing your approach because this map_dfc + rowSums + mutate pattern is quite complicated to follow and get right.
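One simpler formulation, offered only as an assumption about what such a review might arrive at (it is not something proposed in the thread), selects the columns by name and compares the whole block at once in base R:

```r
# Build the column names once, then compare the whole block in one step.
x_cols <- paste0("x", 1:4)

anscombe_flagged <- anscombe
# anscombe[x_cols] > 10 is a logical matrix; rowSums() counts TRUEs per row.
anscombe_flagged$sum_x <- rowSums(anscombe[x_cols] > 10, na.rm = TRUE)
```

This avoids both unquoting and purrr, so evaluation timing is no longer a concern.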

iago-pssjd commented 5 years ago

Thank you, Lionel.

When I wrote map_dfc(1:4, ~ !!sym(paste0("x",.))>10) I thought that . referred to the integers 1:4, as in the working code above. If . is the data frame when executed with !!, is there currently some way to refer to the list 1:4 introduced as the first input of map_dfc? I understand that the issue that you opened is about this problem. Thanks for it.

In general, I'd suggest reviewing your approach because this map_dfc + rowsums + mutate pattern is quite complicated to follow and get right.

In fact, I was looking for an (enlightening) alternative to reshaping the data frame to long format, applying the condition of interest, and then returning to the original shape. I was trying to shorten it as much as possible, and this approach was the best working one that I found.

Thank you, anyway, and I'll follow issue #845 closely.

lionel- commented 5 years ago

You could use get(paste0("x", .)) > 10, maybe that's the best way of doing this actually!
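A sketch of the get() idea, assuming dplyr is installed. To keep dependencies down it uses base sapply() with an ordinary anonymous function rather than map_dfc() with a ~ lambda; the lookup works because the function is created inside mutate(), so get() can see the data frame's columns.

```r
library(dplyr)

result <- anscombe %>%
  mutate(
    # For each i, look up column "x<i>" by name and test it against 10.
    # sapply() returns an 11 x 4 logical matrix; rowSums() counts TRUEs.
    sum_x = rowSums(sapply(1:4, function(i) get(paste0("x", i)) > 10))
  )
```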