renkun-ken / pipeR

Multi-Paradigm Pipeline Implementation
Other
169 stars 39 forks source link

Handling incorrect usage of `.` (dot) #53

Closed kohske closed 10 years ago

kohske commented 10 years ago

On branch 0.5, I found this behavior.

> 1:3 %>>% sum(.)
[1] 12

Obviously this is incorrect usage of ., but anyway people will do this mistake very often. So it'd be better to invoke error.

This is because, in pipe_first,

 eval(as.call(c(fun[1L],quote(.),fun[-1L])),
    envir=list(.=x),enclos = envir)

is finally parsed as

as.call(c(sum, ., .), envir = list(. = x), enclos = envir)

Note that this behavior occurs when the functions take variable length of arguments.

kohske commented 10 years ago

And here is another bad scenario. If anyone has . in global env and try to use it inside pipe, the behavior would be unexpected.

> . <- "greater"
> 1:3 %>>% t.test(alternative = .)
Error in match.arg(alternative) : 
  'arg' must be NULL or a character vector
> .. <- "greater"
> 1:3 %>>% t.test(alternative = ..)

    One Sample t-test

data:  .
t = 3.4641, df = 2, p-value = 0.03709
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 0.3141455       Inf
sample estimates:
mean of x 
        2 

One possible way to avoid this is using very very rare name (like GUID) instead of . in pipe_first:

 eval(as.call(c(fun[1L],quote(tcwMCyDiaABizOhvCSpjHITWQRRgKafCmpeIbBvSXV),fun[-1L])),
    envir=list(tcwMCyDiaABizOhvCSpjHITWQRRgKafCmpeIbBvSXV=x),enclos = envir)
renkun-ken commented 10 years ago

Thanks, @kohske! It is always a big headache to use a unusual symbol like . to represent something in an expression. If one does not read the documentation about the basic syntactic rules, such misuse can be hard to avoid.

pipeR's syntax for first-argument piping is quite simple: if a function (name or call) follows, left is always piped to the first argument on right. This rule is designed in particular to avoid symbol detection in the function arguments because:

  1. that function may use non-standard evaluation and has special interpretation for that symbol too. Such detection will mess up non-standard evaluation very quickly and unexpectedly.
  2. the behavior of the function call can become very unpredictable for fresh users, for example, if 1:10 %>>% sum(.) means sum(1:10), one then expects that 1:10 %>>% sum(.^2) also means sum((1:10)^2), so to what extent, the first argument piping should play? There's no place for first argument piping any more.
  3. the detection works at huge expense of performance.
> 1:3 %>>% sum(.)
[1] 12

If this should fail or raise warning, what about dplyr::do() and rlist::list.sort(), both use nonstandard evaluation and give . their own interpretations?

dplyr::do() interprets . as a unit data frame (maybe grouped)

> library(dplyr)
> mtcars %>>% group_by(vs) %>>% do(data = ., model = lm(mpg ~ wt + cyl, data = .))
Source: local data frame [2 x 3]
Groups: <by row>

  vs                         data   model
1  0 <S3:tbl_df, tbl, data.frame> <S3:lm>
2  1 <S3:tbl_df, tbl, data.frame> <S3:lm>

rlist::list.sort() and most other functions interpret . as an element in the given list.

> library(rlist)
> mtcars$mpg %>>% list.sort(.)
 [1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.2 19.7
[19] 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 30.4 30.4 32.4 33.9
> mtcars$mpg %>>% list.sort(desc(.))
 [1] 33.9 32.4 30.4 30.4 27.3 26.0 24.4 22.8 22.8 21.5 21.4 21.4 21.0 21.0 19.7 19.2 19.2 18.7
[19] 18.1 17.8 17.3 16.4 15.8 15.5 15.2 15.2 15.0 14.7 14.3 13.3 10.4 10.4

If such use as x %>>% f(.) causes warnings, there can be too many.

For the second comment, it is indeed an issue here. The feature of first-argument piping comes with . piping together to facilitate potential use in multiple place within a function call. One walk around is to use lambda expression to avoid symbol clash, another is a function that I'm considering to support, I(expr) which directly evaluates expr in the parent environment so that your code may be written like

. <- "greater"
1:3 %>>% t.test(alternative = I(.))

I've actually considered the GUID (or rare name) solution but it works poorly with functions that record their calling expressions such as plot(), lm() etc. If such messy names are used, your plot title will by default a messy name like this :)

Therefore I still need to think hard on how we could avoid potential misuse of features by users who do not read the documentation while at the same time ensure that the code works correctly if they are used correctly.

Anyway, thanks for your kind remind!

kohske commented 10 years ago

Just a quick comment.

I meant that 1:3 %>>% sum(.) should be 6. But at the moment 1:3 %>>% sum(.) returns 12. And this is because the function is evaluated as sum(1:3, 1:3) instead of sum(1:3).

With dplyr, 1:3 %>% sum(.) returns 6, I've not yet inspect the internal of dplyr though.

renkun-ken commented 10 years ago

For magrittr's %>%, it works in that way because if %>% detects any naked . in the function call, it will switch to dot piping rather than first-argument piping.

Please try more with it such as

> 1:3 %>% sum(.)
[1] 6
> 1:3 %>% sum(.+1)
[1] 15
> 1:3 %>% c(.)
[1] 1 2 3
> 1:3 %>% c(.+1)
[1] 1 2 3 2 3 4
> 1:3 %>% c(., mean(.))
[1] 1 2 3 2
> 1:3 %>% c(min(.), max(.))
[1] 1 2 3 1 3
> 1:3 %>% c((.))
[1] 1 2 3 1 2 3

If many of the arguments of a function need to be specified, and a . is hidden in a big mess of arguments, it can be hard to find out which mechanism is exactly used.

Maybe this issue is helpful: https://github.com/renkun-ken/pipeR/issues/18

kohske commented 10 years ago

Thanks for the reply!! In my environment, magrittr 1.0.1, 1:3 %>% sum(.+1) throws an error. Which version of magrittr?

> 1:3 %>% c(.)
[1] 1 2 3
> 1:3 %>% mean(.)
[1] 2
> 1:3 %>% c((.))
Error in eval(expr, envir, enclos) : object '.' not found
> 1:3 %>% c(mean(.))
Error in mean(.) : object '.' not found
renkun-ken commented 10 years ago

I was commenting with the dev version of magrittr which tries to support nested ., CRAN version and version imported by dplyr simply does not support nested ., which may be a good choice. But it brings a problem:

If users see sum(.) works, they may expect sum(.+1) to work too, because they don't see it "nested" while in fact it is sum(+(.,1)).

Currently, I can only beg users to read the docs first and know the basic syntax rules before misuses happen.

kohske commented 10 years ago

Thanks, I found the same behavior with yours. So now magrittr has same problem on dev branch... This is absolutely unexpected behavior for users, and even worse, user is difficult to be aware of this behavior...

> 1:3 %>% c(.)
[1] 1 2 3
> 1:3 %>% c(.+1)
[1] 1 2 3 2 3 4

> 1:3 %>>% c(.)
[1] 1 2 3 1 2 3
> 1:3 %>>% c(.+1)
[1] 1 2 3 2 3 4
renkun-ken commented 10 years ago

Yes, there are some prices to pay if you want any convenience. It's better that the price or consequence is predictable and easy to comprehend.

So I have to stress the rules like:

If you directly put a function there, it is ALWAYS first-argument piping!

A tip is Don't use . when it is unnecessary.*

magrittr's author has to stress the rule like:

If you put a function there, it may be first-argument piping or dot piping: if a naked dot is detected in the argument, it will be dot piping, or otherwise it would still be first-argument piping.

A tip is Be careful when you use arithmetic operations because they make nested call too....

kohske commented 10 years ago

But this is too expensive ;-p Getting wrong results without knowing is much worse than getting nothing...

If I'll implement, I'd choose the old magrittr way. But anyway I try to find better solution.

renkun-ken commented 10 years ago

Think the current best solution is to remember the syntax rules... :)

Or I have to shut down the feature that . is supported in first-argument piping. In this way, there won't be problem any more. :p

The ultimate tip is that Write in the simplest form allowed...

If you are allowed to write 1:3 %>>% sum and 1:3 %>>% sum(), don't write 1:3 %>>% sum(.), I see that . is written on purpose, adds no value, leads to unexpected results due to misuse.

An interesting problem with the title of this issue is that it is hard for the function to tell which usage is incorrect... due to a lack of knowledge of your purpose so it only can assume that you know the rules.

kohske commented 10 years ago

Think the current best solution is to remember the syntax rules... :)

Exactly :-)

Note that at the moment, using dot with first-argument piping always suffers from danger. The list.sort example, although it looks working correctly, the list.sort is actually called as list.sort(mtcars$mpg, .).

renkun-ken commented 10 years ago

:) yes, and list.sort() does not work with %>% in this simple case because %>% interprets it as the user wants to put the data here, so list.sort(mtcars$mpg) which does not work while list.sort() interprets . as the user wants to sort the numbers directly rather than after some transformation.

Non-standard evaluation allows different functions to have different interpretations of symbols and expressions. But when two interpretations clash, everything goes wrong. That's why pipeR tries to interpret as little as possible to avoid such interpretation clash.

yanlinlin82 commented 10 years ago

Is there any way or possible solution to check such problems before running the commands/procedures in R?

renkun-ken commented 10 years ago

@yanlinlin82, I'm thinking about it too. The problem is it is a problem for some functions, but it is not a problem for others. For sum() this usage is obviously a misuse, but for rlist::list.sort() and dplyr::do() it may not be a misuse at all because these functions have their own interpretation of . inside their function calls.

If we have to detect such seemingly problematic usages, the only way I come up with is to build a blacklist or white-list of functions to indicate whether such usage is correct or not. The idea is clearly very stupid because there are countless functions and even more combinations and possible nested calls.

If you have a smarter solution to detect such problem, please let me know, at least this can be a challenge. Clearly this is a tradeoff between features and people may have mixed feelings about it.

yanlinlin82 commented 10 years ago

I guess every function should be responsible for such checking (if it has to), since their interpretations are different. Considering the huge cost of performance, a pre-running parse should be ideal, especially supporting from the basic mechanism (I do not know if such thing exists or not).

If no, my so far best try would be using some kind of global variable to mark if the checking has been done (for current calling), which, however, may be too ugly to implement and also will reduce a little performance. For the latter, my experience in C/assembly would guide me to use directly a function pointer instead of the flag, which could make the code uglier :)

renkun-ken commented 10 years ago

I reopen this issue for further discussion.

renkun-ken commented 10 years ago

There seems no more discussion on this.