r-devel / r-project-sprint-2023

Material for the R project sprint
https://contributor.r-project.org/r-project-sprint-2023/
17 stars 3 forks source link

grep(pattern, ...) et al.: Optionally escalate warning on "argument 'pattern' has length > 1 ..." to an error #44

Open hturner opened 1 year ago

hturner commented 1 year ago

Discussed in https://github.com/r-devel/r-project-sprint-2023/discussions/8

Originally posted by **HenrikBengtsson** July 1, 2023 # Issue The `grep()` family of functions produces a warning if `pattern` specified more than one regular expression, e.g. ```r > pattern <- c("Volvo", "Fiat") > grep(pattern, rownames(mtcars)) [1] 32 Warning message: In grep(pattern, rownames(mtcars)) : argument 'pattern' has length > 1 and only the first element will be used ``` I cannot come up with a valid use case where using `length(pattern) != 1` is not a mistake. Because of this, I argue it's a bug. Because warnings might be ignored, or drown in the noise of other warnings, this is almost as problematic as a silent bug. There is a risk that there is code out there that produces incorrect results, without being noticed. I've checked R 4.3.1 and found the following functions with this lax behavior: In the **base** package: * [ ] `agrep()` * [ ] `agrepl()` * [ ] `grep()` * [ ] `grepl()` * [ ] `grepRaw()` * [ ] `regexpr()` * [ ] `gregexpr()` * [ ] `regexec()` * [ ] `gsub()` * [ ] `sub()` * [ ] ...? There might be more. # Suggestion ## For the sprint As a starter, introduce an "internal" environment variable, e.g. `_R_CHECK_LENGTH_1_PATTERN_`, that can be used to escalate the warning to an error: ```r > Sys.setenv("_R_CHECK_LENGTH_1_PATTERN_" = "true") > pattern <- c("Volvo", "Fiat") > grep(pattern, rownames(mtcars)) [1] 32 Error in grep(pattern, rownames(mtcars)) : argument 'pattern' has length != 1 ``` ## Long-term I'd argue that using `length(pattern) > 1` is a bug, similar to how `length(cond) > 1` is a bug in `if (cond) { ... }`. For the latter, we started out with `_R_CHECK_LENGTH_1_CONDITION_=true` as above in R (>= 3.4.0), but since that was too harsh to enable on CRAN, support for per-package checking was added as: ``` _R_CHECK_LENGTH_1_CONDITION_=package:_R_CHECK_PACKAGE_NAME_,verbose ``` in R (>= 4.0.0). That would only escalate to an error, if the buggy code was part of the package tested, otherwise it remained a warning. This allowed CRAN to have packages fixed one by one without breaking the package ecosystem. Eventually, enough packages were fixed, and they made `_R_CHECK_LENGTH_1_CONDITION_=true` the default. And in R (>= 4.2.0), erroring was the only outcome and `_R_CHECK_LENGTH_1_CONDITION_` could be dropped. A similar path forward can be made to fix `length(pattern) > 1` bugs, which means the next step after the above would be to add support for: ``` _R_CHECK_LENGTH_1_PATTERN_=package:_R_CHECK_PACKAGE_NAME_,verbose ``` There is no need to implement the latter during the sprint. Having `_R_CHECK_LENGTH_1_PATTERN_=true` will be a big step forward. Comment: I think it was @kalibera who implemented `_R_CHECK_LENGTH_1_CONDITION_=package:_R_CHECK_PACKAGE_NAME_,verbose`. PS. This proposal was taken from .