r-lib / rex

Friendly regular expressions for R.
https://rex.r-lib.org

Case-insensitive regex? #85

Open MichaelChirico opened 1 year ago

MichaelChirico commented 1 year ago

Is there a way to specify a regex is case-insensitive in {rex}?

We are passing it to list.files(pattern=) so the normal arguments are not available -- the only approach would be to add (?i) AFAICT. Without it, the regex is substantially gnarlier:

rex(".", or(group(one_of("Rr"), or("", "html", "md", "nw", "rst", "tex", "txt")), "Qmd", "qmd"), end)
# vs (not exactly the same, but that's fine)
rex(".", ignore_case(or("r", "rhtml", "rmd", "qmd", "rnw", "rrst", "rtex", "rtxt")), end)
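
For comparison, here is a minimal base-R sketch of the (?i) workaround mentioned above, assuming a hand-written pattern string rather than rex output. The file names are made up for illustration, and grepl() with perl = TRUE stands in for list.files(pattern = ), so this only checks the pattern itself:

```r
# Hypothetical file names, for illustration only.
files <- c("a.R", "b.Rmd", "c.qmd", "d.QMD", "e.txt", "f.rtxt", "g.RHTML")

# The inline (?i) flag makes everything after it case-insensitive.
exts <- c("r", "rhtml", "rmd", "qmd", "rnw", "rrst", "rtex", "rtxt")
pattern <- paste0("\\.(?i)(", paste(exts, collapse = "|"), ")$")

# perl = TRUE is an assumption of this sketch (PCRE syntax).
matched <- grepl(pattern, files, perl = TRUE)
print(files[matched])
```

Every name except "e.txt" is selected, regardless of the case of its extension.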

kevinushey commented 1 year ago

I don't think there is, but I agree this would be nice to have.

The main thing I'm not aware of -- does (?i) apply to the whole regular expression, or to just the following "piece", or something else? What's the best syntax to adopt here?

MichaelChirico commented 1 year ago

It can be de-activated with (?-i), e.g.

grepl("(?i)a(?-i)A", c("aa", "aA", "Aa", "AA"))
# [1] FALSE  TRUE FALSE  TRUE

From ?regex:

Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x). These can be concatenated, so for example, (?im) sets caseless multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx). These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set ‘ungreedy’ mode (so matching is minimal unless ? is used as part of the repetition quantifier, when it is greedy). Initially none of these options are set.
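
A few quick checks of the modes described in that passage, as a sketch that assumes perl = TRUE throughout (the test strings are made up):

```r
# (?i) only affects the part of the pattern that follows it.
tail_only <- grepl("a(?i)b", c("ab", "aB", "Ab", "AB"), perl = TRUE)

# (?m) lets ^ match at the start of any line, not just the start of the string.
line_default <- grepl("^b", "a\nb", perl = TRUE)
line_multi   <- grepl("(?m)^b", "a\nb", perl = TRUE)

# (?s) lets the dot match a newline.
dot_default <- grepl("a.b", "a\nb", perl = TRUE)
dot_single  <- grepl("(?s)a.b", "a\nb", perl = TRUE)

# (?U) flips quantifiers to minimal matching.
greedy   <- regmatches("aaa", regexpr("a+", "aaa", perl = TRUE))
ungreedy <- regmatches("aaa", regexpr("(?U)a+", "aaa", perl = TRUE))

print(tail_only)                   # TRUE TRUE FALSE FALSE
print(c(line_default, line_multi)) # FALSE TRUE
print(c(dot_default, dot_single))  # FALSE TRUE
print(c(greedy, ungreedy))         # "aaa" "a"
```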

MichaelChirico commented 1 year ago

It also applies locally within a group:

grepl("((?i)a)A", c("aa", "aA", "Aa", "AA"))
# [1] FALSE  TRUE FALSE  TRUE
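
PCRE also has a non-capturing scoped form, (?i:...), which is not mentioned in this thread. It limits the flag to the group without adding a capture group. A small sketch comparing the two, assuming perl = TRUE:

```r
x <- c("aa", "aA", "Aa", "AA")

# Flag set inside a capturing group, as in the example above.
capturing <- grepl("((?i)a)A", x, perl = TRUE)

# Scoped, non-capturing form: the flag applies only inside (?i: ... ).
scoped <- grepl("(?i:a)A", x, perl = TRUE)

print(identical(capturing, scoped))  # TRUE
```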

MichaelChirico commented 1 year ago

It looks like we can chain the modes, but only in perl=TRUE:

grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), perl = FALSE)
# Error in grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"),  : 
#   invalid regular expression '(?i)(?m)a.a(?-m)(?-i)', reason 'Invalid regexp'
# In addition: Warning message:
# In grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"),  :
#   TRE pattern compilation error 'Invalid regexp'
grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), perl = TRUE)
# [1] FALSE FALSE  TRUE  TRUE

So maybe the simplest implementation is (imitating rex:::group()):

ignore_case <- function(...) p("(?i)", p(escape_dots(...)), "(?-i)")
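
One thing to watch with a trailing (?-i): it switches case-insensitivity off even if an enclosing pattern had already turned it on. The sketch below illustrates this in base R. Both helper names are hypothetical, paste0() stands in for rex's internal p(), and no escaping is done, so the arguments are assumed to be valid regex fragments already:

```r
# Toggle-style variant, mirroring the proposal above (hypothetical name).
ignore_case_toggle <- function(...) paste0("(?i)", ..., "(?-i)")

# Scoped alternative using PCRE's (?i: ... ) group (hypothetical name).
ignore_case_scoped <- function(...) paste0("(?i:", ..., ")")

# An outer pattern that is already case-insensitive.
outer_toggle <- paste0("(?i)x", ignore_case_toggle("a"), "b")
outer_scoped <- paste0("(?i)x", ignore_case_scoped("a"), "b")

# With the toggle, the trailing "b" has become case-sensitive again.
res_toggle <- grepl(outer_toggle, "XAB", perl = TRUE)
# With the scoped group, the outer setting is left untouched.
res_scoped <- grepl(outer_scoped, "XAB", perl = TRUE)

print(c(toggle = res_toggle, scoped = res_scoped))  # FALSE TRUE
```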

Alternatively (or perhaps additionally), we could unify an expression for modes:

regex_mode <- function(mode = c("ignore_case", "multiline", "single_line", "extended", "ungreedy"), ...) {
  mode <- unique(match.arg(mode, several.ok = TRUE))
  modes <- p(c(ignore_case = "i", multiline = "m", single_line = "s", extended = "x", ungreedy = "U")[mode])
  p("(?", modes, ")", p(escape_dots(...)), "(?-", modes, ")")
}
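
To try the idea without rex internals, here is a base-R sketch of the same function. The name regex_mode_sketch is hypothetical, the pattern is taken as a plain string instead of going through escape_dots(), and paste(collapse = "") makes the flag concatenation explicit:

```r
regex_mode_sketch <- function(pattern, mode) {
  flags <- c(ignore_case = "i", multiline = "m", single_line = "s",
             extended = "x", ungreedy = "U")
  # Validate the requested modes against the known names.
  mode <- unique(match.arg(mode, names(flags), several.ok = TRUE))
  modes <- paste(flags[mode], collapse = "")
  # Switch the modes on before the pattern and off again after it.
  paste0("(?", modes, ")", pattern, "(?-", modes, ")")
}

pat <- regex_mode_sketch("a.b", c("ignore_case", "single_line"))
print(pat)                              # "(?is)a.b(?-is)"
print(grepl(pat, "A\nB", perl = TRUE))  # TRUE
```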

Or some other design that allows toggling modes on/off, like start_mode(c("ignore_case", "extended")) then end_mode("extended")...
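
A rough sketch of what that toggle-style design could look like in base R. start_mode() and end_mode() do not exist in rex; the definitions here are assumptions for illustration only and skip validation of the mode names:

```r
mode_flags <- c(ignore_case = "i", multiline = "m", single_line = "s",
                extended = "x", ungreedy = "U")

# Hypothetical helpers: emit the inline flag that turns modes on or off.
start_mode <- function(mode) paste0("(?", paste(mode_flags[mode], collapse = ""), ")")
end_mode   <- function(mode) paste0("(?-", paste(mode_flags[mode], collapse = ""), ")")

# Extended mode ignores the space in "a b"; after end_mode("extended")
# the space in " c" is literal again, while ignore_case stays on.
pattern <- paste0(start_mode(c("ignore_case", "extended")), "a b",
                  end_mode("extended"), " c")

print(pattern)                                        # "(?ix)a b(?-x) c"
print(grepl(pattern, c("AB C", "A B C"), perl = TRUE)) # TRUE FALSE
```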

MichaelChirico commented 1 year ago

Hmm, I see we have access to these through matches(options = ) already:

https://github.com/r-lib/rex/blob/7148a0cb35793b421dc7a9a7f5534892241d7ae4/R/match.R#L145-L151

So we just need an interface to apply this directly to the regex, since we won't always be executing with matches(). But we should be consistent with the existing interface.