r-lib / rex

Friendly regular expressions for R.
https://rex.r-lib.org

Wrong character class created from negated character classes #53

Open fangly opened 7 years ago

fangly commented 7 years ago

Hi,

I run into problems in rex 1.1.1 when I create a new character class that combines several existing character classes, one or more of which are negated.

  1. Creating a character class that contains negated and non-negated classes

    rx <- rex(one_of(digit, non_letter))  # [[:digit:]^[:alpha:]]
    re_matches("1", rx)  # TRUE
    re_matches("*", rx)  # FALSE (unexpected!)
    re_matches("a", rx)  # TRUE  (unexpected!)

    As far as I know, a single character class cannot mix negated and non-negated classes at all. The caret "^" only triggers a negation when it comes directly after the opening bracket "["; anywhere else in the class it is a literal caret, which is why "a" matches and "*" does not. I think combining negated and non-negated character classes should raise an error, with a message suggesting an alternative. In the example above, an alternative would be:

    rx <- rex(or(digit, non_letter))  # (?:[[:digit:]]|[^[:alpha:]])

  2. Creating a character class combining only negated classes

    rex(one_of(non_digit, non_lower))  # [^[:digit:]^[:lower:]]

    However, the resulting regular expression should be "[^[:digit:][:lower:]]". Although the generated expression seems to work as intended, it would be safer to correct it.
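The behaviour in both cases can be checked without rex. The following is a minimal base-R sketch, assuming the generated patterns are exactly the ones shown in the comments above and that they are matched with PCRE (`perl = TRUE`). The variable names are illustrative only.

```r
# Patterns copied from the comments in the two cases above.
mixed      <- "[[:digit:]^[:alpha:]]"         # one_of(digit, non_letter)
alternated <- "(?:[[:digit:]]|[^[:alpha:]])"  # or(digit, non_letter)
generated  <- "[^[:digit:]^[:lower:]]"        # one_of(non_digit, non_lower)
proposed   <- "[^[:digit:][:lower:]]"         # suggested correction

inputs <- c("1", "*", "a", "^")

# Case 1: the inner "^" is a literal caret, so the class matches
# digits, letters and "^" itself, but not "*".
mixed_hits      <- grepl(mixed, inputs, perl = TRUE)       # TRUE FALSE TRUE TRUE
alternated_hits <- grepl(alternated, inputs, perl = TRUE)  # TRUE TRUE FALSE TRUE

# Case 2: only the leading "^" negates; the second one adds a
# literal caret to the set of excluded characters.
generated_hits <- grepl(generated, inputs, perl = TRUE)    # FALSE TRUE FALSE FALSE
proposed_hits  <- grepl(proposed, inputs, perl = TRUE)     # FALSE TRUE FALSE TRUE

print(data.frame(input = inputs, mixed = mixed_hits, alternated = alternated_hits,
                 generated = generated_hits, proposed = proposed_hits))
```

The only input on which the generated and proposed patterns of case 2 disagree is the literal caret, which may be why the generated expression appears to work in practice.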

Cheers, Florent

jimhester commented 7 years ago

Rex makes no attempt to verify that a given regular expression is actually valid; I am pretty confident there are plenty of ways to construct an invalid regular expression with it.

I agree it would be nice if such constructs threw an error, but I am not sure it is worth complicating the implementation to support it.

fangly commented 7 years ago

You are right on all fronts, Jim. And in fact, I got errors from PCRE in some instances (malformed regular expressions). What I find problematic is a wrong result without any warning or error!

Based on my non-exhaustive knowledge of the rex package, I would suggest that character class functions like one_of(...), any_of(...) or none_of(...) accept only a single argument, since correctness cannot be ensured when several classes are combined (and is too difficult to implement reliably for all cases). Users could still construct complex character classes manually using character_class(), or combine existing character classes with quantifiers and boolean operations like maybe(), zero_or_more(), or() and not().
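To illustrate the kind of check discussed in this thread, here is a rough base-R sketch. `build_class()` is a hypothetical helper, not part of rex and not how rex works internally. It assumes each class is given as a POSIX class name plus a flag saying whether it is negated, refuses mixed input, and for all-negated input emits the single-caret form suggested in the original report.

```r
# Hypothetical helper (not a rex function): build one bracket expression
# from POSIX class names, or fail loudly when the result would be wrong.
build_class <- function(classes, negated) {
  stopifnot(length(classes) >= 1, length(classes) == length(negated))

  # A bracket expression is either entirely negated or not negated at all.
  if (any(negated) && !all(negated)) {
    stop("cannot mix negated and non-negated classes in one character class; ",
         "use alternation instead", call. = FALSE)
  }

  body <- paste0("[:", classes, ":]", collapse = "")
  if (all(negated)) paste0("[^", body, "]") else paste0("[", body, "]")
}

negated_only <- build_class(c("digit", "lower"), c(TRUE, TRUE))
plain        <- build_class(c("digit", "alpha"), c(FALSE, FALSE))

# Mixed input raises an error instead of silently returning a wrong pattern.
mixed_result <- try(build_class(c("digit", "alpha"), c(FALSE, TRUE)), silent = TRUE)
```

Here `negated_only` is "[^[:digit:][:lower:]]" and `plain` is "[[:digit:][:alpha:]]", while the mixed call fails with a message pointing to alternation. Whether a check like this is worth the added complexity in rex itself is the open question in this issue.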