FileCheck regex should support common regex escape sequences

cjdb commented 10 months ago

clang\d is a pattern that's recognised by many regex engines to mean clang[0-9], but FileCheck doesn't seem to recognise it. It would be good to have FileCheck recognise the following patterns:

\f, \n, \r, \t, \v: usual escape sequences
\b: matches a word boundary
\d: equivalent to [0-9]
\s: equivalent to [ \f\n\r\t\v]
\w: equivalent to [A-Za-z0-9_]
\B: inverse of \b
\S: inverse of \s
\D: inverse of \d
\W: inverse of \w

The above are good for matching ASCII characters, but don't scale for anything that's outside of ASCII. If we're to add this feature, I think it would be good to produce a design that incorporates Unicode code points as well.
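The equivalences in the list above can be checked against an engine that already has these shorthands. Here is a minimal sketch using Python's `re` module, chosen only for illustration since FileCheck does not use it. The names `EQUIVALENTS` and `agrees_on_ascii` are made up for this example. The last few lines show the Unicode concern: without the `re.ASCII` flag, Python's `\d` also accepts decimal digits outside ASCII.

```python
import re

# Shorthand escape -> explicit ASCII bracket expression, as listed above.
EQUIVALENTS = {
    r"\d": r"[0-9]",
    r"\s": r"[ \f\n\r\t\v]",
    r"\w": r"[A-Za-z0-9_]",
    r"\D": r"[^0-9]",
    r"\S": r"[^ \f\n\r\t\v]",
    r"\W": r"[^A-Za-z0-9_]",
}


def agrees_on_ascii(shorthand, explicit):
    """Return True if both patterns accept the same set of ASCII characters."""
    a = re.compile(shorthand, re.ASCII)  # restrict the shorthand to ASCII
    b = re.compile(explicit)
    return all(
        bool(a.fullmatch(chr(c))) == bool(b.fullmatch(chr(c)))
        for c in range(128)
    )


# Every shorthand agrees with its explicit form over the ASCII range.
all_agree = all(agrees_on_ascii(s, e) for s, e in EQUIVALENTS.items())

# The motivating pattern from this issue.
clang_hit = re.fullmatch(r"clang\d", "clang7") is not None

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside ASCII.
arabic_three = "\u0663"
unicode_hit = re.fullmatch(r"\d", arabic_three) is not None
ascii_hit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
```

Under these assumptions, `all_agree`, `clang_hit`, and `unicode_hit` are true while `ascii_hit` is false, so any design would need to state which of the two `\d` behaviours it intends.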

asl commented 10 months ago

The regex implementation available in lib/Support seems to support only POSIX-style regexes, so one could use [[:digit:]] instead of \d.

Here is the list of supported classes: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/regcomp.c#L58

cjdb commented 9 months ago

[:digit:] is longer than [0-9], which is itself less readable than \d. Further, these escapes are accepted by a large variety of regex engines, and it was surprising to learn that FileCheck doesn't support them (I spent a couple of hours debugging before swapping out \d with [0-9]).

If POSIX regex doesn't support this, then we should consider expanding to a style that supports both [:digit:] and \d.
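One way to accept both styles without changing the underlying POSIX engine would be to rewrite the shorthands into bracket expressions before the pattern is compiled. The following is a minimal sketch of that idea in Python, not a description of how FileCheck works. `SHORTHAND`, `to_posix`, and `_bracket_end` are hypothetical names, and the sketch assumes the C locale, where [:alnum:] means A-Za-z0-9. It leaves \b and \B alone, since standard POSIX bracket expressions have no equivalent, and it does not touch anything inside an existing bracket expression.

```python
# Hypothetical mapping from shorthand letters to POSIX bracket expressions.
SHORTHAND = {
    "d": "[[:digit:]]", "D": "[^[:digit:]]",
    "s": "[[:space:]]", "S": "[^[:space:]]",
    "w": "[[:alnum:]_]", "W": "[^[:alnum:]_]",
}


def _bracket_end(pattern, start):
    """Return the index just past the bracket expression opening at `start`."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1
    if i < len(pattern) and pattern[i] == "]":  # a leading ']' is a literal
        i += 1
    while i < len(pattern) and pattern[i] != "]":
        if pattern.startswith("[:", i):  # skip a nested class such as [:digit:]
            close = pattern.find(":]", i + 2)
            i = close + 2 if close != -1 else len(pattern)
        else:
            i += 1
    return min(i + 1, len(pattern))


def to_posix(pattern):
    """Rewrite \\d, \\s, \\w (and their inverses) outside bracket expressions."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # Unknown escapes (for example \. or \b) are copied unchanged.
            out.append(SHORTHAND.get(nxt, ch + nxt))
            i += 2
        elif ch == "[":
            end = _bracket_end(pattern, i)
            out.append(pattern[i:end])  # existing POSIX syntax passes through
            i = end
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this sketch, `to_posix(r"clang\d")` gives `clang[[:digit:]]`, while a pattern already written as `[[:digit:]]` is returned unchanged, so both spellings could coexist in the same check line.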

llvm / llvm-project

FileCheck regex should support common regex escape sequences #78066