hoaproject / Regex

The Hoa\Regex library.
https://hoa-project.net/
310 stars 17 forks source link

Support internal options setting #10

Closed Hywan closed 9 years ago

Hywan commented 9 years ago

See http://pcre.org/pcre.txt. Quoting:

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED options (which are Perl-compatible) can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are

i for PCRE_CASELESS m for PCRE_MULTILINE s for PCRE_DOTALL x for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also possi- ble to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter appears both before and after the hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be changed in the same way as the Perl-compatible options by using the characters J, U and X respectively.

When one of these option changes occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows. If the change is placed right at the start of a pattern, PCRE extracts it into the global options (and it will there- fore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern (see below for a description of subpatterns) affects only that part of the subpattern that follows it, so

(a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example,

(a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the application when the compiling or matching functions are called. In some cases the pattern can contain special leading sequences such as (_CRLF) to override what the application has set or what has been defaulted. Details are given in the section entitled "Newline sequences" above. There are also the (_UTF8), (_UTF16),(_UTF32), and (_UCP) leading sequences that can be used to set UTF and Unicode prop- erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP options, respectively. The (_UTF) sequence is a generic version that can be used with any of the libraries. How- ever, the application can set the PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.