joddie / pcre2el

convert between PCRE, Emacs and rx regexp syntax
GNU General Public License v3.0

** Overview

=pcre2el= or =rxt= (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

** Usage

Enable =rxt-mode= or its global equivalent =rxt-global-mode= to get the default key-bindings. There are three sets of commands: commands that take a PCRE regexp, commands which take an Emacs regexp, and commands that try to do the right thing based on the current mode. Currently, this means Emacs syntax in =emacs-lisp-mode= and =lisp-interaction-mode=, and PCRE syntax everywhere else.
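
A minimal init-file sketch for enabling the key bindings described above. It assumes the package is already installed and on your =load-path=; the choice of =emacs-lisp-mode-hook= in the second variant is only an illustration.

#+BEGIN_SRC emacs-lisp
;; Assumption: pcre2el is installed and can be found on `load-path'.
(require 'pcre2el)

;; Variant 1: enable the C-c / bindings in every buffer.
(rxt-global-mode 1)

;; Variant 2 (use instead of variant 1): enable them only in
;; selected modes, here Emacs Lisp buffers as an example.
;; (add-hook 'emacs-lisp-mode-hook #'rxt-mode)
#+END_SRC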

The default key bindings all begin with =C-c /= and have a mnemonic structure: =C-c / <source> <target>=, or just =C-c / <target>= for the "do what I mean" commands. The complete list of key bindings is given here and explained in more detail below:

*** Interactive input and output

When used interactively, the conversion commands can read a regexp either from the current buffer or from the minibuffer. The output is displayed in the minibuffer and copied to the kill-ring.

- When called with a prefix argument (=C-u=), they read a regular
  expression from the minibuffer literally, without further
  processing -- meaning there's no need to double the backslashes if
  it's an Emacs regexp.  This is the same way commands like
  =query-replace-regexp= read input.

- When the region is active, they use the region contents,
  again literally (without any translation of string syntax).

- With neither a prefix arg nor an active region, the behavior
  depends on whether the command expects an Emacs regexp or
  a PCRE one.

  Commands that take an Emacs regexp behave like =C-x C-e=: they
  evaluate the sexp before point (which could be simply a string
  literal) and use its value. This is designed for use in Elisp
  buffers. As a special case, if point is *inside* a string, it's
  first moved to the string end, so in practice they should work
  as long as point is somewhere within the regexp literal.

  Commands that take a PCRE regexp try to read a Perl-style
  delimited regex literal *after* point in the current buffer,
  including its flags. For example, putting point before the =m=
  in the following example and doing =C-c / p e=
  (=rxt-pcre-to-elisp=) displays =\(?:bar\|foo\)=, correctly
  stripping out the whitespace and comment:

  : $x =~ m/  foo   |  (?# comment) bar /x

  The PCRE reader currently only works with =/ ... /= delimiters. It
  will ignore any preceding =m=, =s=, or =qr= operator, as well as
  the replacement part of an =s= construction.

  Readers for other PCRE-using languages are on the TODO list.

The translation functions display their result in the minibuffer
and copy it to the kill ring. When translating something into
Elisp syntax, you might need to use the result either literally
(e.g. for interactive input to a command like
=query-replace-regexp=), or as a string to paste into Lisp code.
To allow both uses, =rxt-pcre-to-elisp= copies both versions
successively to the kill-ring. The literal regexp without string
quoting is the top element of the kill-ring, while the Lisp string
is the second-from-top. You can paste the literal regexp somewhere
by doing =C-y=, or the Lisp string by =C-y M-y=.
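
The difference between the two kill-ring entries can be reproduced with built-in functions only. The sketch below uses a hypothetical helper, =my/regexp-both-forms=, which is not part of =pcre2el=; it only illustrates what "literal" and "Lisp string" mean here and does not claim to be how the package builds them.

#+BEGIN_SRC emacs-lisp
;; Hypothetical helper, not part of pcre2el.
(defun my/regexp-both-forms (regexp)
  "Return a list (LITERAL LISP-STRING) for REGEXP.
LITERAL is the regexp as you would type it at a prompt such as
`query-replace-regexp'.  LISP-STRING is the same regexp printed
as a Lisp string literal, with quotes and doubled backslashes."
  (list regexp (prin1-to-string regexp)))

;; The regexp \(?:bar\|foo\) has to be written with doubled
;; backslashes here because this is Lisp source code.
(my/regexp-both-forms "\\(?:bar\\|foo\\)")
#+END_SRC

The first element is what you would paste at an interactive prompt; the second, including its surrounding double quotes, is what you would paste into Lisp code.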

*** Syntax conversion commands

=rxt-convert-syntax= (=C-c / c=) converts between Emacs and PCRE syntax, depending on the major mode in effect when called. Alternatively, you can specify the conversion direction explicitly by using either =rxt-pcre-to-elisp= (=C-c / p e=) or =rxt-elisp-to-pcre= (=C-c / e p=).

Similarly, =rxt-convert-to-rx= (=C-c / x=) converts either kind of
syntax to =rx= form, while =rxt-convert-pcre-to-rx= (=C-c / p x=)
and =rxt-convert-elisp-to-rx= (=C-c / e x=) convert to =rx= from a
specified source type.

In Elisp buffers, you can use =rxt-toggle-elisp-rx= (=C-c / t= or
=C-c / e t=) to switch the regexp at point back and forth between
string and =rx= syntax. Point should either be within an =rx= or
=rx-to-string= form or a string literal for this to work.

*** PCRE mode (experimental)

If you want to use emulated PCRE regexp syntax in all Emacs commands, try =pcre-mode=, which uses Emacs's advice system to make all commands that read regexps using the minibuffer use emulated PCRE syntax. It should also work with Isearch.

This feature is still fairly experimental.  It may fail to work or
do the wrong thing with certain commands.  Please report bugs.

=pcre-query-replace-regexp= was originally defined to do
query-replace using emulated PCRE regexps, and is now made
somewhat obsolete by =pcre-mode=.  It is bound to =C-c / %= by
default, by analogy with =M-%=.  Put the following in your
=.emacs= if you want to use PCRE-style query replacement
everywhere:

: (global-set-key [(meta %)] 'pcre-query-replace-regexp)

*** Explain regexps

When syntax-highlighting isn't enough to untangle some gnarly regexp you find in the wild, try the 'explain' commands: =rxt-explain= (=C-c / /=), =rxt-explain-pcre= (=C-c / p=) and =rxt-explain-elisp= (=C-c / e=). These display the original regexp along with its pretty-printed =rx= equivalent in a new buffer. Moving point around either in the original regexp or the =rx= translation highlights corresponding pieces of syntax, which can aid in seeing things like the scope of quantifiers.

I call them "explain" commands because the =rx= form is close to a
plain syntax tree, and this plus the wordiness of the operators
usually helps to clarify what is going on.  People who dislike
Lisp syntax might disagree with this assessment.

*** Generate all matching strings (productions)

Occasionally you come across a regexp which is designed to match a finite set of strings, e.g. a set of keywords, and it would be useful to recover the original set. (In Emacs you can generate such regexps using =regexp-opt=.) The commands =rxt-convert-to-strings= (=C-c / '=), =rxt-pcre-to-strings= (=C-c / p '=) or =rxt-elisp-to-strings= (=C-c / e '=) accomplish this by generating all the matching strings ("productions") of a regexp. (The productions are copied to the kill ring as a Lisp list.)

An example in Lisp code:

: (regexp-opt '("cat" "caterpillar" "catatonic"))
: ;; => "\\(?:cat\\(?:atonic\\|erpillar\\)?\\)"
: (rxt-elisp-to-strings "\\(?:cat\\(?:atonic\\|erpillar\\)?\\)")
: ;; => '("cat" "caterpillar" "catatonic")

For obvious reasons, these commands only work with regexps that
don't include any unbounded quantifiers like =+= or =*=. They also
can't enumerate all the characters that match a named character
class like =[[:alnum:]]=. In either case they will give a (hopefully
meaningful) error message. Due to the nature of permutations, it's
still possible for a finite regexp to generate a huge number of
productions, which will eat memory and slow down your Emacs. Be
ready with =C-g= if necessary.
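
To get a feel for how quickly a finite regexp can blow up, the following sketch counts productions by multiplying the number of choices at each position. =my/production-count= is a hypothetical helper written for this illustration; it is not part of =pcre2el= and does not parse regexps.

#+BEGIN_SRC emacs-lisp
;; Hypothetical helper, not part of pcre2el.
(defun my/production-count (choices-per-position)
  "Return the number of strings a sequence of independent choices can produce.
CHOICES-PER-POSITION is a list of integers, one per position in
the regexp, giving how many alternatives are possible there."
  (apply #'* choices-per-position))

;; A lowercase letter repeated exactly five times: 26 choices at
;; each of 5 positions, with no unbounded quantifier involved.
(my/production-count (make-list 5 26))   ; 26^5 = 11881376
#+END_SRC

Five positions are already enough for nearly twelve million strings, so a short regexp is no guarantee of a short list.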

*** RE-Builder support

The Emacs RE-Builder is a useful visual tool which allows using several different built-in syntaxes via =reb-change-syntax= (=C-c TAB=). It supports Elisp read and literal syntax and =rx=, but it can only convert from the symbolic forms to Elisp, not the other way. This package hacks the RE-Builder to also work with emulated PCRE syntax, and to convert transparently between Elisp, PCRE and rx syntaxes. PCRE mode reads a delimited Perl-like literal of the form =/ ... /=, and it should correctly support using the =x= and =s= flags.

*** Use from Lisp

Example of using the conversion functions:

: (rxt-pcre-to-elisp "(abc|def)\\w+\\d+")
: ;; => "\\(\\(?:abc\\|def\\)\\)[_[:alnum:]]+[[:digit:]]+"

All the conversion functions take a single string argument, the regexp to translate:

** Bugs and Limitations

*** Limitations on PCRE syntax

PCRE has a complicated syntax and semantics, only some of which can be translated into Elisp. The following subset of PCRE should be correctly parsed and converted:

- parenthesis grouping =( .. )=, including shy matches =(?: ... )=
- backreferences (various syntaxes), but only up to 9 per expression    
- alternation =|=
- greedy and non-greedy quantifiers =*=, =*?=, =+=, =+?=, =?= and =??=
  (all of which are the same in Elisp as in PCRE)
- numerical quantifiers ={M,N}=
- beginning/end of string =\A=, =\Z=
- string quoting =\Q .. \E=
- word boundaries =\b=, =\B= (these are the same in Elisp)
- single character escapes =\a=, =\c=, =\e=, =\f=, =\n=, =\r=,
  =\t=, =\x=, and =\octal digits= (but see below about non-ASCII
  characters)
- character classes =[...]= including Posix escapes
- character classes =\d=, =\D=, =\h=, =\H=, =\s=, =\S=, =\v=, =\V=
  both within character class brackets and outside
- word and non-word characters =\w= and =\W=
  (Emacs has the same syntax, but its meaning is different)
- =s= (single line) and =x= (extended syntax) flags, in regexp
  literals, or set within the expression via =(?xs-xs)= or =(?xs-xs:
  .... )= syntax
- comments =(?# ... )=

Most of the more esoteric PCRE features can't really be supported
by simple translation to Elisp regexps. These include the
different lookaround assertions, conditionals, and the
"backtracking control verbs" =(* ...)=. On the other hand, there are a few
other syntaxes which are currently unsupported and possibly could be:

- =\L=, =\U=, =\l=, =\u= case modifiers
- =\g{...}= backreferences

*** Other limitations

*** TODO:

** Internal details

Internally, =rxt= defines an abstract syntax tree data type for regular expressions, parsers for Elisp and PCRE syntax, and "unparsers" to PCRE, rx, and SRE syntax. Converting from a parsed syntax tree to Elisp syntax is a two-step process: first convert to =rx= form, then let =rx-to-string= do the heavy lifting. See =rxt-parse-re=, =rxt-adt->pcre=, =rxt-adt->rx=, and =rxt-adt->sre=, and the section beginning "Regexp ADT" in pcre2el.el for details.
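
The second step of that pipeline needs nothing beyond Emacs's built-in =rx= library, so it can be tried in isolation. In the sketch below the =rx= form is written by hand as a stand-in for what the first step would produce; the names =my/rx-form-to-regexp= and =my/example-regexp= are hypothetical and not part of =pcre2el=.

#+BEGIN_SRC emacs-lisp
(require 'rx)

;; Hypothetical helper, not part of pcre2el.
(defun my/rx-form-to-regexp (form)
  "Convert the rx FORM to an Emacs regexp string."
  ;; The second argument (NO-GROUP) asks `rx-to-string' not to wrap
  ;; the whole result in an extra shy group.
  (rx-to-string form t))

;; A hand-written rx form: "abc" or "def" followed by one or more digits.
(defvar my/example-regexp
  (my/rx-form-to-regexp '(seq (or "abc" "def") (+ digit)))
  "Regexp string built from a hand-written rx form.")

(string-match-p my/example-regexp "def42")   ; non-nil: the regexp matches
#+END_SRC

Leaning on =rx-to-string= this way means the quoting and grouping rules of Emacs regexp syntax are handled by Emacs itself.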

This code is partially based on Olin Shivers' reference SRE implementation in scsh, although it is simplified in some respects and extended in others. See =scsh/re.scm=, =scsh/spencer.scm= and =scsh/posixstr.scm= in the =scsh= source tree for details. In particular, =pcre2el= steals the idea of an abstract data type for regular expressions and the general structure of the string regexp parser and unparser. The data types for character sets are extended in order to support symbolic translation between character set expressions without assuming a small (Latin1) character set. The string parser is also extended to parse a bigger variety of constructions, including POSIX character classes and various Emacs and Perl regexp assertions. Otherwise, only the bare minimum of scsh's abstract data type is implemented.

** Soapbox

Emacs regexps have their annoyances, but it is worth getting used to them. The Emacs assertions for word boundaries, symbol boundaries, and syntax classes depending on the syntax of the mode in effect are especially useful. (PCRE has =\b= for word-boundary, but AFAIK it doesn't have separate assertions for beginning-of-word and end-of-word.) Other things that might be done with huge regexps in other languages can be expressed more understandably in Elisp using combinations of =save-excursion= with the various searches (regexp, literal, skip-syntax-forward, sexp-movement functions, etc.).

There's not much point in using =rxt-pcre-to-elisp= to use PCRE notation in a Lisp program you're going to maintain, since you still have to double all the backslashes. Better to just use the converted result (or better yet, the =rx= form).

** History and acknowledgments

This was originally created out of an answer to a Stack Overflow question: http://stackoverflow.com/questions/9118183/elisp-mechanism-for-converting-pcre-regexps-to-emacs-regexps

Thanks to: