Open k-takata opened 7 years ago
First, I'd like to translate the important points of his paper very very roughly.
Complement sets are useful to express C-style comments, CR LF terminated lines, Python's """...""" strings, HTTP header terminated by CR LF CR LF, SMTP body terminated by CR LF . CR LF, etc. These can be expressed with the complement set of (any string)(terminator)(any string). However, it it not easy to construct a regex which matches such set. This proposes the absent operator which is easy to write and fits well to regular language theory.
Regular expressions consist of:
e
: empty string
c
: character
rr
: concatenation
r|r
: alternation
r*
: Kleene closure
A regular expressions engine can be implemented with DFA or backtracking. Script languages nowadays like Perl, Python and Ruby uses backtracking, because it is not easy to implement capturing with DFA. He implemented a basic backtracking engine using Ruby.
3.1. Abstract Syntax Tree of regular expressions
3.2. A basic regular expressions engine
A basic implementation by Ruby.
# re: AST of a regex
# str: An array of characters
# pos: Start position
# block: A block executed when it is matched
# (Callback)
def try(re, str, pos, &block)
If a regex engine uses DFA, complement sets can be handled easily. However it is not easy for backtracking engine. (See 7.3 for detail.)
(Note: Onigmo actually uses (?~r) instead of !r, because !r is not compatible with current syntax.) The language generated by !r can be defined as the following:
L(!r) = \Sigma * - L(.*r.*)
L(!r)
is a complement set of L(.*r.*)
.
This is also described in pp.26-27 of his slide: http://www.a-k-r.org/pub/prosym49-akr-presen.pdf See also the next comment.
7.1. Easy to write
C-style comments can be expressed with the following:
ab[^b]*b+(([^ba][^b]*)b+)*a
("a" and "b" are used here instead of "/" and "*" to reduce the complexity of escaping.)
If the absent operator is used, it can be:
ab!(ba)ba
7.2. Fit with regular language theory
7.2.1. Repetition of lazy match
ab.*?ba
This doesn't work well when concatenated with other regex. E.g.
ab.*?ba\n
This wrongly matches "/*2*/3/*4*/\n"
.
However,
ab!(ba)ba\n
this correctly matches "/*4*/\n"
.
7.2.2. No backtracking
ab(?>.*?ba)
This works well when concatenated with "\n"
:
ab(?>.*?ba)\n
However this hardly depends on the strategy of backtracking, and it doesn't have theoretical backgrounds. E.g. it cannot be converted to a DFA.
7.2.3. Negative lookahead
ab((?!ba).)*ba
This works well.
However, it is not possible to define a language L((?!r))
which is generated
by (?!r)
.
Note: I'm not sure that the negative lookahead is regular or not. There is a paper (in Japanese) which says that it is regular.
7.3. Inefficiency of complement sets
It is possible to implement a negation operator which matches the complement set of r. This is, however, inefficient to implement on a backtracking engine. The absent operator is much efficient than the negation operator.
8.1. Ragel
Ragel has the following operators:
!r
This is the same as the complement set operator discussed in 7.3.r1-r2
This is the same as the complement set operator when r1 generates all the possible strings.r1--r2
This is the same as the absent operator when r1 generates all the possible strings.8.2. Perl
8.3. Grail
And partial translation of his slide.
p.26 Behavior of the absent operator
Search - not match
Move start pos and search
Move again and search
Found
Move again and search
Found another one
Move and search
No need to search after this
The range which matches are not included.
p.27 Behavior of the absent operator
BTW, should the name of the operator be "absence operator" instead of "absent operator"?
Good work! I'd help translate them if needed.
BTW, should the name of the operator be "absence operator" instead of "absent operator"?
I think so. "absent operator" is a confusing name, because it could just as well refer to an operator that is not there. "absence operator" does not have this problem.
On the other hand, would you say that it is also a group? In the docs you have put it under "7. Extended groups". If so, it might be more consistent to use the adjective form, as with "passive" or "atomic". I'd still choose clarity over consistency, though.
Thank you for the explanation.
would you say that it is also a group?
In Akira's paper, !
is used for the operator. However, I decided to use (?~...)
in Onigmo, and yes, it is also a group (unlike !
).
So "absence operator" or "absence group" seems better than "absent operator".
I have slightly updated the document: 791140951eefcf17db4e762e789eb046ea8a114c I changed the name of the operator to "absence operator". (PR for fixing document (including grammar or typo) is always welcome.)
(Upvoting all comments is meaningless, I think...)
After the release of Ruby 2.4.1, some blog posts about the absent operator were written, and they caused some discussions.
Blog posts:
Discussions:
Unfortunately, some people might not understand the advantage of the absent operator. Maybe one of the reason is that the original paper by Tanaka Akira is written in Japanese. Another reason would be that the document of the operator in Onigmo is not enough.