k-takata / Onigmo

Onigmo is a regular expressions library forked from Oniguruma.
Other
620 stars 94 forks source link

Improve document about the absence operator #87

Open k-takata opened 7 years ago

k-takata commented 7 years ago

After the release of Ruby 2.4.1, some blog posts about the absent operator were written, and they caused some discussions.

Blog posts:

Discussions:

Unfortunately, some people might not understand the advantage of the absent operator. Maybe one of the reason is that the original paper by Tanaka Akira is written in Japanese. Another reason would be that the document of the operator in Onigmo is not enough.

k-takata commented 7 years ago

First, I'd like to translate the important points of his paper very very roughly.


  1. Introduction

Complement sets are useful to express C-style comments, CR LF terminated lines, Python's """...""" strings, HTTP header terminated by CR LF CR LF, SMTP body terminated by CR LF . CR LF, etc. These can be expressed with the complement set of (any string)(terminator)(any string). However, it it not easy to construct a regex which matches such set. This proposes the absent operator which is easy to write and fits well to regular language theory.

  1. Regular expressions

Regular expressions consist of: e: empty string c: character rr: concatenation r|r: alternation r*: Kleene closure

  1. Implementation of regular expressions engine

A regular expressions engine can be implemented with DFA or backtracking. Script languages nowadays like Perl, Python and Ruby uses backtracking, because it is not easy to implement capturing with DFA. He implemented a basic backtracking engine using Ruby.

3.1. Abstract Syntax Tree of regular expressions

3.2. A basic regular expressions engine

A basic implementation by Ruby.

# re: AST of a regex
# str: An array of characters
# pos: Start position
# block: A block executed when it is matched
#        (Callback)
def try(re, str, pos, &block)
  1. Proposal for the absent operator

If a regex engine uses DFA, complement sets can be handled easily. However it is not easy for backtracking engine. (See 7.3 for detail.)

  1. The absent operator !r

(Note: Onigmo actually uses (?~r) instead of !r, because !r is not compatible with current syntax.) The language generated by !r can be defined as the following:

L(!r) = \Sigma * - L(.*r.*)

L(!r) is a complement set of L(.*r.*).

  1. Implementation of the absent operator

This is also described in pp.26-27 of his slide: http://www.a-k-r.org/pub/prosym49-akr-presen.pdf See also the next comment.

  1. Advantage of the absent operator

7.1. Easy to write

C-style comments can be expressed with the following:

ab[^b]*b+(([^ba][^b]*)b+)*a

("a" and "b" are used here instead of "/" and "*" to reduce the complexity of escaping.)

If the absent operator is used, it can be:

ab!(ba)ba

7.2. Fit with regular language theory

7.2.1. Repetition of lazy match

ab.*?ba

This doesn't work well when concatenated with other regex. E.g.

ab.*?ba\n

This wrongly matches "/*2*/3/*4*/\n". However,

ab!(ba)ba\n

this correctly matches "/*4*/\n".

7.2.2. No backtracking

ab(?>.*?ba)

This works well when concatenated with "\n":

ab(?>.*?ba)\n

However this hardly depends on the strategy of backtracking, and it doesn't have theoretical backgrounds. E.g. it cannot be converted to a DFA.

7.2.3. Negative lookahead

ab((?!ba).)*ba

This works well. However, it is not possible to define a language L((?!r)) which is generated by (?!r).

Note: I'm not sure that the negative lookahead is regular or not. There is a paper (in Japanese) which says that it is regular.

7.3. Inefficiency of complement sets

It is possible to implement a negation operator which matches the complement set of r. This is, however, inefficient to implement on a backtracking engine. The absent operator is much efficient than the negation operator.

  1. Related researches

8.1. Ragel

Ragel has the following operators:

8.2. Perl

8.3. Grail

  1. Conclusions
k-takata commented 7 years ago

And partial translation of his slide.


p.26 Behavior of the absent operator

Search - not match
  Move start pos and search
    Move again and search
      Found
      Move again and search
        Found another one
      Move and search
        No need to search after this

The range which matches are not included.

p.27 Behavior of the absent operator

k-takata commented 7 years ago

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

hachi8833 commented 7 years ago

Good work! I'd help translate them if needed.

jaynetics commented 7 years ago

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

I think so. "absent operator" is a confusing name, because it could just as well refer to an operator that is not there. "absence operator" does not have this problem.

On the other hand, would you say that it is also a group? In the docs you have put it under "7. Extended groups". If so, it might be more consistent to use the adjective form, as with "passive" or "atomic". I'd still choose clarity over consistency, though.

k-takata commented 7 years ago

Thank you for the explanation.

would you say that it is also a group?

In Akira's paper, ! is used for the operator. However, I decided to use (?~...) in Onigmo, and yes, it is also a group (unlike !). So "absence operator" or "absence group" seems better than "absent operator".

k-takata commented 7 years ago

I have slightly updated the document: 791140951eefcf17db4e762e789eb046ea8a114c I changed the name of the operator to "absence operator". (PR for fixing document (including grammar or typo) is always welcome.)

k-takata commented 7 years ago

(Upvoting all comments is meaningless, I think...)