intel / hyperscan

High-performance regular expression matching library
https://www.hyperscan.io
Other
4.78k stars 712 forks source link

Bugs in patterns such as [A-Za-z]+ #152

Closed genivia-inc closed 5 years ago

genivia-inc commented 5 years ago

Here is a simple example showing that Hyperscan fails on patterns such as [A-Za-z]+ using the unmodified simplegrep example provided with the Hyperscan 5.1.0 distribution:

cd /usr/local/Cellar/hyperscan/5.1.0/share/doc/hyperscan/examples
cc -I/usr/local/Cellar/hyperscan/5.1.0/include/hs -L/usr/local/Cellar/hyperscan/5.1.0/lib -lhs simplegrep.c
./a.out '[A-Za-z]+' simplegrep.c
Scanning 8033 bytes with Hyperscan
Match for pattern "[A-Za-z]+" at offset 7
Match for pattern "[A-Za-z]+" at offset 8
Match for pattern "[A-Za-z]+" at offset 9
Match for pattern "[A-Za-z]+" at offset 10
Match for pattern "[A-Za-z]+" at offset 11
Match for pattern "[A-Za-z]+" at offset 12
Match for pattern "[A-Za-z]+" at offset 13
Match for pattern "[A-Za-z]+" at offset 14
Match for pattern "[A-Za-z]+" at offset 15
Match for pattern "[A-Za-z]+" at offset 18

But the file simplegrep.c starts with:

/*
 * Copyright (c) 2015, Intel Corporation
 *

Hyperscan's hs_scan is used in simplegrep.c works for search words like Hyperscan but does not work for [A-Za-z]+ and other patterns.

I tried to use Hyperscan as an engine for lexical analysis, see https://github.com/Genivia/RE-flex/issues/22

xiangwang1 commented 5 years ago

I think these results are what we will expect. First of all, Hyperscan doesn't support greedy and ungreedy syntax, but reports all matches.

If making simplegre.c as an input to match, then the beginning of input looks like "/\n Copy" by chaining newline (\n) and space characters with readable characters, the first match will be "C" at offset 7, the second match will be "o" at offset 8...

genivia-inc commented 5 years ago

I like Hyperscan. But why is the example called simplegrep when it returns overlapping matches? The sparse overview of Hyperscan states that its syntax supports most of PCRE2 and errors are reported for unsupported constructs. The Hypserscan docs state that quantifiers such as ?, *, and + are supported. This is misleading to the audience as this is not described in the README of the examples, i.e. it does not behave like grep at all.

Let me explain that I'm just trying to use Hyperscan as a regex engine, trying to find ways around its behavior to make it work like one, for example in the RE/flex scanner generator. That would be nice, because we are making RE/flex scanners the fastest as possible. Right now our performance results show that Hyperscan is actually quite a bit slower so work has to be done to 1) use Hyperscan as a ERE-compliant regex engine (preferably ERE POSIX) and 2) improve its speed if possible.

xiangwang1 commented 5 years ago

Sorry for the confusion. We have a dedicated section called semantics which explains the differences at http://intel.github.io/hyperscan/dev-reference/compilation.html#pattern-support.

Hyperscan's performance is heavily tuned for network security use cases. The internal design principle of avoiding backtracking for performance considerations makes Hyperscan lack a compatible pattern support. The performance depends on the characteristics of both patterns and input. In your specific lexer testing case, the patterns, consisting of character classes and very short literals, generate a very high matching rate for Hyperscan on the cpp source file, which could impose extra overhead for Hyperscan.

genivia-inc commented 5 years ago

OK, makes sense. Thanks for the time explaining.

IMHO the section about semantics should be moved up to the introduction so it is clear what to expect when using Hyperscan, for what reason it differs to offer provides advantages for streaming matching i.e. to find all matches when searching blocks or streams. That is its main selling point!