TRE regex engine as replacement for DeelX

RaiKoHoff commented 6 years ago

DeelX has been used to replace Scintilla's simple internal regex engine. DeelX is a full-blown, mostly POSIX-compliant regex engine, but it does not report invalid regular expressions: it says "No match found" instead of "invalid expression". To overcome this problem, I suggest replacing the engine with TRE (https://laurikari.net/tre/).

On the beta channel, you will find an NP3 version with the new engine.

Remark: This engine is strictly POSIX compliant, which means that the EOL meta char $ only matches the empty string immediately before a newline (LF, \n, ASCII 10) and not before the MS Windows default line ending (CR LF, \r\n). This may lead to unexpected behavior: for example, replacing $ by X will insert X between CR and LF. You can see this by switching on "View->Show Line Endings". I asked the developer to add an option for enabling MS Windows line endings ... (https://github.com/laurikari/tre/issues/639).
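The effect described above can be reproduced with any engine whose $ anchors only before LF. A minimal sketch, using Python's built-in `re` module as a stand-in (an assumption made for illustration; TRE itself is a C library and is not called here):

```python
import re

# Text with a Windows line ending (CR LF) after the first line.
text = "ab\r\ncd"

# In multi-line mode, Python's `$` matches just before "\n" and at the very
# end of the text, but never before "\r". Replacing `$` therefore inserts
# the X between CR and LF.
replaced = re.sub(r"$", "X", text, flags=re.MULTILINE)

# repr() makes CR and LF visible, similar to "Show Line Endings".
print(repr(replaced))
```

The X lands after the CR rather than before the whole CR LF pair, which is the surprising part for users on Windows.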

lhmouse commented 6 years ago

The CR LF issue is confirmed to occur in MSVC (used as a text editor) too.

RaiKoHoff commented 6 years ago

@lhmouse: You are right - I didn't test it in MS VS. So can we accept this POSIX-compliant behavior? (The DeelX regexp engine handles Windows line ends as a unit, unless they are specified as single-char matches.)

lhmouse commented 6 years ago

Well, if this isn't going to get fixed upstream, then I think it is barely satisfactory... It can be worked around: if the regex ends with a $ and we get a match whose last character is \r (care must be taken not to read past the beginning of the file), just drop it from the match string by decrementing the length.
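The workaround above could look roughly like this. It is only a sketch: `search_eol_aware` is a hypothetical helper, Python's `re` stands in for the real engine, and the check for a trailing $ is deliberately naive (it does not handle every escaping case).

```python
import re

def search_eol_aware(pattern: str, text: str, start: int = 0):
    """Return (begin, end) of the first match or None, trimming a trailing CR."""
    m = re.compile(pattern, re.MULTILINE).search(text, start)
    if m is None:
        return None
    begin, end = m.span()
    # Naive test for an unescaped `$` at the end of the pattern (assumption).
    ends_with_dollar = pattern.endswith("$") and not pattern.endswith("\\$")
    # `end > begin` guarantees that text[end - 1] belongs to the match, so we
    # never look in front of the match (or in front of the start of the file).
    if ends_with_dollar and end > begin and text[end - 1] == "\r":
        end -= 1  # drop the CR from the reported match
    return begin, end

print(search_eol_aware(r"a.*$", "abc\r\ndef"))
```

Note that a pattern such as a$ still finds nothing in bbaa\r\n with this approach, because the character right before LF is CR, not a. Trimming the match length only helps when the pattern itself can consume the CR.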

RaiKoHoff commented 6 years ago

@lhmouse: the first beta version (v.2.17.1114.672) is available; it uses your workaround for the "$ vs. CR LF" issue on MS Windows.

lhmouse commented 6 years ago

Which branch in your forked repo corresponds to this fixup? I think I can check it out and build it from scratch myself.

lhmouse commented 6 years ago

In addition, does regex search work reliably when searching for a$ in bbaa\r\n?

RaiKoHoff commented 6 years ago

New version (2.17.1114.674) available on the beta channel. @lhmouse: I fixed some issues around the $ and \r\n problem. Some (known) issues still exist:

^ will not match the beginning of the very first line, because there is no linefeed (LF) before the very first char, so there is no zero-length string to match.
$ will not match the end of the very last line (end of file), because there is no linefeed (LF) at the end, so there is no zero-length string before it to match.

My forked NP3 repo is: https://github.com/RaiKoHoff/Notepad3
The new regex engine development branch is: https://github.com/RaiKoHoff/Notepad3/tree/NewRegExEngine
Take a look at the regex Scintilla interfacing class: .../NewRegExEngine/scintilla/tre/TREgExprSearch.cxx
The (slightly adapted) copy of the TRE repo is at: https://github.com/RaiKoHoff/Notepad3/tree/NewRegExEngine/tre

lhmouse commented 6 years ago

^ does match the beginning of the first line here. o_O

RaiKoHoff commented 6 years ago

TRE has options to not match the beginning (BOS) or end (EOS) of the (partial) text being searched. I switched off these matches and tried to adjust the text start (and end) before calling the regex engine. Maybe it is a good idea to move that logic into the engine interface and set the mentioned options accordingly ... o_O
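In the POSIX regexec API, the options of this kind are the REG_NOTBOL and REG_NOTEOL flags. The underlying question is whether the boundary of a search range counts as a line start. A small sketch of that question, using Python's `re` as a stand-in for illustration (not the TRE interface):

```python
import re

text = "foo bar"
anchored = re.compile(r"^bar")

# Searching the full buffer from offset 4: `^` refers to the real start of
# `text`, so offset 4 does not count as a beginning and nothing matches.
in_full_buffer = anchored.search(text, 4)

# Searching a copy of the range: offset 0 of the slice is now the beginning,
# so `^` matches there.
in_slice = anchored.search(text[4:])

print(in_full_buffer, in_slice)
```

Whichever behavior is wanted, deciding it in one place (the engine interface) avoids adjusting start and end offsets at every call site.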

RaiKoHoff commented 6 years ago

After a major refactoring of the find/replace range/all code, I put a new version (v.2.17.1115.674) on the beta channel. Hopefully this removes all issues regarding the new regex engine (TRE). I also tested and fixed a lot of side effects, such as replace/replace-all/replace-in-selection of strings and meta chars (^, $), and hopefully found most or all of the possible pitfalls. :-/

@lhmouse : the Repo Branch (NewRegExEngine) should be up to date.

lhmouse commented 6 years ago

When the pattern is an invalid regex, the text box has a red background, which is expected behavior. However, when the Find Next button is clicked, the error message is still 'the specified text was not found'.

Other than that, it looks good to me.
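The distinction asked for above (invalid pattern vs. no match) comes down to checking the compile step separately from the search step. A minimal sketch, assuming Python's `re` as a stand-in engine; `find_next` and its message strings are hypothetical:

```python
import re

def find_next(pattern: str, text: str, start: int = 0) -> str:
    """Return a status message for a "Find Next" request."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Compilation failed: report the pattern, not the search result.
        return f"invalid regular expression: {exc}"
    m = compiled.search(text, start)
    if m is None:
        return "the specified text was not found"
    return f"match at {m.start()}-{m.end()}"

print(find_next("(ab", "abab"))
print(find_next("zz", "abab"))
```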

lhmouse commented 6 years ago

It looks like look-ahead ((?=PATTERN) and (?!PATTERN)) and look-behind ((?<=PATTERN) and (?<!PATTERN)) assertions are no longer supported, and neither are non-capturing groups ((?:PATTERN)). Backreferences (self-referencing regexes) work as expected (e.g. (ab)\1 will match abab).
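For reference, this is what the constructs mentioned above do in an engine that supports all of them. The sketch uses Python's `re` (a backtracking engine) purely as an illustration; it says nothing about TRE or DeelX internals.

```python
import re

# Look-ahead: match "foo" only when "bar" follows; "bar" is not consumed.
lookahead = re.search(r"foo(?=bar)", "foobar").group()

# Negative look-ahead: "foo" that is not followed by "bar".
neg_lookahead = re.search(r"foo(?!bar)", "foobar foobaz").start()

# Look-behind: match "bar" only when it is preceded by "foo".
lookbehind = re.search(r"(?<=foo)bar", "foobar").span()

# Non-capturing group: groups the repetition without creating a submatch.
noncapturing = re.search(r"(?:ab)+", "ababab").groups()

# Backreference: \1 must repeat whatever group 1 matched.
backref = re.search(r"(ab)\1", "xababx").group()

print(lookahead, neg_lookahead, lookbehind, noncapturing, backref)
```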

RaiKoHoff commented 6 years ago

@lhmouse: regarding the error message, in the latest version (v.2.17.1115.674) it should be:

Pressing Find Next|Previous, you should see:
Pressing In Selection (without any selection), you should see: The question is whether the "invalid regex" message should be triggered before this "invalid selection" message.
Pressing In Selection (with a stream selection), you should see the invalid regex message box.

If not, please provide your workflow ...

lhmouse commented 6 years ago

I was looking at 2.17.1111.668:

commit 11cbd63df420ffb6a058720e92ca0d120239369e (HEAD -> NewRegExEngine, origin/NewRegExEngine)
Author: Rainer Kottenhoff <rainer.kottenhoff@gmail.com>
Date:   Wed Nov 15 00:25:04 2017 +0100

    +refactoring:  find/replace in range/all methods, according to new regex engine

You are right only if the caret is not at the end of file. If it is, I get 4005 then 8273 .

RaiKoHoff commented 6 years ago

@lhmouse : regarding the missing features, TRE's road map (https://github.com/laurikari/tre#roadmap) says:

These are other features I'm planning to implement real soon now:

... but the last commit was years ago o_O ...

I found another regex lib with a BSD license: RE2 by Google

https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html

https://github.com/google/re2

It seems to be faster, but it also lacks some features:

(?=re)  before text matching «re» NOT SUPPORTED
(?!re)  before text not matching «re» NOT SUPPORTED
(?<=re) after text matching «re» NOT SUPPORTED
(?<!re) after text not matching «re» NOT SUPPORTED

Full RE2 Syntax:


RE2 regular expression syntax reference
-------------------------------------

Single characters:
.  any character, possibly including newline (s=true)
[xyz]  character class
[^xyz]  negated character class
\d  Perl character class
\D  negated Perl character class
[[:alpha:]]  ASCII character class
[[:^alpha:]]  negated ASCII character class
\pN  Unicode character class (one-letter name)
\p{Greek}  Unicode character class
\PN  negated Unicode character class (one-letter name)
\P{Greek}  negated Unicode character class

Composites:
xy  «x» followed by «y»
x|y  «x» or «y» (prefer «x»)

Repetitions:
x*  zero or more «x», prefer more
x+  one or more «x», prefer more
x?  zero or one «x», prefer one
x{n,m}  «n» or «n»+1 or ... or «m» «x», prefer more
x{n,}  «n» or more «x», prefer more
x{n}  exactly «n» «x»
x*?  zero or more «x», prefer fewer
x+?  one or more «x», prefer fewer
x??  zero or one «x», prefer zero
x{n,m}?  «n» or «n»+1 or ... or «m» «x», prefer fewer
x{n,}?  «n» or more «x», prefer fewer
x{n}?  exactly «n» «x»
x{}  (== x*) NOT SUPPORTED vim
x{-}  (== x*?) NOT SUPPORTED vim
x{-n}  (== x{n}?) NOT SUPPORTED vim
x=  (== x?) NOT SUPPORTED vim

Implementation restriction: The counting forms «x{n,m}», «x{n,}», and «x{n}» reject forms that create a minimum or maximum repetition count above 1000. Unlimited repetitions are not subject to this restriction.

Possessive repetitions:
x*+  zero or more «x», possessive NOT SUPPORTED
x++  one or more «x», possessive NOT SUPPORTED
x?+  zero or one «x», possessive NOT SUPPORTED
x{n,m}+  «n» or ... or «m» «x», possessive NOT SUPPORTED
x{n,}+  «n» or more «x», possessive NOT SUPPORTED
x{n}+  exactly «n» «x», possessive NOT SUPPORTED

Grouping:
(re)  numbered capturing group (submatch)
(?P<name>re)  named & numbered capturing group (submatch)
(?<name>re)  named & numbered capturing group (submatch) NOT SUPPORTED
(?'name're)  named & numbered capturing group (submatch) NOT SUPPORTED
(?:re)  non-capturing group
(?flags)  set flags within current group; non-capturing
(?flags:re)  set flags during re; non-capturing
(?#text)  comment NOT SUPPORTED
(?|x|y|z)  branch numbering reset NOT SUPPORTED
(?>re)  possessive match of «re» NOT SUPPORTED
re@>  possessive match of «re» NOT SUPPORTED vim
%(re)  non-capturing group NOT SUPPORTED vim

Flags:
i  case-insensitive (default false)
m  multi-line mode: «^» and «$» match begin/end line in addition to begin/end text (default false)
s  let «.» match «\n» (default false)
U  ungreedy: swap meaning of «x*» and «x*?», «x+» and «x+?», etc (default false)
Flag syntax is «xyz» (set) or «-xyz» (clear) or «xy-z» (set «xy», clear «z»).

Empty strings:
^  at beginning of text or line («m»=true)
$  at end of text (like «\z» not «\Z») or line («m»=true)
\A  at beginning of text
\b  at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
\B  not at ASCII word boundary
\G  at beginning of subtext being searched NOT SUPPORTED pcre
\G  at end of last match NOT SUPPORTED perl
\Z  at end of text, or before newline at end of text NOT SUPPORTED
\z  at end of text
(?=re)  before text matching «re» NOT SUPPORTED
(?!re)  before text not matching «re» NOT SUPPORTED
(?<=re)  after text matching «re» NOT SUPPORTED
(?<!re)  after text not matching «re» NOT SUPPORTED
re&  before text matching «re» NOT SUPPORTED vim
re@=  before text matching «re» NOT SUPPORTED vim
re@!  before text not matching «re» NOT SUPPORTED vim
re@<=  after text matching «re» NOT SUPPORTED vim
re@<!  after text not matching «re» NOT SUPPORTED vim
\zs  sets start of match (= \K) NOT SUPPORTED vim
\ze  sets end of match NOT SUPPORTED vim
\%^  beginning of file NOT SUPPORTED vim
\%$  end of file NOT SUPPORTED vim
\%V  on screen NOT SUPPORTED vim
\%#  cursor position NOT SUPPORTED vim
\%'m  mark «m» position NOT SUPPORTED vim
\%23l  in line 23 NOT SUPPORTED vim
\%23c  in column 23 NOT SUPPORTED vim
\%23v  in virtual column 23 NOT SUPPORTED vim

Escape sequences:
\a  bell (== \007)
\f  form feed (== \014)
\t  horizontal tab (== \011)
\n  newline (== \012)
\r  carriage return (== \015)
\v  vertical tab character (== \013)
\*  literal «*», for any punctuation character «*»
\123  octal character code (up to three digits)
\x7F  hex character code (exactly two digits)
\x{10FFFF}  hex character code
\C  match a single byte even in UTF-8 mode
\Q...\E  literal text «...» even if «...» has punctuation

\1  backreference NOT SUPPORTED
\b  backspace NOT SUPPORTED (use «\010»)
\cK  control char ^K NOT SUPPORTED (use «\001» etc)
\e  escape NOT SUPPORTED (use «\033»)
\g1  backreference NOT SUPPORTED
\g{1}  backreference NOT SUPPORTED
\g{+1}  backreference NOT SUPPORTED
\g{-1}  backreference NOT SUPPORTED
\g{name}  named backreference NOT SUPPORTED
\g<name>  subroutine call NOT SUPPORTED
\g'name'  subroutine call NOT SUPPORTED
\k<name>  named backreference NOT SUPPORTED
\k'name'  named backreference NOT SUPPORTED
\lX  lowercase «X» NOT SUPPORTED
\ux  uppercase «x» NOT SUPPORTED
\L...\E  lowercase text «...» NOT SUPPORTED
\K  reset beginning of «$0» NOT SUPPORTED
\N{name}  named Unicode character NOT SUPPORTED
\R  line break NOT SUPPORTED
\U...\E  upper case text «...» NOT SUPPORTED
\X  extended Unicode sequence NOT SUPPORTED

\%d123  decimal character 123 NOT SUPPORTED vim
\%xFF  hex character FF NOT SUPPORTED vim
\%o123  octal character 123 NOT SUPPORTED vim
\%u1234  Unicode character 0x1234 NOT SUPPORTED vim
\%U12345678  Unicode character 0x12345678 NOT SUPPORTED vim

Character class elements:
x  single character
A-Z  character range (inclusive)
\d  Perl character class
[:foo:]  ASCII character class «foo»
\p{Foo}  Unicode character class «Foo»
\pF  Unicode character class «F» (one-letter name)

Named character classes as character class elements:
[\d]  digits (== \d)
[^\d]  not digits (== \D)
[\D]  not digits (== \D)
[^\D]  not not digits (== \d)
[[:name:]]  named ASCII class inside character class (== [:name:])
[^[:name:]]  named ASCII class inside negated character class (== [:^name:])
[\p{Name}]  named Unicode property inside character class (== \p{Name})
[^\p{Name}]  named Unicode property inside negated character class (== \P{Name})

Perl character classes (all ASCII-only):
\d  digits (== [0-9])
\D  not digits (== [^0-9])
\s  whitespace (== [\t\n\f\r ])
\S  not whitespace (== [^\t\n\f\r ])
\w  word characters (== [0-9A-Za-z_])
\W  not word characters (== [^0-9A-Za-z_])

\h  horizontal space NOT SUPPORTED
\H  not horizontal space NOT SUPPORTED
\v  vertical space NOT SUPPORTED
\V  not vertical space NOT SUPPORTED

ASCII character classes:
[[:alnum:]]  alphanumeric (== [0-9A-Za-z])
[[:alpha:]]  alphabetic (== [A-Za-z])
[[:ascii:]]  ASCII (== [\x00-\x7F])
[[:blank:]]  blank (== [\t ])
[[:cntrl:]]  control (== [\x00-\x1F\x7F])
[[:digit:]]  digits (== [0-9])
[[:graph:]]  graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[[:lower:]]  lower case (== [a-z])
[[:print:]]  printable (== [ -~] == [ [:graph:]])
[[:punct:]]  punctuation (== [!-/:-@[-`{-~])
[[:space:]]  whitespace (== [\t\n\v\f\r ])
[[:upper:]]  upper case (== [A-Z])
[[:word:]]  word characters (== [0-9A-Za-z_])
[[:xdigit:]]  hex digit (== [0-9A-Fa-f])

Unicode character class names--general category:
C  other
Cc  control
Cf  format
Cn  unassigned code points NOT SUPPORTED
Co  private use
Cs  surrogate
L  letter
LC  cased letter NOT SUPPORTED
L&  cased letter NOT SUPPORTED
Ll  lowercase letter
Lm  modifier letter
Lo  other letter
Lt  titlecase letter
Lu  uppercase letter
M  mark
Mc  spacing mark
Me  enclosing mark
Mn  non-spacing mark
N  number
Nd  decimal number
Nl  letter number
No  other number
P  punctuation
Pc  connector punctuation
Pd  dash punctuation
Pe  close punctuation
Pf  final punctuation
Pi  initial punctuation
Po  other punctuation
Ps  open punctuation
S  symbol
Sc  currency symbol
Sk  modifier symbol
Sm  math symbol
So  other symbol
Z  separator
Zl  line separator
Zp  paragraph separator
Zs  space separator

Vim character classes:
\i  identifier character NOT SUPPORTED vim
\I  «\i» except digits NOT SUPPORTED vim
\k  keyword character NOT SUPPORTED vim
\K  «\k» except digits NOT SUPPORTED vim
\f  file name character NOT SUPPORTED vim
\F  «\f» except digits NOT SUPPORTED vim
\p  printable character NOT SUPPORTED vim
\P  «\p» except digits NOT SUPPORTED vim
\s  whitespace character (== [ \t]) NOT SUPPORTED vim
\S  non-white space character (== [^ \t]) NOT SUPPORTED vim
\d  digits (== [0-9]) vim
\D  not «\d» vim
\x  hex digits (== [0-9A-Fa-f]) NOT SUPPORTED vim
\X  not «\x» NOT SUPPORTED vim
\o  octal digits (== [0-7]) NOT SUPPORTED vim
\O  not «\o» NOT SUPPORTED vim
\w  word character vim
\W  not «\w» vim
\h  head of word character NOT SUPPORTED vim
\H  not «\h» NOT SUPPORTED vim
\a  alphabetic NOT SUPPORTED vim
\A  not «\a» NOT SUPPORTED vim
\l  lowercase NOT SUPPORTED vim
\L  not lowercase NOT SUPPORTED vim
\u  uppercase NOT SUPPORTED vim
\U  not uppercase NOT SUPPORTED vim
\_x  «\x» plus newline, for any «x» NOT SUPPORTED vim

Vim flags:
\c  ignore case NOT SUPPORTED vim
\C  match case NOT SUPPORTED vim
\m  magic NOT SUPPORTED vim
\M  nomagic NOT SUPPORTED vim
\v  verymagic NOT SUPPORTED vim
\V  verynomagic NOT SUPPORTED vim
\Z  ignore differences in Unicode combining characters NOT SUPPORTED vim

Magic:
(?{code})  arbitrary Perl code NOT SUPPORTED perl
(??{code})  postponed arbitrary Perl code NOT SUPPORTED perl
(?n)  recursive call to regexp capturing group «n» NOT SUPPORTED
(?+n)  recursive call to relative group «+n» NOT SUPPORTED
(?-n)  recursive call to relative group «-n» NOT SUPPORTED
(?C)  PCRE callout NOT SUPPORTED pcre
(?R)  recursive call to entire regexp (== (?0)) NOT SUPPORTED
(?&name)  recursive call to named group NOT SUPPORTED
(?P=name)  named backreference NOT SUPPORTED
(?P>name)  recursive call to named group NOT SUPPORTED
(?(cond)true|false)  conditional branch NOT SUPPORTED
(?(cond)true)  conditional branch NOT SUPPORTED
(*ACCEPT)  make regexps more like Prolog NOT SUPPORTED
(*COMMIT)  NOT SUPPORTED
(*F)  NOT SUPPORTED
(*FAIL)  NOT SUPPORTED
(*MARK)  NOT SUPPORTED
(*PRUNE)  NOT SUPPORTED
(*SKIP)  NOT SUPPORTED
(*THEN)  NOT SUPPORTED
(*ANY)  set newline convention NOT SUPPORTED
(*ANYCRLF)  NOT SUPPORTED
(*CR)  NOT SUPPORTED
(*CRLF)  NOT SUPPORTED
(*LF)  NOT SUPPORTED
(*BSR_ANYCRLF)  set \R convention NOT SUPPORTED pcre
(*BSR_UNICODE)  NOT SUPPORTED pcre

data-man commented 6 years ago

@RaiKoHoff https://github.com/kkos/oniguruma https://github.com/k-takata/Onigmo https://github.com/jpcre2/jpcre2 :)

RaiKoHoff commented 6 years ago

@data-man : you forgot https://github.com/intel/hyperscan ;-) What is your preferred regex engine ?

data-man commented 6 years ago

@RaiKoHoff PCRE/PCRE2, because their regex syntax is compatible with Perl.

RaiKoHoff commented 6 years ago

Mmmhhh...
http://sljit.sourceforge.net/regex_compare.html
http://sljit.sourceforge.net/regex_perf.html
From the point of view of design and speed, I would prefer Google's RE2, but it lacks support for backtracking features (a deliberate design goal, to avoid exponential run time and stack usage). If we don't really need them in NP3, I would go for it (Perl mode, not POSIX mode). ???

RaiKoHoff commented 6 years ago

Feel free to test the minor update (v.2.17.1115.675) on the beta channel, still using the TRE engine ... ;-) (By the way, the fuzzy matcher in TRE (its unique selling point) is switched off for now.)

lhmouse commented 6 years ago

how about this problem ?

rizonesoft commented 6 years ago

@RaiKoHoff I would say, go for it.

RaiKoHoff commented 6 years ago

@lhmouse: this problem should be solved in v.2.17.1115.675.

RaiKoHoff commented 6 years ago

I put two comparable NP3 versions (DeelX and TRE) on the beta channel. (Don't worry about the broken build number; the About dialog shows the regex engine name in the Scintilla version info.) Or build it yourself: https://github.com/RaiKoHoff/Notepad3/tree/OldRegExDeelX

RaiKoHoff commented 6 years ago

New versions are online on the beta channel (v.2.17.1116.680) in three (3) regex flavors (DeelX, TRE, RE2).
Why RE2: https://github.com/google/re2/wiki/WhyRE2

Advantages besides the above:
active development
only one (1) issue reported (today: 2017-11-16)
modern C++ interface

Disadvantages:
lack of (rarely used) Perl features (backreferences and look-around assertions are not supported) (I don't really see this as a disadvantage o_O)
it blows up the binary a bit :-/

No opinion yet:
fast on big data (?), does not matter that much in NP3

rizonesoft commented 6 years ago

@RaiKoHoff @lhmouse @data-man I would like to create a Notepad3 release on Friday, a week from now. Personally I would say we go for the RE2 engine, but this is not up to me.

Thank you guys for all the hard work and testing.

lhmouse commented 6 years ago

Backreferences are rarely used. But I think look-ahead and look-behind assertions are very helpful. It would be a pity to drop them.

I didn't see documentation about references of match groups when replacing text on https://github.com/google/re2/wiki/Syntax. Some people prefer $1, $2 etc while others prefer \1, \2, etc. Personally I prefer the latter but a number of regex engines support both.
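Supporting both reference styles can be done by normalizing the replacement string before it reaches the engine. A minimal sketch, assuming Python's `re` as the engine; `normalize_group_refs` is a hypothetical helper, and escaping of a literal dollar sign is left out for brevity:

```python
import re

def normalize_group_refs(replacement: str) -> str:
    # Hypothetical helper: rewrite each "$N" (N = 0-9) as Python's "\g<N>".
    # Existing "\N" references are left untouched.
    return re.sub(r"\$(\d)", lambda m: "\\g<" + m.group(1) + ">", replacement)

pattern = re.compile(r"(\w+)@(\w+)")

# Both styles produce the same replacement result.
dollar_style = pattern.sub(normalize_group_refs("$2 at $1"), "user@host")
backslash_style = pattern.sub(normalize_group_refs(r"\2 at \1"), "user@host")

print(dollar_style, "|", backslash_style)
```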

rizonesoft commented 6 years ago

@lhmouse Going for the stable RE2 engine makes sense to me. It will avoid many bugs in the future.

RaiKoHoff commented 6 years ago

I am preparing a PR for the current DeelX regex engine, with fixes for some issues ...

DeelX:
look-ahead and look-behind assertions
backreferences
support for both types of match group references (\0-9, $0-9)

Disadvantage: fault tolerant to invalid regex patterns (you will rarely see the red background, just no match).

I am not sure about the new engines - got no gut feeling yet ... :-/
Version (2.17.1117.675) ready on the beta channel ...

RaiKoHoff commented 6 years ago

I put an NP3 prototype (v.2.17.1122.678) on the beta channel (dir: ApproximateSearch_TRE) with two (2) regex engines inside (DeelX and TRE), trying to enable TRE's unique "Approximate Matching" (fuzzy matching) algorithm. If the "mark all occurrences immediately" feature is ACTIVE, the fuzzy value is not exact (100%), and the file is large, the UI may respond slowly. Try it, like it or hate it - all feedback is welcome (please keep in mind that this is a beta!). By the way, the 2nd regex engine (TRE) does not blow up the NP3 executable that much (it is pure C code).
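To give an idea of why approximate matching costs more than exact matching, here is a sketch of a classic dynamic-programming formulation (Sellers-style edit distance against any substring). This is an illustration only and not TRE's implementation; `best_approx_match` is a hypothetical name.

```python
def best_approx_match(pattern: str, text: str):
    """Return (end_index, cost) of the best approximate occurrence.

    cost is the edit distance (insert/delete/substitute) between `pattern`
    and the closest substring of `text`; end_index is where it ends.
    """
    m = len(pattern)
    prev = list(range(m + 1))          # column for the empty text prefix
    best_end, best_cost = 0, prev[m]
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)            # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            subst = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + subst,  # match or substitution
                         prev[i] + 1,          # extra character in the text
                         cur[i - 1] + 1)       # pattern character missing in the text
        if cur[m] < best_cost:
            best_end, best_cost = j, cur[m]
        prev = cur
    return best_end, best_cost

print(best_approx_match("regex", "a regx engine"))
```

The work grows with len(pattern) * len(text) for every search, so re-running it on each keystroke over a large file is a plausible source of the sluggish UI mentioned above.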

RaiKoHoff commented 6 years ago

@lhmouse : the branch can be found at https://github.com/RaiKoHoff/Notepad3/tree/Fuzzy_Matching

craigo- commented 6 years ago

A couple of things...

(Notepad3 64-bit 2.17.1120.677, DeelX)

(Warning: I am a RegEx noob!)

(1) RegEx Syntax

I'm coming to the party late, but I am used to the old syntax... e.g. (in Notepad2-mod):

In the specified version of Notepad3 (DeelX), the same search string returns no hits (or the string is simply invalid):

I have to double-escape the backslashes to make them work:

Is this how it is supposed to work? The built-in RegEx documentation still mentions single slashes... Is this perhaps something to do with the "Transform backslashes" tickbox being forced on when enabling Regular expression searching?

(2) Spammed by "The specified text was not found" popups

I'm having great difficulty even typing a backslash in the search field (in RegEx mode). I get loads of "The specified text was not found" popups.

Steps to reproduce:

Open text file
Ctrl-F
(If there is something on the clipboard, it populates the Search String field with its contents. Delete.)
Tick "Regular expression search"
Start the RegEx search string with a backslash

You immediately get a whole bunch of popups:

lhmouse commented 6 years ago

The 'transform backslashes' option should be forced OFF in case of regex search as regex engines should handle them already. The text-not-found spamming issue is confirmed here.

RaiKoHoff commented 6 years ago

Thanks for testing, guys :-)

@lhmouse is right: backslashes are handled by the regex engines themselves. This is why I disabled the 'Transform backslashes' option in regex search (wildcard search is based on regex) and left it checked in its disabled state, to indicate that backslashes are handled accordingly.

@craigo-: You are right too: the regex search should work with single backslashes. Unfortunately I introduced a bug that processed backslashes twice :-/. This is fixed in version v.2.17.1122.679, available on the beta channel. The "text-not-found spamming issue" was closely related to it :-/ and is fixed too.
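The rule discussed here (transform backslashes only for plain-text searches, never before handing a pattern to a regex engine) can be sketched as follows. `prepare_search_text` and its small escape table are hypothetical and only illustrate the idea, not the NP3 code:

```python
def prepare_search_text(user_input: str, regex_mode: bool) -> str:
    """Return the string that is actually handed to the search routine."""
    if regex_mode:
        # The regex engine interprets backslashes itself; transforming them
        # here as well would process every escape twice.
        return user_input
    # Plain-text search with "Transform backslashes": expand a few escapes.
    escapes = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}
    out = []
    i = 0
    while i < len(user_input):
        ch = user_input[i]
        if ch == "\\" and i + 1 < len(user_input) and user_input[i + 1] in escapes:
            out.append(escapes[user_input[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(repr(prepare_search_text(r"\d+\t", regex_mode=True)))
print(repr(prepare_search_text(r"a\tb", regex_mode=False)))
```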

RaiKoHoff commented 6 years ago

Two flavors (std and approx-search-trial) of v.2.17.1122.682 are online on beta channel (latest fixes).

craigo- commented 6 years ago

Thanks, @RaiKoHoff. I can confirm that the RegEx problems I described above are fixed in betas 679 and 682.

RaiKoHoff commented 6 years ago

Staying with the DeelX regex engine, I am closing this issue for now ...

rizonesoft / Notepad3

TRE regex engine as replacement for DeelX #175