compile.cc should not mark byte range for \b.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

diff -r e50682ab1eb9 re2/compile.cc
--- a/re2/compile.cc    Tue Oct 29 13:28:32 2013 +0800
+++ b/re2/compile.cc    Wed Oct 30 04:04:35 2013 +0800
@@ -434,12 +434,21 @@
   if (empty & (kEmptyBeginLine|kEmptyEndLine))
     prog_->MarkByteRange('\n', '\n');
   if (empty & (kEmptyWordBoundary|kEmptyNonWordBoundary)) {
+    // DFA use kFlagLastWord, OnePass use Prog::EmptyFlags,
+    // it's unnecessary to build this bytemap.
+    // XXX: without marking any range,
+    // search_test fails on such r"\b", r"\b\b\b",
+    // i just mark the most common \n to prevent it.
+    // maybe the optimizer? i will dig more.
+    prog_->MarkByteRange('\n', '\n');
+    /*
     int j;
     for (int i = 0; i < 256; i = j) {
       for (j = i+1; j < 256 && Prog::IsWordChar(i) == Prog::IsWordChar(j); j++)
         ;
       prog_->MarkByteRange(i, j-1);
     }
+    */
   }
   return Frag(id, PatchList::Mk(id << 1));
 }

Please provide any additional information below.
NOTE: If you have a suggested patch, please see
http://code.google.com/p/re2/wiki/Contribute
for information about sending it in for review.  Thanks.

Original issue reported on code.google.com by Lyricconch on 29 Oct 2013 at 8:11

GoogleCodeExporter commented 9 years ago

I believe the code is correct as is, and that your change would make tests fail.
From your comment it sounds like your change DOES make tests fail.

Original comment by rsc@golang.org on 29 Oct 2013 at 8:15

Changed state: WorkingAsIntended

GoogleCodeExporter commented 9 years ago

with
"prog_->MarkByteRange('\n', '\n');".
tests are all pass.

i am digging how to pass tests without marking any characters.
the \b should not require any byterange_ set.

Original comment by Lyricconch on 29 Oct 2013 at 9:38

GoogleCodeExporter commented 9 years ago

It is necessary to mark the boundaries so that (for example) bytemap_['@'] and 
bytemap_['A'] are different, so that next_[bytemap_['@']] and 
next_[bytemap_['A']] are different, so that if you compute next_[bytemap_['@']] 
using @ as the next character, you do not attempt to reuse that work if you see 
an A in the same state.

Original comment by rsc@golang.org on 31 Oct 2013 at 6:51

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

the flag_ of DState is used to trading DState size with DState count.
on first look, it seems not good, a bit in flag_ will cause DState count 
doubled,
but DState size is fixed while DState count can be cut by insts_, 
with unused flag those states can be merged into one.

i think it may worth to move utf-8 handling out of DFA,
just like byterange_, we can collect meta information from regex compile time.
the InlinedSearchLoop just calculate those we care, and emit "Property of Rune" 
instead bytemap_[c].

these cause "MB/s" much lower while DFA is full-state cached
(900MB/s is very nice, but should run regex enough time), 
but should greatly improves DFA startup time and DFA memory usage(the "us/op").
(DFA full-cache requires bytemap_range_ * bytemap_range_ / 2 entitis per 
Rune(support 2 bytes per Rune), 
while "property" only require "kinds of exclusive property" entities per Rune)

Original comment by Lyricconch on 2 Nov 2013 at 9:20

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

there is a kFlagLastWord in DState::flag_.
so that '@' or 'A' will come from different state according the prev char of 
('@' or 'A'). 
with different state they come from, splitting them from bytemap_ is no longer 
necessary.

the only exception is the first byte of DFA::RunOneByteOnState
(at this point, 'A' or '@' share the same start state).

Original comment by Lyricconch on 2 Nov 2013 at 9:26

lamuguo / re2

compile.cc should not mark byte range for \b. #96