Infinite loop when using a blank regular expression

mudge / re2

Ruby bindings to RE2, a "fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python".

http://mudge.name/re2/

BSD 3-Clause "New" or "Revised" License

129 stars 13 forks source link

Infinite loop when using a blank regular expression #32

Closed stanhu closed 7 years ago

stanhu commented 7 years ago

Here's a simple test case which causes an infinite loop:

regex = RE2::Regexp.new('')
regex.scan('test').map { |match| match }

This also fails:

regex = RE2::Regexp.new('()')
regex.scan('test').map { |match| match }

In Ruby, this works as follows:

'test'.scan('').map { |match| match }
=> ["", "", "", "", ""]

Notes from the re2.h file about this: https://github.com/google/re2/blob/master/re2/re2.h#L133-L138

Relates to these lines: https://github.com/mudge/re2/blob/v1.0.0/ext/re2/re2.cc#L224-L225

stanhu commented 7 years ago

It looks like this is a problem with the implementation of Scanner: it loops indefinitely as long as there are matches even if the input wasn't advanced.

Correct form:

r = RE2::Regexp.new('')
scanner = r.scan('test')
scanner.scan.to_a

Incorrect form:

r = RE2::Regexp.new('')
r.scan('test').map { |x| x }

mudge commented 7 years ago

Great catch, @stanhu: the comments from re2.h are particularly useful. What do you think the right behaviour should be re:

// If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.

Should we advance the input ourselves or terminate the loop?

mudge commented 7 years ago

Looks like advancing the input ourselves using remove_prefix whenever we encounter an empty match or an empty capturing match will solve the infinite loop but it's a little finicky getting it to return nil when the end of the input is encountered.

stanhu commented 7 years ago

Yeah, I attempted the same with this, but it did not quite work:

--- a/ext/re2/re2.cc
+++ b/ext/re2/re2.cc
@@ -204,12 +204,15 @@ static VALUE re2_scanner_rewind(VALUE self) {
  */
 static VALUE re2_scanner_scan(VALUE self) {
   int i;
+  int found_match = 0;
+  size_t original_size;
   re2_pattern *p;
   re2_scanner *c;
   VALUE result;

   Data_Get_Struct(self, re2_scanner, c);
   Data_Get_Struct(c->regexp, re2_pattern, p);
+  original_size = c->input->size();

   vector<RE2::Arg> argv(c->number_of_capturing_groups);
   vector<RE2::Arg*> args(c->number_of_capturing_groups);
@@ -228,6 +231,7 @@ static VALUE re2_scanner_scan(VALUE self) {
       if (matches[i].empty()) {
         rb_ary_push(result, Qnil);
       } else {
+        found_match = 1;
         rb_ary_push(result, ENCODED_STR_NEW(matches[i].data(),
               matches[i].size(),
               p->pattern->options().utf8() ? "UTF-8" : "ISO-8859-1"));
@@ -237,6 +241,17 @@ static VALUE re2_scanner_scan(VALUE self) {
     result = Qnil;
   }

+  /* In case we matched a null string, advance the pointer to avoid an infinite loop */
+  if (!found_match &&
+      c->number_of_capturing_groups &&
+      c->input->size() &&
+      (original_size == c->input->size())) {
+      c->input->remove_prefix(1);
+  }
+
   return result;
 }

At GitLab, we're working around it by changing the Ruby code by iterating over match. I'll send the patch later.

mudge commented 7 years ago

I had something working by removing a 1 byte prefix from input (if the input size is not zero) in two cases:

When the match returned in a capturing group is empty;
When there is a match (viz. FindAndConsumeN returns true) but the number of capturing groups is zero.

I'm out for the evening but will take a look at this as soon as possible as the main challenge is keeping the logic clean.

The other thing this makes me wonder is whether we should distinguish between an empty capturing group (e.g. by returning an empty string) and a match without a capturing group at all. That would likely be backward incompatible though so we can defer that for now.

stanhu commented 7 years ago

FYI, this is what we did in GitLab: https://gitlab.com/nick.thomas/gitlab-ce/commit/dabc1fa388143808bab792448504dac4bae8992b

lupine commented 7 years ago

Ruby's Regexp distinguishes like so:

"foo".scan(//) # ["", "", "", ""]
"foo".scan(/()/) # [[""], [""], [""], [""]]
"foo".scan(/foo()/) # [[""]]
"foo".scan(/(foo())/) [["foo", ""]]

mudge commented 7 years ago

With the change in #35, these would be the corresponding results:

[9] pry(main)> RE2::Regexp.new('').scan("foo").to_a
=> [[], [], []]
[10] pry(main)> RE2::Regexp.new('()').scan("foo").to_a
=> [[nil], [nil], [nil]]
[11] pry(main)> RE2::Regexp.new('foo()').scan("foo").to_a
=> [[nil]]
[12] pry(main)> RE2::Regexp.new('(foo())').scan("foo").to_a
=> [["foo", nil]]

Aside from the empty string/nil difference, re2 results one less match with an empty pattern which makes intuitive sense to me (it matches each character and then stops) but perhaps I'm missing something?

mudge commented 7 years ago

Mmm, the base case in Ruby of an empty string and empty pattern is making me wonder if I should preserve this in re2 as well:

> ''.scan(//)
=> [""]

In order to do this, I might need to rework scanning to store if a scanner is exhausted: that is, where matching has advanced beyond the end of the input string, not just when the input size is 0.

mudge commented 7 years ago

Updated #35 to match Ruby's behaviour (while keeping compatibility re returning nil versus an empty string):

[1] pry(main)> RE2::Regexp.new('').scan('foo').to_a
=> [[], [], [], []]
[2] pry(main)> RE2::Regexp.new('()').scan('foo').to_a
=> [[nil], [nil], [nil], [nil]]
[3] pry(main)> RE2::Regexp.new('foo()').scan('foo').to_a
=> [[nil]]
[4] pry(main)> RE2::Regexp.new('(foo())').scan('foo').to_a
=> [["foo", nil]]

Does this now match your expectations?

mudge commented 7 years ago

I've released v1.1.0 which should hopefully fix this problem: please give it a whirl and let me know if it solves your issue.

lupine commented 7 years ago

Thanks for the quick turnaround on this @mudge ! https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/13036 updates gitlab to use re2 v1.1.0 and allows me to delete my homegrown scan :)

I tested the nil and empty match cases on a 1MiB build trace and they were both very fast. Great work!

mudge commented 7 years ago

You're very welcome! (Deleting code makes me very happy indeed.)

stanhu commented 7 years ago

@mudge Thank you!