ruby / strscan

Provides lexical scanning operations on a String.
BSD 2-Clause "Simplified" License
81 stars 32 forks source link

Optimize scan method #107

Closed naitoh closed 2 weeks ago

naitoh commented 2 weeks ago

CRuby

Why?

1. Remove duplicate if (S_RESTLEN(p) < RSTRING_LEN(pattern)) return Qnil; checks in !headonly.

A similar check is made within rb_memsearch() within !headonly.

https://github.com/ruby/ruby/blob/cf8388f76c4c2ff2f46d0d2aa2cf5186e05ff606/re.c#L251-L256

long
rb_memsearch(const void *x0, long m, const void *y0, long n, rb_encoding *enc)
{
    const unsigned char *x = x0, *y = y0;

    if (m > n) return -1;

This means the following : if (RSTRING_LEN(pattern) > S_RESTLEN(p)) return -1;

Both checks are the same.

2. Removed unnecessary use of rb_enc_get()

In rb_strseq_index(), the result of rb_enc_check() is used.

return strseq_core(str_ptr, str_ptr_end, str_len, sub_ptr, sub_len, offset, enc);

Benchmark

It shows String as a pattern is 1.23x faster than Regexp as a pattern.

$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     9.300M i/s -      9.509M times in 1.022507s (107.53ns/i)
          regexp_var     9.110M i/s -      9.262M times in 1.016682s (109.76ns/i)
              string     9.051M i/s -      9.304M times in 1.028047s (110.49ns/i)
          string_var    11.187M i/s -     11.722M times in 1.047826s (89.39ns/i)
Calculating -------------------------------------
              regexp    10.197M i/s -     27.899M times in 2.735904s (98.06ns/i)
          regexp_var    10.198M i/s -     27.331M times in 2.680120s (98.06ns/i)
              string    10.089M i/s -     27.152M times in 2.691312s (99.12ns/i)
          string_var    12.530M i/s -     33.562M times in 2.678533s (79.81ns/i)

Comparison:
          string_var:  12529824.3 i/s
          regexp_var:  10197773.2 i/s - 1.23x  slower
              regexp:  10197371.0 i/s - 1.23x  slower
              string:  10088701.3 i/s - 1.24x  slower

JRuby

Why?

1. Remove duplicate if (restLen() < pattern.size()) return context.nil; checks in !headonly.

https://github.com/ruby/strscan/blob/d31274f41b7c1e28f23d58cf7bfea03baa818cb7/ext/jruby/org/jruby/ext/strscan/RubyStringScanner.java#L371-L373

This means the following :

if (str.size() - curr < pattern.size()) return context.nil;

A similar check is made within StringSupport#index() within !headonly.

https://github.com/jruby/jruby/blob/be7815ec02356a58891c8727bb448f0c6a826d96/core/src/main/java/org/jruby/util/StringSupport.java#L1706-L1720

    public static int index(ByteList source, ByteList other, int offset, Encoding enc) {
        int sourceLen = source.realSize();
        int sourceBegin = source.begin();
        int otherLen = other.realSize();

        if (otherLen == 0) return offset;
        if (sourceLen - offset < otherLen) return -1;

This means the following : if (strBL.realSize() - (strBeg + curr) < patternBL.realSize()) return -1;

Both checks are the same.

2. Use currPtr() instead of strBeg + curr. Because they are identical.

https://github.com/ruby/strscan/blob/d31274f41b7c1e28f23d58cf7bfea03baa818cb7/ext/jruby/org/jruby/ext/strscan/RubyStringScanner.java#L267-L268

https://github.com/ruby/strscan/blob/d31274f41b7c1e28f23d58cf7bfea03baa818cb7/ext/jruby/org/jruby/ext/strscan/RubyStringScanner.java#L359-L361

Benchmark

It shows String as a pattern is 2.43x faster than Regexp as a pattern.

$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     7.371M i/s -      7.352M times in 0.997443s (135.67ns/i)
          regexp_var     7.303M i/s -      7.262M times in 0.994284s (136.92ns/i)
              string    13.596M i/s -     13.535M times in 0.995475s (73.55ns/i)
          string_var    15.032M i/s -     14.942M times in 0.994038s (66.53ns/i)
Calculating -------------------------------------
              regexp     9.120M i/s -     22.113M times in 2.424781s (109.65ns/i)
          regexp_var     8.914M i/s -     21.910M times in 2.458050s (112.19ns/i)
              string    22.174M i/s -     40.789M times in 1.839495s (45.10ns/i)
          string_var    19.994M i/s -     45.095M times in 2.255454s (50.02ns/i)

Comparison:
              string:  22174077.0 i/s
          string_var:  19993967.8 i/s - 1.11x  slower
              regexp:   9119635.2 i/s - 2.43x  slower
          regexp_var:   8913743.3 i/s - 2.49x  slower
naitoh commented 2 weeks ago

@kou

Could you add how to optimize this to the description?

Sorry. I added it to the description.

kou commented 2 weeks ago

Thanks. How about using 1 PR for 1 optimization?

It seems that this has 4 optimizations:

  1. CRuby: (S_RESTLEN(p) < RSTRING_LEN(pattern))
  2. CRuby: rb_enc_get()
  3. JRuby: (restLen() < pattern.size())
  4. JRuby: currPtr()
naitoh commented 2 weeks ago

How about using 1 PR for 1 optimization?

OK, I see.