Optimize case-insensitive matching for all-alphabetical cases

julian-klode commented 8 years ago

Currently re2c generates switches like

switch(c) {
case 'T':
case 't':
    ...
}

I found over the weekend - while writing https://github.com/julian-klode/triehash - that this can be substantially optimised: If all cases are alphabetical characters, we can transform it into:

switch(c | 0x20) { case 't':   ... }

(which one is chosen does not matter). In fact, that's not only true for alphabetical characters, but for 128 characters in a 256 ANSI like character set. The check I use for that in Perl is:

     # Setting the lowercase flag in the character produces a different
     # character, the character would thus not be matched.
     return 1 if ((ord($char) | 0x20) != ord(lc($char)));

     # A word is also ambiguous if any character in lowercase can be reached
     # by ORing 0x20 from another character in the charset that is not a
     # lowercase character of the current character.
     # Assume that we have UTF-8 and the most significant bit can be set
     for (my $i = 0; $i < 256; $i++) {
         return 1 if (($i | 0x20) == ord(lc($char)) && lc(chr($i)) ne lc($char));
     }
     return 0;

This can provide a good performance boost, as it now translates to:

load, or, cmp, cond-jmp

rather than

load, cmp, cond-jmp, cmp, cond-jmp

In my case, with triehash, this resulted in my benchmark performance improving from 714 ns to 495 ns, compared to 475 ns for the case-sensitive operation (re2c benchmarks similarly for the unoptimised case, so I think the result will be similar).

I'm not sure how feasible this is to implement for re2c, but it would be nice to have.

skvadrik commented 8 years ago

Hi! I think, this is the kind of optimisation re2c should definitely have (probably in a more general form).

In fact, re2c already has a similar optimisation (enabled with -b option) that tries to reason about bit patterns of the matched characters. It is not powerful enough to handle this case, though.

I experimented a little bit with your example (I expected gcc -O2 to optimize both cases equally well):

int f(char c)
{
    switch (c) {
        case 'A'...'Z':
        case 'a'...'z':
            return 1;
        default:
            return 0;
    }
}

versus

int f(char c)
{
    switch (c | 0x20) {
        case 'a'...'z':
            return 1;
        default:
            return 0;
    }
}

However, in the first case gcc -O2 -c 1.c -o 1.o && objdump -d 1.o:

0000000000000000 <f>:
   0:   8d 4f bf                lea    -0x41(%rdi),%ecx
   3:   31 c0                   xor    %eax,%eax
   5:   80 f9 39                cmp    $0x39,%cl
   8:   77 1b                   ja     25 <f+0x25>
   a:   b8 01 00 00 00          mov    $0x1,%eax
   f:   48 ba ff ff ff 03 ff    movabs $0x3ffffff03ffffff,%rdx
  16:   ff ff 03 
  19:   48 d3 e0                shl    %cl,%rax
  1c:   48 85 d0                test   %rdx,%rax
  1f:   0f 95 c0                setne  %al
  22:   0f b6 c0                movzbl %al,%eax
  25:   f3 c3                   repz retq

while in the second case gcc -O2 -c 2.c -o 2.o && objdump -d 2.o:

0000000000000000 <f>:
   0:   83 cf 20                or     $0x20,%edi
   3:   31 c0                   xor    %eax,%eax
   5:   83 ef 61                sub    $0x61,%edi
   8:   40 80 ff 19             cmp    $0x19,%dil
   c:   0f 96 c0                setbe  %al
   f:   c3                      retq

Clang does somewhat better in the first case, but still not as good as in the second case.

So it's up to re2c after all.

Technically this optimization should be not hard to implement. It is low-level and doesn't interfere with complex algorithms involved in DFA construction, etc. It can be implemented in a separate pass during code generation (re2c already has the necessary infrastructure).

What we need is a good heuristic. One should probably start simple: handle one special case of case-insensitive all-alphabetical characters. But I think should be a lot research on the topic.

The results can be verified using --skeleton.

skvadrik commented 8 years ago

Clang does somewhat better in the first case

After a closer look this is not true: https://llvm.org/bugs/show_bug.cgi?id=30531

skvadrik / re2c

Optimize case-insensitive matching for all-alphabetical cases #155