llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.81k stars 11.91k forks source link

Changing place of "goto" makes code slower #43725

Open safinaskar opened 4 years ago

safinaskar commented 4 years ago
Bugzilla Link 44380
Version trunk
OS Linux
CC @DougGregor,@fhahn,@LebedevRI,@zygoloid,@rotateright

Extended Description

Consider this two sources: https://godbolt.org/z/j6uSL6 (STILLFAST) and https://godbolt.org/z/gcvDJV (BECAMESLOW) (all tests performed on clang 10, even if godbolt pages say something else). The only difference is placement of goto statement. This should not change meaning of the code. But one code gives vectorized code, and other is not. Change in speed is dramatic in my tests. So, changing placement of goto prevents optimizations. This is a bug, and so I reported it.

Now let me say what that code means.

Consider this source (I will codename it as STDGETC)

int c;
int result = 0;
while ((c = getc_unlocked(stdin)) != EOF)
  if (c == '\n')
    ++result;

This is, of course, simple implementation of wc -l. And it is slow in my tests.

Let's rewrite it so: https://godbolt.org/z/HtY3Ym (I name this RAWREAD)

Now the code is very fast, but it is too big.

So I've got an idea: to write fast stdio implementation. Such implementation should combine beautiful getc API with speed of raw implementation.

Here is first attempt to write such implementation: https://godbolt.org/z/BJJPYv (I name this MYIO). Of course, it is not full implementation yet. And it performs slower than RAWREAD.

So, I started to investigate why it performs slower. So I started to make incremental changes in RAWREAD until I get to MYIO. To catch that moment when code starts to be slow.

So here is chain of my changes to RAWREAD:

https://godbolt.org/z/QVC7Cz https://godbolt.org/z/HSP3hV https://godbolt.org/z/3BzQTi https://godbolt.org/z/s8HRr_ https://godbolt.org/z/yUptWS https://godbolt.org/z/P35x3N https://godbolt.org/z/g5JD-D https://godbolt.org/z/YPVh3j https://godbolt.org/z/j6uSL6 (at this point assembly changes, but code is still vectorized and fast, I will call this code STILLFAST) https://godbolt.org/z/gcvDJV (I will call this BECAMESLOW)

So, the last two sources are essentially same, but the second is slower. And this is a bug.

Debian stretch (with some packages from Debian buster), x86_64, Linux 4.19

clang-10 -v output:

clang version 10.0.0-+20191211115110+02168549172-1~exp1~20191211105657.1646 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/6
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/6.3.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/6
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/6.3.0
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/6
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/6.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/6
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/6.3.0
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/6.3.0
Candidate multilib: .;@m64
Selected multilib: .;@m64

This is clang installed from https://apt.llvm.org

Intel Core i7, hyper-threading is disabled at BIOS level, /proc/cpuinfo reports 4 cores

safinaskar commented 4 years ago

Let me state again my ultimate goal: I want to write fast stdio. I want MYIO (i. e. https://godbolt.org/z/BJJPYv ) to be as fast as RAWREAD ( https://godbolt.org/z/HtY3Ym ).

Also, here is another chain of incremental changes. This time without gotos.

https://godbolt.org/z/HtY3Ym (RAWREAD) https://godbolt.org/z/WoErQL https://godbolt.org/z/qA7NzQ https://godbolt.org/z/VRC58B https://godbolt.org/z/Lrcbc4

Last step performed using this transformation: from this:

for(;;){
  for(;;){
    A;
    if(COND)break;
    B;
  }
  C;
}

to this:

for(;;){
  for(;;){
    A;
    if(!COND)break;
    C;
  }
  B;
}

As you can see, this transformation changed vectorized code to non-vectorized, and this is a bug.

Also, this bug is possibly related: llvm/llvm-project#43705 (I found it when implementing wc -l, again)

rotateright commented 4 years ago

To understand why something did not vectorize, we could use "-Rpass-analysis=loop-vectorize" or similar to get some diagnostics.

Another way to look at the problem is to examine the differences in IR before we get to vectorization and see if there's a failed canonicalization: https://godbolt.org/z/qCHWSi

Maybe we could do something different/better in SCEV to get the 'sext' out of the loop and make the 2 cases identical?