Open safinaskar opened 4 years ago
Let me state again my ultimate goal: I want to write fast stdio. I want MYIO (i. e. https://godbolt.org/z/BJJPYv ) to be as fast as RAWREAD ( https://godbolt.org/z/HtY3Ym ).
Also, here is another chain of incremental changes. This time without gotos.
https://godbolt.org/z/HtY3Ym (RAWREAD) https://godbolt.org/z/WoErQL https://godbolt.org/z/qA7NzQ https://godbolt.org/z/VRC58B https://godbolt.org/z/Lrcbc4
Last step performed using this transformation: from this:
for(;;){
for(;;){
A;
if(COND)break;
B;
}
C;
}
to this:
for(;;){
for(;;){
A;
if(!COND)break;
C;
}
B;
}
As you can see, this transformation changed vectorized code to non-vectorized, and this is a bug.
Also, this bug is possibly related: llvm/llvm-project#43705 (I found it when implementing wc -l
, again)
To understand why something did not vectorize, we could use "-Rpass-analysis=loop-vectorize" or similar to get some diagnostics.
Another way to look at the problem is to examine the differences in IR before we get to vectorization and see if there's a failed canonicalization: https://godbolt.org/z/qCHWSi
Maybe we could do something different/better in SCEV to get the 'sext' out of the loop and make the 2 cases identical?
Extended Description
Consider this two sources: https://godbolt.org/z/j6uSL6 (STILLFAST) and https://godbolt.org/z/gcvDJV (BECAMESLOW) (all tests performed on clang 10, even if godbolt pages say something else). The only difference is placement of
goto
statement. This should not change meaning of the code. But one code gives vectorized code, and other is not. Change in speed is dramatic in my tests. So, changing placement ofgoto
prevents optimizations. This is a bug, and so I reported it.Now let me say what that code means.
Consider this source (I will codename it as STDGETC)
This is, of course, simple implementation of
wc -l
. And it is slow in my tests.Let's rewrite it so: https://godbolt.org/z/HtY3Ym (I name this RAWREAD)
Now the code is very fast, but it is too big.
So I've got an idea: to write fast stdio implementation. Such implementation should combine beautiful
getc
API with speed of raw implementation.Here is first attempt to write such implementation: https://godbolt.org/z/BJJPYv (I name this MYIO). Of course, it is not full implementation yet. And it performs slower than RAWREAD.
So, I started to investigate why it performs slower. So I started to make incremental changes in RAWREAD until I get to MYIO. To catch that moment when code starts to be slow.
So here is chain of my changes to RAWREAD:
https://godbolt.org/z/QVC7Cz https://godbolt.org/z/HSP3hV https://godbolt.org/z/3BzQTi https://godbolt.org/z/s8HRr_ https://godbolt.org/z/yUptWS https://godbolt.org/z/P35x3N https://godbolt.org/z/g5JD-D https://godbolt.org/z/YPVh3j https://godbolt.org/z/j6uSL6 (at this point assembly changes, but code is still vectorized and fast, I will call this code STILLFAST) https://godbolt.org/z/gcvDJV (I will call this BECAMESLOW)
So, the last two sources are essentially same, but the second is slower. And this is a bug.
Debian stretch (with some packages from Debian buster), x86_64, Linux 4.19
clang-10 -v
output:This is clang installed from https://apt.llvm.org
Intel Core i7, hyper-threading is disabled at BIOS level, /proc/cpuinfo reports 4 cores