llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Next gen non-allocating constexpr-folding future-promise does not optimise well on clang #24026

Open 1feda113-18e2-42e3-8624-9c17d4d32ec3 opened 9 years ago

1feda113-18e2-42e3-8624-9c17d4d32ec3 commented 9 years ago
Bugzilla Link 23652
Version trunk
OS Linux
CC @asl,@chandlerc

Extended Description

As part of my work on next-generation non-allocating constexpr-folding future-promises for the Boost.Thread rewrite (and in the hope that these become the next STL future-promises), I have found that clang currently does not optimise them as well as GCC does.

I have spoken with Chandler Carruth about these at C++ Now, and he may chime in here about the importance of clang matching GCC's performance on them. I am also raising the issue with colleagues on the MSVC team, as poor old VS2015 generates about 3000 opcodes for the last example :(.
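
For readers unfamiliar with the design, the essential idea is that the future-promise keeps its state inline rather than in heap-allocated shared state, so when everything is visible to the optimiser the whole construct should constant-fold away. The following is a minimal sketch of that shape, assuming a simplified value-or-empty design; the toy_monad name and its members are illustrative only and are NOT the real boost::spinlock::lightweight_futures::monad.

// Minimal illustrative sketch only, not the real lightweight_futures
// monad: the value is stored inline in the object alongside a small
// state discriminant, nothing is heap allocated, and a trivial
// set-then-get sequence is therefore pure data flow the optimiser can
// fold to a constant.
#include <stdexcept>

template <class T> class toy_monad
{
  enum class state { empty, value };
  state _state;
  T _value;  // inline storage, no shared heap state

public:
  toy_monad() : _state(state::empty), _value() {}
  explicit toy_monad(T v) : _state(state::value), _value(v) {}

  T get() const
  {
    if (_state != state::value)
      throw std::runtime_error("no value");  // the unknown-state path
    return _value;
  }
};

int test1_sketch()
{
  toy_monad<int> m(5);  // state is statically known to the optimiser here ...
  return m.get();       // ... so ideally the body folds to "return 5"
}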

Anyway, as a quick summary: under these next-gen future-promises this sequence:

extern BOOST_SPINLOCK_NOINLINE int test1()
{
  using namespace boost::spinlock::lightweight_futures;
  monad<int, true> m(5);
  return m.get();
}

... should turn into:

0000000000000000 <_Z5test1v>:
   0: b8 05 00 00 00          mov    $0x5,%eax
   5: c3                      retq

... and indeed does under GCC, but under clang 3.6 and 3.7 turns into:

0000000000000000 <_Z5test1v>:
   0: 53                      push   %rbx
   1: 48 83 ec 20             sub    $0x20,%rsp
   5: c7 44 24 08 05 00 00    movl   $0x5,0x8(%rsp)
   c: 00
   d: c7 44 24 18 01 00 00    movl   $0x1,0x18(%rsp)
  14: 00
  15: 48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  1a: e8 00 00 00 00          callq  1f <_Z5test1v+0x1f>
  1f: 89 c3                   mov    %eax,%ebx
  21: 8b 44 24 18             mov    0x18(%rsp),%eax
  25: ff c8                   dec    %eax
  27: 83 f8 03                cmp    $0x3,%eax
  2a: 77 24                   ja     50 <_Z5test1v+0x50>
  2c: ff 24 c5 00 00 00 00    jmpq   0x0(,%rax,8)
  33: 48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  38: e8 00 00 00 00          callq  3d <_Z5test1v+0x3d>
  3d: eb 09                   jmp    48 <_Z5test1v+0x48>
  3f: 48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
  46: 00 00
  48: c7 44 24 18 00 00 00    movl   $0x0,0x18(%rsp)
  4f: 00
  50: 89 d8                   mov    %ebx,%eax
  52: 48 83 c4 20             add    $0x20,%rsp
  56: 5b                      pop    %rbx
  57: c3                      retq
  58: 48 89 c3                mov    %rax,%rbx
  5b: 8b 44 24 18             mov    0x18(%rsp),%eax
  5f: ff c8                   dec    %eax
  61: 83 f8 03                cmp    $0x3,%eax
  64: 77 24                   ja     8a <_Z5test1v+0x8a>
  66: ff 24 c5 00 00 00 00    jmpq   0x0(,%rax,8)
  6d: 48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  72: e8 00 00 00 00          callq  77 <_Z5test1v+0x77>
  77: eb 09                   jmp    82 <_Z5test1v+0x82>
  79: 48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
  80: 00 00
  82: c7 44 24 18 00 00 00    movl   $0x0,0x18(%rsp)
  89: 00
  8a: 48 89 df                mov    %rbx,%rdi
  8d: e8 00 00 00 00          callq  92 <_Z5test1v+0x92>
  92: 66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  99: 1f 84 00 00 00 00 00

This is highly unfortunate, because monad is the base class of future, and therefore this promise-future sequence:

extern BOOST_SPINLOCK_NOINLINE int test1()
{
  using namespace boost::spinlock::lightweight_futures;
  promise p;
  p.set_value(5);
  future f(p.get_future());
  return f.get();
}

... which under GCC correctly turns into:

0000000000000010 <_Z5test1v>:
  10: b8 05 00 00 00          mov    $0x5,%eax
  15: c3                      retq

... under clang 3.6 and 3.7 most unfortunately turns into:

0000000000000000 <_Z5test1v>:
   0: 53                      push   %rbx
   1: 48 83 ec 50             sub    $0x50,%rsp
   5: c7 44 24 40 00 00 00    movl   $0x0,0x40(%rsp)
   c: 00
   d: c6 44 24 48 00          movb   $0x0,0x48(%rsp)
  12: c7 44 24 2c 05 00 00    movl   $0x5,0x2c(%rsp)
  19: 00
  1a: 48 8d 5c 24 30          lea    0x30(%rsp),%rbx
  1f: 48 8d 74 24 2c          lea    0x2c(%rsp),%rsi
  24: 48 89 df                mov    %rbx,%rdi
  27: e8 00 00 00 00          callq  2c <_Z5test1v+0x2c>
  2c: 48 8d 3c 24             lea    (%rsp),%rdi
  30: 48 89 de                mov    %rbx,%rsi
  33: e8 00 00 00 00          callq  38 <_Z5test1v+0x38>
  38: 48 8d 3c 24             lea    (%rsp),%rdi
  3c: e8 00 00 00 00          callq  41 <_Z5test1v+0x41>
  41: 89 c3                   mov    %eax,%ebx
  43: 48 8d 3c 24             lea    (%rsp),%rdi
  47: e8 00 00 00 00          callq  4c <_Z5test1v+0x4c>
  4c: 48 8d 7c 24 30          lea    0x30(%rsp),%rdi
  51: e8 00 00 00 00          callq  56 <_Z5test1v+0x56>
  56: 89 d8                   mov    %ebx,%eax
  58: 48 83 c4 50             add    $0x50,%rsp
  5c: 5b                      pop    %rbx
  5d: c3                      retq
  5e: 48 89 c3                mov    %rax,%rbx
  61: eb 0c                   jmp    6f <_Z5test1v+0x6f>
  63: 48 89 c3                mov    %rax,%rbx
  66: 48 8d 3c 24             lea    (%rsp),%rdi
  6a: e8 00 00 00 00          callq  6f <_Z5test1v+0x6f>
  6f: 48 8d 7c 24 30          lea    0x30(%rsp),%rdi
  74: e8 00 00 00 00          callq  79 <_Z5test1v+0x79>
  79: 48 89 df                mov    %rbx,%rdi
  7c: e8 00 00 00 00          callq  81 <_Z5test1v+0x81>
  81: 66 66 66 66 66 66 2e    data32 data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  88: 0f 1f 84 00 00 00 00
  8f: 00

... which I should imagine would be quite a performance penalty.
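
To make concrete why the promise-future sequence should fold away just as the plain monad does, here is a hedged continuation of the toy_monad sketch above (again illustrative names only, not the Boost.Spinlock implementation, and composing the monad rather than inheriting from it as the real future does): the promise owns the monad inline, get_future() simply moves that state into the future, nothing touches the heap, and so with the value in view of the optimiser the whole chain is dead data flow.

// Continuation of the illustrative toy_monad sketch above; these are
// NOT the real lightweight_futures types. Every piece of state lives
// on the stack and is visible to the optimiser, which is why the whole
// round trip can in principle reduce to two opcodes.
#include <utility>

template <class T> class toy_future
{
  toy_monad<T> _storage;  // value held inline, no shared state block

public:
  explicit toy_future(toy_monad<T> m) : _storage(std::move(m)) {}
  T get() const { return _storage.get(); }
};

template <class T> class toy_promise
{
  toy_monad<T> _storage;  // inline, never heap allocated, no locks

public:
  void set_value(T v) { _storage = toy_monad<T>(std::move(v)); }
  toy_future<T> get_future() { return toy_future<T>(std::move(_storage)); }
};

int test1_pair_sketch()
{
  toy_promise<int> p;
  p.set_value(5);
  toy_future<int> f(p.get_future());
  return f.get();  // ideally folds to "return 5", as GCC achieves above
}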

I asked clang to -save-temps; the dumps for all the unit tests, along with the command options used, can be found at:

clang 3.5:

https://drive.google.com/file/d/0B5QDPUNHLpKMcTJXd2lqZ1lKNTA/view?usp=sharing

clang 3.6:

https://drive.google.com/file/d/0B5QDPUNHLpKMQ1g1SU9WbUJiWWc/view?usp=sharing

clang 3.7:

https://drive.google.com/file/d/0B5QDPUNHLpKMaUNNWXhqSi1oM3c/view?usp=sharing

Niall

1feda113-18e2-42e3-8624-9c17d4d32ec3 commented 9 years ago

More useful information for this bug report: for the monadic transport alone, here are the best-case and worst-case opcode generation results:

clang 3.7:
    59 opcodes <= Value transport     <= 37 opcodes
     7 opcodes <= Error transport     <= 52 opcodes
    38 opcodes <= Exception transport <= 39 opcodes

GCC 5.1:
     1 opcodes <= Value transport     <= 113 opcodes
     8 opcodes <= Error transport     <= 119 opcodes
    22 opcodes <= Exception transport <= 214 opcodes

VS2015:
     4 opcodes <= Value transport     <= 1881 opcodes
     6 opcodes <= Error transport     <= 164 opcodes
  1946 opcodes <= Exception transport <= 1936 opcodes

The maximum is measured by taking a monad in from a non-visible source, where the compiler has to generate all code paths to handle an unknown (variant) input state; the minimum is measured by setting a monad's state in full view of the compiler's optimiser, so that it can usually elide the generated opcodes completely (though note that this varies enormously from compiler to compiler, to the extent that the known-state code can generate more opcodes than the unknown-state code). From the optimiser's perspective, in the minimum case it can elide forwards all code paths which could never be executed anyway, whilst in the maximum case it can elide backwards all code paths on the basis that the only two possible outcomes are fetching the value/error/exception or throwing an exception.
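
To illustrate the two measurement shapes (using the toy_monad from the earlier sketch rather than the real lightweight_futures monad, and a hypothetical make_opaque_monad() as the non-visible source), the minimum case constructs the monad in full view of the optimiser, whereas the maximum case receives it from an opaque source so every state-handling path must be emitted:

// Illustrative only, reusing the toy_monad sketch from earlier.

// "Minimum" shape: the monad's state is set in full view of the
// optimiser, so every unreachable path (empty/error/exception
// handling) can be elided forwards, ideally down to a lone constant.
int known_state_sketch()
{
  toy_monad<int> m(42);
  return m.get();
}

// "Maximum" shape: the monad arrives from a non-visible source, so the
// compiler must emit code for every possible (variant) input state and
// can only optimise backwards from the two permitted outcomes:
// returning the value or throwing.
extern toy_monad<int> make_opaque_monad();  // hypothetical opaque source

int unknown_state_sketch()
{
  toy_monad<int> m = make_opaque_monad();
  return m.get();
}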

In terms of maximum opcodes generated, clang 3.7 does very well indeed, the best of any of the compilers, showing that it optimises backwards very well. GCC 5.1 does best on minimum opcodes generated, showing that it optimises forwards very well. VS2015 is somewhere between great and terrible, though it has little choice once an exception_ptr is ever touched, as Microsoft requires some unfortunately very heavy memory effects for backward ABI compatibility.

If you could get clang's minimum case (forward optimisation) as good as GCC's, that would be enormous.

Niall

1feda113-18e2-42e3-8624-9c17d4d32ec3 commented 9 years ago

An update to this bug: the Microsoft Visual Studio team have very kindly helped me persuade MSVC to collapse these lightweight future-promises into no opcodes of output. The current stats for x64 opcodes generated by each of these tests are as follows:

                                      GCC 5.1   VS2015   clang 3.7
future_construct_destruct.cpp               0        3           7
future_construct_move_destruct.cpp          0        3          18
monad_construct_destruct.cpp                0        3           0
monad_construct_value_destruct.cpp          1        4          33
promise_construct_destruct.cpp              0        3           6
promise_construct_move_destruct.cpp         0        3          18
promise_future_reduce.cpp                   1     3111          32

GCC has perfect results. The terrible result for promise_future_reduce on VS2015 is hoped to be fixed by Microsoft shortly after RTM.

This now makes clang 3.7 look particularly poor relative to GCC and MSVC.

I'll give you fair warning now: I expect to present on these at C++ conferences in Bristol and Seattle from Spring 2016 onwards. If clang is still performing poorly relative to the others, I'll say so publicly, and with benchmarks.

Niall