llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

clang -O3 does not produce best blend code #18977

Open llvmbot opened 10 years ago

llvmbot commented 10 years ago
Bugzilla Link 18603
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor
CC @atrick

Extended Description

-O3 seems to assume an out-of-order architecture (GCC does not). An in-order architecture would be a better assumption; IMHO -O3 ought to produce the "best blend" that runs well on all processors. If, e.g., -march=atom is given, an in-order architecture is assumed. Example:

int test(int x) { return ((x>>2) & 15) ^ ((x>>3) & 31); }

=> clang -O3
    movl %edi, %eax
    shrl $2, %eax
    andl $15, %eax    //
    shrl $3, %edi     // exchange
    andl $31, %edi
    xorl %eax, %edi
    movl %edi, %eax
    retq

=> clang -O3 -march=atom
    movl %edi, %eax
    shrl $3, %edi
    shrl $2, %eax
    andl $31, %edi
    andl $15, %eax
    xorl %eax, %edi
    movl %edi, %eax
    retq
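To make the difference between the two listings concrete, here is a rough sketch of a toy in-order issue model. Everything in it is an assumption made for illustration: a two-register machine, unit latency for every instruction, and an issue width passed in as a parameter. The names Insn, clang_o3, clang_atom and inorder_cycles are hypothetical, and the cycle counts are not real Atom timings.

```c
#include <assert.h>

/* Toy model only: two registers, unit latency, no memory, no flags. */
enum { EAX, EDI, NREGS };

typedef struct {
    int src1; /* first source register */
    int src2; /* second source register, or -1 if none */
    int dst;  /* destination register */
} Insn;

/* Instruction order from "clang -O3" in the report (retq omitted). */
const Insn clang_o3[] = {
    {EDI, -1, EAX},  /* movl %edi, %eax */
    {EAX, -1, EAX},  /* shrl $2, %eax   */
    {EAX, -1, EAX},  /* andl $15, %eax  */
    {EDI, -1, EDI},  /* shrl $3, %edi   */
    {EDI, -1, EDI},  /* andl $31, %edi  */
    {EAX, EDI, EDI}, /* xorl %eax, %edi */
    {EDI, -1, EAX},  /* movl %edi, %eax */
};

/* Instruction order from "clang -O3 -march=atom" (retq omitted). */
const Insn clang_atom[] = {
    {EDI, -1, EAX},  /* movl %edi, %eax */
    {EDI, -1, EDI},  /* shrl $3, %edi   */
    {EAX, -1, EAX},  /* shrl $2, %eax   */
    {EDI, -1, EDI},  /* andl $31, %edi  */
    {EAX, -1, EAX},  /* andl $15, %eax  */
    {EAX, EDI, EDI}, /* xorl %eax, %edi */
    {EDI, -1, EAX},  /* movl %edi, %eax */
};

/* Cycles needed to issue prog[0..n) strictly in program order, with at
 * most `width` instructions per cycle. An instruction whose source is
 * not ready stalls itself and everything behind it. */
int inorder_cycles(const Insn *prog, int n, int width)
{
    int ready[NREGS] = {0}; /* first cycle in which each register can be read */
    int cycle = 0;          /* cycle currently being filled */
    int used = 0;           /* issue slots already taken in this cycle */

    assert(width > 0);
    for (int i = 0; i < n; i++) {
        int need = ready[prog[i].src1];
        if (prog[i].src2 >= 0 && ready[prog[i].src2] > need)
            need = ready[prog[i].src2];
        if (used == width) { cycle++; used = 0; }      /* no free slot left */
        if (need > cycle)  { cycle = need; used = 0; } /* wait for operands */
        ready[prog[i].dst] = cycle + 1;                /* unit latency */
        used++;
    }
    return n > 0 ? cycle + 1 : 0;
}
```

Under these assumptions, inorder_cycles(clang_o3, 7, 2) gives 6 and inorder_cycles(clang_atom, 7, 2) gives 5. Interleaving the two independent shift/mask chains lets a two-wide in-order machine pair them, while the default order leaves one issue slot idle for most of the sequence.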

atrick commented 10 years ago

This bug should be "clang does not have a blended machine model for x86_64."

clang at -O3 guesses the target from the host machine. If you don't want that, you need to use -march=XX.

So of course clang will not attempt to schedule the code for an in-order Atom CPU by default if you're not building on Atom. Cases certainly exist where you could improve the code for a particular microarchitecture without harming others, but we don't have the machine model or heuristics to accomplish that. gcc's scheduler may luckily(?) decide to reschedule instructions, not because it is aware that doing so will help Atom, but because it thinks it could improve performance on the expected target processor. That's just a guess on my part.
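As a rough illustration of a reordering that helps one microarchitecture without harming another, the sketch below estimates the cost of the reporter's two instruction orders on an idealized out-of-order core. It assumes perfect register renaming, unlimited issue width and unit latency, so only true data dependences matter. The names Insn, clang_o3, clang_atom and dataflow_cycles are hypothetical, and this is not a model of any real Intel core.

```c
#include <assert.h>

/* Toy model only: two registers, unit latency, no memory, no flags. */
enum { EAX, EDI, NREGS };

typedef struct {
    int src1; /* first source register */
    int src2; /* second source register, or -1 if none */
    int dst;  /* destination register */
} Insn;

/* Instruction order from "clang -O3" in the report (retq omitted). */
const Insn clang_o3[] = {
    {EDI, -1, EAX},  /* movl %edi, %eax */
    {EAX, -1, EAX},  /* shrl $2, %eax   */
    {EAX, -1, EAX},  /* andl $15, %eax  */
    {EDI, -1, EDI},  /* shrl $3, %edi   */
    {EDI, -1, EDI},  /* andl $31, %edi  */
    {EAX, EDI, EDI}, /* xorl %eax, %edi */
    {EDI, -1, EAX},  /* movl %edi, %eax */
};

/* Instruction order from "clang -O3 -march=atom" (retq omitted). */
const Insn clang_atom[] = {
    {EDI, -1, EAX},  /* movl %edi, %eax */
    {EDI, -1, EDI},  /* shrl $3, %edi   */
    {EAX, -1, EAX},  /* shrl $2, %eax   */
    {EDI, -1, EDI},  /* andl $31, %edi  */
    {EAX, -1, EAX},  /* andl $15, %eax  */
    {EAX, EDI, EDI}, /* xorl %eax, %edi */
    {EDI, -1, EAX},  /* movl %edi, %eax */
};

/* Length of the longest true-dependence chain in prog[0..n), in cycles.
 * Each instruction starts as soon as its most recent producers finish,
 * which is what an idealized out-of-order core with renaming would do. */
int dataflow_cycles(const Insn *prog, int n)
{
    int done[NREGS] = {0}; /* cycle in which the newest value of each register is available */
    int total = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int start = done[prog[i].src1];
        if (prog[i].src2 >= 0 && done[prog[i].src2] > start)
            start = done[prog[i].src2];
        done[prog[i].dst] = start + 1; /* unit latency */
        if (start + 1 > total)
            total = start + 1;
    }
    return total;
}
```

In this idealized model both orders come out at 5 cycles, because the longest chain (mov, shr, and, xor, mov) is the same either way. That is consistent with the idea that the Atom-friendly order costs nothing on an out-of-order core for this particular function; it says nothing about larger code, where the trade-offs can differ.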

If it is important to someone to build a hybrid machine model, then that could be done, e.g. -march=blended_intel. New scheduling heuristics could even be plugged in as a separate scheduling strategy. However, we do not want to add complexity and overhead to the normal compilation path, which simply targets the expected machine.

Honestly, this seems like a very strange goal to me anyway. In-order Atom + Silvermont + SandyBridge/Haswell are completely different microarchitectures. I'm not sure how a blended model would be any different than "-march=atom", which has the most constraints.

Generally speaking:

If someone wants the best performance on hardware, they should use -march=mycpu.

If someone wants to build a binary that runs everywhere with mediocre performance, they should use -march=generic.

If a vendor had a family of microarchitectures that were closely related but highly sensitive to scheduling, it might make sense to have a blended model. As I see it, that just isn't the current situation with Intel. Instead, as we move to new generations we end up enabling new processor features. So a blended model is really the same as targeting the oldest one.