clang-cl can not support function targets

carewolf commented 2 years ago

The immintrin.h headers has a bug where it does not include sub-arch intrinsics headers if _MSC_VER is defined unless the subarch target is current.

This is inconsistent with MSVC which always defines intrinsics regardless of active arch, and is also inconsistent with normal-clang (gcc-style) which defines them for use with subarch function targets. The result is there is no way of using intrinsics for subarch targets with clang-cl, since neither gcc nor msvc style works.

This has forced us to disable many optimizations in Qt with clang-cl, see https://bugreports.qt.io/browse/QTBUG-88081, https://bugreports.qt.io/browse/QTBUG-88434 and https://bugreports.qt.io/browse/QTBUG-98253

I suggest at least allowing gcc style, and the intrinsics working in target attributed functions, if you can't support the MSVC style.

zero9178 commented 2 years ago

This is indeed a nasty divergence from MSVC and one that IIRC is actually quite hard to fix as its also an issue in LLVM, not just clang. There is a workaround however as you can use the /clang: flag to pass GNU style command line options. That way you can eg. pass /clang:-mrdseed to enable that sub-arch feature.

llvmbot commented 2 years ago

@llvm/issue-subscribers-clang-codegen

carewolf commented 2 years ago

That won't help. The problem as I see it is that immintrin.h does not include non-current subarch targets when _MSC_VER is defined. I think if that was removed it still wouldn't work as MSVC does, but it would atleast work in functions with an appropiate target function attribute (clang/gcc style).

zero9178 commented 2 years ago

I am not quite sure as to how my above suggestion differs from clang/gcc style. Using the GCC style -m options will lead to the definition of the various feature test macros and therefore the inclusion of the sub arch headers despite _MSC_VER being defined. As an example from Qt source I wrote:

#include <immintrin.h>

int main()
{
        unsigned int value;
        _rdrand32_step(&value);
}

and then compiled it using clang-cl test.cpp /clang:-mrdrnd. As far as I could tell this should work with all the intrinsics.

But yes this won't match MSVC behaviour, which is clearly a bug. The above would simply serve as a workaround

carewolf commented 2 years ago

I think we are talking past each other. I am talking about using intrinsics without having similar commandline flags for runtime target detection

MSVC style:

#include <immintrin.h>
void foo_avx2(args) {
 _mm256_avx_command();
}

clang/gcc style:

#include <immintrin.h>
__attribute__(__target__("arch=haswell")) // or __attribute__(__target__("haswell")), can't remember which is gcc and which is clang
void foo_avx2(args) {
 _mm256_avx_command();
}

This is for runtime detection of CPU feature, not anything enabled at compile time.

In MSVC non-target intrinsics always works (somehow), in clang/gcc they work if appropriate function target has been declared where used. In clang-cl neither works because the compile-time flags are checked before defining the functions in the immintrin.h header.

wangwenx190 commented 2 years ago

Please fix this bug in clang-cl, this issue is really annoying.

ThiagoIze commented 2 years ago

I think this has already been explained by Allan, but just to make sure it's clear. This bug is preventing us from writing a single Windows binary that can be optimized for various architectures. For instance, we'd like the binary to work on older CPUs that don't have AVX and on newer machines with AVX we can run AVX optimized functions instead of the generic SSE2 functions. We might even have optimizations for AVX-512.

Fixing this can give applications a 4x speedup on Windows while still allowing the application to run on older CPUs. It's not everyday a compiler can do something that gives a 4x speedup, so I think this should be a high priority to fix.

wangwenx190 commented 2 years ago

@ThiagoIze is right, this is performance critical for many applications, please prioritize this issue.

wangwenx190 commented 2 years ago

Any progress on this?

wangwenx190 commented 2 years ago

🤔

thiagomacieira commented 1 year ago

Does anyone know what the limitation is? If I copy the definitions from avxintrin.h:

#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("avx"), __min_vector_width__(256)))
typedef int __v8si __attribute__ ((__vector_size__ (32)));
typedef long long __m256i __attribute__((__vector_size__(32), __aligned__(32)));
typedef long long __m256i_u __attribute__((__vector_size__(32), __aligned__(1)));

static __inline __m256i __DEFAULT_FN_ATTRS
_mm256_set_epi32(int __i0, int __i1, int __i2, int __i3,
                 int __i4, int __i5, int __i6, int __i7)
{
  return __extension__ (__m256i)(__v8si){ __i7, __i6, __i5, __i4, __i3, __i2, __i1, __i0 };
}
static __inline __m256i __DEFAULT_FN_ATTRS
_mm256_set1_epi32(int __i)
{
  return _mm256_set_epi32(__i, __i, __i, __i, __i, __i, __i, __i);
}
static __inline void __DEFAULT_FN_ATTRS
_mm256_storeu_si256(__m256i_u *__p, __m256i __a)
{
  struct __storeu_si256 {
    __m256i_u __v;
  } __attribute__((__packed__, __may_alias__));
  ((struct __storeu_si256*)__p)->__v = __a;
}

__attribute__((target("avx2"))) void fill(void *ptr, int n)
{
    __m256i v = _mm256_set1_epi32(n);
    _mm256_storeu_si256((__m256i *)ptr, v);
}

it compiles and works just fine (LLVM 15.0.6):

$ clang-cl -c -O2 test.cpp
$ objdump -dr test.obj    

test.obj:     file format pe-x86-64

Disassembly of section .text:

0000000000000000 <?fill@@YAXPEAXH@Z>:
   0:   c5 f9 6e c2             vmovd  %edx,%xmm0
   4:   c4 e2 7d 58 c0          vpbroadcastd %xmm0,%ymm0
   9:   c5 fe 7f 01             vmovdqu %ymm0,(%rcx)
   d:   c5 f8 77                vzeroupper
  10:   c3                      ret

wangwenx190 commented 1 year ago

I've encountered with another similar issue, which prevents me from building Qt using clang-cl: https://bugreports.qt.io/browse/QTBUG-113231. However, I've fixed it and the workaround is really simple. For some unknown reason, clang-cl can't find some intrin functions, so I modified immintrin.h to let the corresponding headers be included unconditionally. And then the compilation error is solved. I'm not sure what I'm doing is correct or not, but at least the compilation now goes smoothly without any errors and the generated binary file also seem to work fine.

carewolf commented 1 year ago

I've encountered with another similar issue, which prevents me from building Qt using clang-cl: https://bugreports.qt.io/browse/QTBUG-113231. However, I've fixed it and the workaround is really simple. For some unknown reason, clang-cl can't find some intrin functions, so I modified immintrin.h to let the corresponding headers be included unconditionally. And then the compilation error is solved. I'm not sure what I'm doing is correct or not, but at least the compilation now goes smoothly without any errors and the generated binary file also seem to work fine.

No I suspect that is the correct solution. The stupid include breaks are the issue, and it makes no sense they are there. Just nobody in the project have tried doing it. So it probably takes somebody outside of LLVM to fix this bug.

thiagomacieira commented 1 year ago

Those were added for a reason. The question is whether that reason is still valid. I suspect it isn't: the reason must have been that the __attribute__((target(xxxx))) didn't work in previous versions and that has since been corrected.

wangwenx190 commented 1 year ago

These header guards were added in 379a195, which is included since llvm 3.9.0 (2016)

AaronBallman commented 1 year ago

Those were added for a reason. The question is whether that reason is still valid. I suspect it isn't: the reason must have been that the __attribute__((target(xxxx))) didn't work in previous versions and that has since been corrected.

They were added because including this header without them induces ~10-30% compile time overhead, which you often have no say in because it's included by system headers.

CC @nico for awareness

thiagomacieira commented 1 year ago

I prefer to pay the penalty of 10 to 30% slowness compared to not being able to compile code that Clang-non-CL and MSVC compile.

AaronBallman commented 1 year ago

It's something we need to solve, but it's not acceptable to introduce that amount of compile time regression when solving it.

thiagomacieira commented 1 year ago

I agree it's something to solve, but disagree that the cost is unacceptable. As I said, this is the difference between "good compiler generates really good code" (hopefully) and "broken compiler, don't even report bugs to us". If there is a good chance that the compilation time slowness will get sufficiently solved in the short term, then the delay is acceptable. Conversely, if there's no chance of that happening soon (too difficult, no one working on it, etc.), then an indefinite delay is not acceptable.

I also don't know much code that includes immintrin.h and family in public headers. They're usually kept in private headers used exclusively for implementations, so the cost in compilation time is reduced. More importantly, they're also the ones that want to use the header and right now can't.

ADKaster commented 1 year ago

I also don't know much code that includes immintrin.h and family in public headers.

According to this comment by @StephanTLavavej Microsoft/STL, immintrin.h is included by intrin.h, which seems to be included in a core STL header.

https://github.com/microsoft/STL/pull/3285#discussion_r1049114792

If I'm reading that right, every TU that uses the C++ standard library on Windows includes this header? That seems to suggest that a performance impact as high as 10-30% would be quite unacceptable...

Unless I'm misreading the STL code, of course. 😅

thiagomacieira commented 1 year ago

Ah, I see. They're probably using intrinsics for and similar functionality. I think I've seen other uses too in their headers, like in https://github.com/microsoft/STL/blob/091cad2eaaa5bc25873eb7261cae57ab123592f3/stl/inc/bit#L144-L145.

libstdc++ and libc++ usually use the __builtin_ type intrinsics which are always pre-defined and forego including the intrinsic headers. That same STL header has such a case: https://github.com/microsoft/STL/blob/091cad2eaaa5bc25873eb7261cae57ab123592f3/stl/inc/bit#L35-L37

Anyway, this does mean the impact of changing immintrin.h is much higher than I'd thought.

AaronBallman commented 1 year ago

Anyway, this does mean the impact of changing immintrin.h is much higher than I'd thought.

Yup, that's why I was saying the slowdown was not acceptable -- it impacts roughly everything compiled on Windows, which makes this tricky to resolve. That said, I think we need a solution of some kind.

thiagomacieira commented 1 year ago

BTW, do you know if the slowdown is caused by the presence of the __attribute__ or if it is the number of functions defined in that header in the first place? In other words, does one suffer from this slow-down when using /arch:AVX512?

AaronBallman commented 1 year ago

BTW, do you know if the slowdown is caused by the presence of the __attribute__ or if it is the number of functions defined in that header in the first place? In other words, does one suffer from this slow-down when using /arch:AVX512?

It's the size of the header file, I believe.

PS F:\source\llvm-project> Measure-Command { .\llvm\out\build\x64-Debug\bin\clang-cl.exe /c "C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp" }
C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp(27,5): error: unknown type name '__m256i'
    __m256i v = _mm256_set1_epi32(n);
    ^
C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp(27,17): error: use of undeclared identifier
      '_mm256_set1_epi32'; did you mean '_mm_set1_epi32'?
    __m256i v = _mm256_set1_epi32(n);
                ^~~~~~~~~~~~~~~~~
                _mm_set1_epi32
F:\source\llvm-project\llvm\out\build\x64-Debug\lib\clang\17\include\emmintrin.h(3618,46): note: '_mm_set1_epi32'
      declared here
static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_set1_epi32(int __i) {
                                             ^
C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp(28,28): error: unknown type name '__m256i'
    auto dst = static_cast<__m256i *>(ptr);
                           ^
C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp(30,26): error: arithmetic on a pointer to void
        _mm256_storeu_si256(dst + i / sizeof(v), v);
                            ~~~ ^
C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp(31,19): error: arithmetic on a pointer to void
    fill_tail(dst + i / sizeof(v), len - i, n);
              ~~~ ^
5 errors generated.

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 4
Milliseconds      : 480
Ticks             : 44800486
TotalDays         : 5.18524143518519E-05
TotalHours        : 0.00124445794444444
TotalMinutes      : 0.0746674766666667
TotalSeconds      : 4.4800486
TotalMilliseconds : 4480.0486

PS F:\source\llvm-project> Measure-Command { .\llvm\out\build\x64-Debug\bin\clang-cl.exe /c /arch:AVX512 "C:\Users\aballman\OneDrive - Intel Corporation\Desktop\test.cpp" }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 16
Milliseconds      : 690
Ticks             : 166907541
TotalDays         : 0.000193180024305556
TotalHours        : 0.00463632058333333
TotalMinutes      : 0.278179235
TotalSeconds      : 16.6907541
TotalMilliseconds : 16690.7541

thiagomacieira commented 1 year ago

Well, one of them had code generation and the other one didn't, so not exactly a fair comparison. Testing just the parsing in an empty file:

$ cat test.cpp
#include <immintrin.h>
$ clang-cl --version
clang version 15.0.1
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\Llvm\x64\bin
$ time sh -c 'for ((i=0;i<10;++i)); do clang-cl -c test.cpp; done'
sh -c 'for ((i=0;i<10;++i)); do clang-cl -c test.cpp; done'  0.04s user 0.29s system 38% cpu 0.860 total
$ time sh -c 'for ((i=0;i<10;++i)); do clang-cl -c -march=sapphirerapids test.cpp; done'
sh -c   0.00s user 0.29s system 13% cpu 2.137 total

Anecdotal evidence is a 2.5x.

thiagomacieira commented 1 year ago

And testing as incremental cost:

$ cat test.cpp
#include <immintrin.h>
#include <algorithm>
#include <windows.h>
$ time sh -c 'for ((i=0;i<10;++i)); do clang-cl -c test.cpp; done'
sh -c 'for ((i=0;i<10;++i)); do clang-cl -c test.cpp; done'  0.03s user 0.21s system 5% cpu 4.165 total
$ time sh -c 'for ((i=0;i<10;++i)); do clang-cl -c -march=sapphirerapids test.cpp; done'
sh -c   0.01s user 0.25s system 4% cpu 5.580 total

On a proportional basis, this dropped to 1.34x. On absolute terms, it went from 1277 ms to 1415 ms. I can reliably get the same numbers, otherwise I'd call this simply noise.

AaronBallman commented 1 year ago

Well, one of them had code generation and the other one didn't, so not exactly a fair comparison. Testing just the parsing in an empty file:

Oh derp, I was thinking /c was -fsyntax-only but you're right, it still produces object code.

thiagomacieira commented 1 year ago

Comparing to regular Clang:

$ clang --version
clang version 16.0.2
Target: x86_64-w64-windows-gnu
Thread model: posix
InstalledDir: C:/msys/mingw64/bin
$ time sh -c 'for ((i=0;i<10;++i)); do clang -c test.cpp; done'
sh -c 'for ((i=0;i<10;++i)); do clang -c test.cpp; done'  0.06s user 0.20s system 5% cpu 4.663 total
$ time sh -c 'for ((i=0;i<10;++i)); do clang -c -march=sapphirerapids test.cpp; done'
sh -c 'for ((i=0;i<10;++i)); do clang -c -march=sapphirerapids test.cpp; done  0.00s user 0.15s system 3% cpu 4.573 total

There's no slow-down when adding the option, but it wouldn't be expected anyway because the entire header gets parsed.

Unfortunately, this is a different build of LLVM and a different C library, so the numbers aren't directly comparable.

LuoYuanke commented 1 year ago

Maybe allow user to specify a macro to include all intrinsics files.

#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
    defined(__AVX2__) || defined(_INCLUDE_ALL_INTRINSICS)
#include <avx2intrin.h>
#endif

thiagomacieira commented 1 year ago

That has the problem that it wouldn't work if immintrin.h has already been included, which it would be if MS STL is using it. An upgrade to STL may break existing code because of that.

wangwenx190 commented 1 year ago

I removed all the header guards in that file and the compilation speed is indeed significantly slow when building Qt. However, it seems it's still faster than MSVC, at least not slower. I think even if it becomes 1000x slower, it's still a lot better than compilation errors ...

thiagomacieira commented 1 year ago

Significantly slower than what? The code doesn't compile at all without that change, so there isn't much you can compare to.

wangwenx190 commented 1 year ago

Significantly slower than what? The code doesn't compile at all without that change, so there isn't much you can compare to.

In fact, before clang-cl report that error, it can still compile hundreds of files, mostly the bundled 3rd party libraries, Qt's bootstrap library and some command line tools such as moc. I observed the file names flash very fast and the compiled file count increased really fast before I patch immintrin.h, however, after I patch it, the compilation speed decreased so much that I can see the slow down by my eyes, but I think it's not worse than MSVC.

bebuch commented 1 year ago

I ran into this today too. A fix would be great.

echristo commented 1 year ago

@rnk as well.

bebuch commented 11 months ago

Should this issue have a bug label? After all, the behavior does not correspond to the expected behavior and it prevents the use of clang-cl under Windows for various libraries. (Qt for example.)

MaxEW707 commented 10 months ago

This is still broken in mainline clang.

#if !(defined(_MSC_VER) || defined(__SCE__)) || __has_feature(modules) ||      \
    defined(__AVX__)
#include <avxintrin.h>
#endif

I just ran into this when porting our internal projects to clang-cl. This breaks all assumptions that every cross-platform software has about the SSE/AVX includes since MSVC, GCC and Clang all have the same behaviour for these includes except ClangCL which differs. Side note that __SCE__ compiler also works as expected since we can guarantee those platforms have AVX support and thus __AVX__ is always defined but that is as much as I'll comment on there.

Also not all projects use the STL and actively avoid vendor STLs for one of the reasons outlined above; the insane amount of header includes. While I understand the reasoning about compile-times with MSVC STL, we actively internally do many tricks to also wrangle compile-times, it isn't a valid assumption for all software projects that use clang since the usage of STL isn't ubiquitous.

In the interim I can work around it by using the __builtin_* functions where I can but that isn't a solution as a whole.

Anyways when I am free this weekend assuming there isn't already a PR up to fix this I am going to get one up to try to push a solution forward that satisfies the concerns around MSVC STL and users who need to be able to do run-time detection of cpu support for SSE/AVX.

rnk commented 10 months ago

+@zmodem @nico

We should revisit this, it is unfortunate that the only way to use Intel intrinsics with clang-cl is to add additional command line flags. Both GCC and MSVC can call these intrinsics with only local source changes, either via target attributes or simply directly calling various AVX intrinsics.

I believe Intel's last proposal for addressing the compile time concerns was to ship a module map for Clang builtin intrinsic headers, but I think that hasn't advanced because folks are concerned about establishing a hard dependency on Clang header modules. They interfere with pre-processing, crash reproduction, and distributed build systems, and are not entirely aligned with C++ standard modules.

Perhaps another avenue for addressing the compile time concerns would be go down the path of providing an intrin0.h file similar to MSVC, which declares the minimal set of intrinsics that the MSVC STL needs, and then we could allow immintrin.h to be the expensive, catch-all, umbrella header that Intel seems to want it to be.

MaxEW707 commented 10 months ago

Perhaps another avenue for addressing the compile time concerns would be go down the path of providing an intrin0.h file similar to MSVC

That sounds good to me. I'll try to give it a whirl this weekend and get a PR up if someone else doesn't beat me to it :).

then we could allow immintrin.h to be the expensive, catch-all, umbrella header that Intel seems to want it to be.

This is actually a major compile-time cost for us. Includes headers including all previous headers such as emmintrin.h including xmmintrin.h. Most of the untangling there can be accomplished by forward declaring the vector types which is easy with clang due to the vector size attribute. For example emmintrin.h only needs __m128 for the conversion functions. So it is included just to get the vector typedef which then includes mmintrin.h and mm_malloc.h.

Especially true in games where a lot of these headers end up being included from your math library that is mostly only header files to ensure functions have a chance to be inlined which basically bloats every source file. I am in the process of writing our own SSE headers internally to combat some of this but the less platform/toolchain specific ifdefs that are required the better in my opinion.

The PR I was going to get up would allow users to include clang specific isolated headers such as avxintrin.h directly. The intel headers can still do whatever intel desires but if a user targets clang then they can pick exactly what they desire without all the transitive includes.

The git blame for, https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avxintrin.h#L11, shows that these checks were added since gcc has this behaviour. However removing these checks still allows code following gcc semantics to compile on clang. The inverse isn't true where code built solely for clang will not immediately work on gcc but I think that is a fine concession considering all the already clang specific attributes and behaviours.

My 2c.

carewolf commented 10 months ago

I would prefer if we could use different defines or compile time flags. In a project supporting multiple compilers that is preferable over having different includes.

Could we just have a define that unlocks non-target intrinsics? -D_CLANG_CL_INTRINSICS?

llvm / llvm-project

clang-cl can not support function targets #53520