llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[feature-request] make vector intrinsics constexpr #30794

Open 54aefcd4-c07d-4252-8441-723563c8826f opened 7 years ago

54aefcd4-c07d-4252-8441-723563c8826f commented 7 years ago
Bugzilla Link 31446
Version trunk
OS All
Depends On llvm/llvm-project#20531 llvm/llvm-project#41806
CC @Bigcheese,@topperc,@erichkeane,@filcab,@RKSimon,@zygoloid,@rotateright

Extended Description

Linear algebra libraries like Eigen3 explicitly vectorize their code. However, because vector intrinsics (SSE, AVX,...) are not constexpr, it is impossible for them to provide an interface that can easily be used both at compile-time and run-time.

Duplicating all their code for running at compile-time is not an option.

A first step towards making these libraries usable within constant expressions would be to mark the vector intrinsics constexpr and allow their evaluation at compile time.
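
To illustrate the kind of interface this would unlock, here is a minimal sketch; it is hypothetical and does not compile today, precisely because _mm_set1_ps, _mm_mul_ps and _mm_add_ps are not constexpr:

#include <xmmintrin.h>

// Hypothetical: only well-formed if the SSE intrinsics below were constexpr.
constexpr __m128 axpy1(__m128 x, float a) {
    return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a), x), x);  // a*x + x
}

// Desired usage, with one implementation shared by both contexts:
//   constexpr __m128 c = axpy1(_mm_set1_ps(1.0f), 2.0f);  // compile time
//   __m128 r = axpy1(v, 2.0f);                            // run time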

RKSimon commented 2 years ago

mentioned in issue llvm/llvm-project#41806

RKSimon commented 4 years ago

A small initial demo patch: https://reviews.llvm.org/D86229

RKSimon commented 4 years ago

To add constexpr tags to the intrinsic headers we need to ensure C builds can successfully ignore it - would it make sense to add support for a constexpr style tag to clang? I don't know if we have an equivalent to this already.

An alternative would be to add a #define wrapped inside a __cplusplus / __has_feature(cxx_constexpr) check - we already have the __DEFAULT_FN_ATTRS local defines which we could hijack + add constexpr variants for C++ builds.
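
A sketch of what such a wrapper define might look like (the __DEFAULT_FN_ATTRS_CONSTEXPR name is illustrative, not something the headers define today):

/* Sketch: only expand to constexpr in C++ builds that support it; C builds
   (and pre-C++11 builds) get the plain attribute set and ignore the rest. */
#if defined(__cplusplus) && __has_feature(cxx_constexpr)
#define __DEFAULT_FN_ATTRS_CONSTEXPR __DEFAULT_FN_ATTRS constexpr
#else
#define __DEFAULT_FN_ATTRS_CONSTEXPR __DEFAULT_FN_ATTRS
#endif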

RKSimon commented 4 years ago

Some initial constexpr support for vector types landed in D79755/30c16670e42, so we can begin investigating adding the constexpr tag to some basic sse/avx intrinsics that are implemented as generics (add/sub/mul/div/and/or/xor).

The next step would be to add constexpr support for vector initialization and the convertvector/shufflevector builtins.
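
For the generically-implemented intrinsics the change itself would be small; a sketch of what a tagged header entry might look like, reusing the illustrative macro from above:

static __inline__ __m128 __DEFAULT_FN_ATTRS_CONSTEXPR
_mm_add_ps(__m128 __a, __m128 __b)
{
    /* Plain vector addition, no target builtin involved, so the constant
       expression evaluator can handle it once vector types are supported. */
    return (__m128)((__v4sf)__a + (__v4sf)__b);
}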

RKSimon commented 7 years ago

Before even looking at any intrinsics, we'd need to get the vector types working with constexpr:

typedef int __v4si __attribute__((__vector_size__(16)));

constexpr int sum_i32(__v4si x) {
  return x[0] + x[1] + x[2] + x[3];
}

constexpr __v4si add_i32(__v4si x, __v4si y) {
  return x + y;
}

error: constexpr function never produces a constant expression [-Winvalid-constexpr]

subexpression not valid in a constant expression
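
For reference, a minimal sketch of what should eventually be accepted once the vector types themselves work in constant expressions:

constexpr __v4si v = {1, 2, 3, 4};
static_assert(sum_i32(v) == 10, "expected to hold once vectors work in constexpr");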

54aefcd4-c07d-4252-8441-723563c8826f commented 7 years ago

FWIW here is the list of intrinsics used by Eigen3-trunk in case you want to use them to prioritize.

I think, however, that a better solution would be the ability to detect when a constexpr function is being evaluated at compile time and to branch on that (e.g. Eigen3 already has these workarounds available for targets that do not expose the intrinsics), hence I've filed a new bug for this:

https://llvm.org/bugs/show_bug.cgi?id=31917

The list of intrinsics used by Eigen:

_mm256_add_pd _mm256_add_ps _mm256_addsub_pd _mm256_addsub_ps _mm256_and_pd _mm256_and_ps _mm256_andnot_pd _mm256_andnot_ps _mm256_blend_pd _mm256_blend_ps _mm256_blendv_pd _mm256_blendv_ps _mm256_broadcast_pd _mm256_broadcast_ps _mm256_broadcast_sd _mm256_broadcast_ss _mm256_broadcastsd_pd _mm256_castpd256_pd128 _mm256_castpd_ps _mm256_castps128_ps256 _mm256_castps256_ps128 _mm256_castps_pd _mm256_castps_si256 _mm256_castsi128_si256 _mm256_castsi256_pd _mm256_castsi256_ps _mm256_castsi256_si128 _mm256_ceil_pd _mm256_ceil_ps _mm256_cmp_pd _mm256_cmp_ps _mm256_cvtepi32_ps _mm256_cvtpd_epi32 _mm256_cvtph_ps _mm256_cvtps_epi32 _mm256_cvtps_ph _mm256_cvttps_epi32 _mm256_div_pd _mm256_div_ps _mm256_extract_epi16 _mm256_extractf128_pd _mm256_extractf128_ps _mm256_extractf128_si256 _mm256_floor_pd _mm256_floor_ps _mm256_fmadd_pd _mm256_fmadd_ps _mm256_hadd_pd _mm256_hadd_ps _mm256_i32gather_pd _mm256_i32gather_ps _mm256_insertf128_ps _mm256_insertf128_si256 _mm256_load_pd _mm256_load_ps _mm256_load_si256 _mm256_loadu2_m128d _mm256_loadu_pd _mm256_loadu_ps _mm256_loadu_si256 _mm256_max_pd _mm256_max_ps _mm256_min_pd _mm256_min_ps _mm256_movedup_pd _mm256_movehdup_ps _mm256_moveldup_ps _mm256_mul_pd _mm256_mul_ps _mm256_mullo_epi32 _mm256_or_pd _mm256_or_ps _mm256_permute2f128_pd _mm256_permute2f128_ps _mm256_permute2x128_si256 _mm256_permute_pd _mm256_permute_ps _mm256_round_pd _mm256_round_ps _mm256_rsqrt_ps _mm256_set1_epi16 _mm256_set1_epi32 _mm256_set1_pd _mm256_set1_ps _mm256_set_epi16 _mm256_set_epi32 _mm256_set_pd _mm256_set_ps _mm256_setr_epi32 _mm256_setzero_pd _mm256_setzero_ps _mm256_setzero_si256 _mm256_shuffle_pd _mm256_shuffle_ps _mm256_slli_epi32 _mm256_sqrt_pd _mm256_sqrt_ps _mm256_srli_epi32 _mm256_store_pd _mm256_store_ps _mm256_store_si256 _mm256_storeu_pd _mm256_storeu_ps _mm256_storeu_si256 _mm256_sub_pd _mm256_sub_ps _mm256_unpackhi_epi16 _mm256_unpackhi_epi32 _mm256_unpackhi_epi64 _mm256_unpackhi_ps _mm256_unpacklo_epi16 _mm256_unpacklo_epi32 _mm256_unpacklo_epi64 _mm256_unpacklo_ps _mm256_xor_pd _mm256_xor_ps _mm512_abs_ps _mm512_add_epi64 _mm512_add_pd _mm512_add_ps _mm512_and_pd _mm512_and_ps _mm512_and_si512 _mm512_andnot_pd _mm512_andnot_ps _mm512_broadcastsd_pd _mm512_broadcastss_ps _mm512_castsi512_pd _mm512_castsi512_ps _mm512_cmp_pd_mask _mm512_cmp_ps_mask _mm512_cvtepi32_ps _mm512_cvtepu32_epi64 _mm512_cvtpd_epi64 _mm512_cvtph_ps _mm512_cvtps_ph _mm512_cvttps_epi32 _mm512_div_pd _mm512_div_ps _mm512_extractf32x4_ps _mm512_extractf32x8_ps _mm512_extractf64x4_pd _mm512_extracti32x4_epi32 _mm512_floor_ps _mm512_fmadd_pd _mm512_fmadd_ps _mm512_i32gather_pd _mm512_i32gather_ps _mm512_i32scatter_pd _mm512_i32scatter_ps _mm512_insertf32x4 _mm512_insertf32x8 _mm512_insertf64x2 _mm512_insertf64x4 _mm512_load_pd _mm512_load_ps _mm512_load_si512 _mm512_loadu_pd _mm512_loadu_ps _mm512_loadu_si512 _mm512_mask_blend_pd _mm512_mask_blend_ps _mm512_max_pd _mm512_max_ps _mm512_min_pd _mm512_min_ps _mm512_mul_pd _mm512_mul_ps _mm512_mul_round_pd _mm512_mullo_epi32 _mm512_or_pd _mm512_or_ps _mm512_permute_ps _mm512_permutexvar_pd _mm512_permutexvar_ps _mm512_rsqrt14_pd _mm512_rsqrt14_ps _mm512_rsqrt28_ps _mm512_set1_epi32 _mm512_set1_epi64 _mm512_set1_pd _mm512_set1_ps _mm512_set_epi32 _mm512_set_pd _mm512_set_ps _mm512_setzero_pd _mm512_setzero_ps _mm512_shuffle_pd _mm512_shuffle_ps _mm512_slli_epi32 _mm512_slli_epi64 _mm512_sqrt_pd _mm512_sqrt_ps _mm512_srli_epi32 _mm512_store_pd _mm512_store_ps _mm512_storeu_pd _mm512_storeu_ps _mm512_storeu_si512 _mm512_sub_pd _mm512_sub_ps 
_mm512_undefined_pd _mm512_undefined_ps _mm512_unpackhi_pd _mm512_unpackhi_ps _mm512_unpacklo_pd _mm512_unpacklo_ps _mm512_xor_pd _mm512_xor_ps _mm_abs_epi32 _mm_add_epi32 _mm_add_pd _mm_add_ps _mm_add_sd _mm_add_ss _mm_addsub_pd _mm_addsub_ps _mm_alignr_epi8 _mm_and_pd _mm_and_ps _mm_and_si128 _mm_andnot_pd _mm_andnot_ps _mm_andnot_si128 _mm_blend_pd _mm_blend_ps _mm_blendv_epi8 _mm_blendv_pd _mm_blendv_ps _mm_broadcast_ss _mm_castpd_ps _mm_castpd_si128 _mm_castps_pd _mm_castps_si128 _mm_castsi128_pd _mm_castsi128_ps _mm_ceil_pd _mm_ceil_ps _mm_cmpeq_epi32 _mm_cmpeq_pd _mm_cmpeq_ps _mm_cmpge_ps _mm_cmpgt_epi32 _mm_cmpgt_pd _mm_cmpgt_ps _mm_cmple_ps _mm_cmplt_epi32 _mm_cmplt_ps _mm_cmpnge_ps _mm_cvtepi32_pd _mm_cvtepi32_ps _mm_cvtm64_si64 _mm_cvtpd_ps _mm_cvtps_pd _mm_cvtsd_f64 _mm_cvtsi128_si32 _mm_cvtsi64_m64 _mm_cvtsi64_si32 _mm_cvtss_f32 _mm_cvttpd_epi32 _mm_cvttps_epi32 _mm_div_pd _mm_div_ps _mm_extract_epi16 _mm_extract_epi32 _mm_floor_pd _mm_floor_ps _mm_fmadd_pd _mm_fmadd_ps _mm_hadd_epi32 _mm_hadd_pd _mm_hadd_ps _mm_load1_ps _mm_load_pd _mm_load_pd1 _mm_load_ps _mm_load_ps1 _mm_load_sd _mm_load_si128 _mm_load_ss _mm_loaddup_pd _mm_loadh_pi _mm_loadl_epi64 _mm_loadl_pi _mm_loadu_pd _mm_loadu_ps _mm_loadu_si128 _mm_max_epi32 _mm_max_pd _mm_max_ps _mm_max_sd _mm_max_ss _mm_min_epi32 _mm_min_pd _mm_min_ps _mm_min_sd _mm_min_ss _mm_move_sd _mm_move_ss _mm_movedup_pd _mm_movehdup_ps _mm_movehl_ps _mm_moveldup_ps _mm_movelh_ps _mm_mul_epu32 _mm_mul_pd _mm_mul_ps _mm_mul_sd _mm_mul_ss _mm_mullo_epi32 _mm_or_pd _mm_or_ps _mm_or_si128 _mm_permute_ps _mm_prefetch _mm_round_pd _mm_round_ps _mm_rsqrt_ps _mm_set1_epi16 _mm_set1_epi32 _mm_set1_pd _mm_set1_pi16 _mm_set1_ps _mm_set_epi16 _mm_set_epi32 _mm_set_pd _mm_set_pi16 _mm_set_ps _mm_set_ps1 _mm_set_sd _mm_set_ss _mm_setr_epi32 _mm_setzero_pd _mm_setzero_ps _mm_setzero_si128 _mm_shuffle_epi32 _mm_shuffle_pd _mm_shuffle_ps _mm_slli_epi32 _mm_slli_epi64 _mm_sqrt_pd _mm_sqrt_ps _mm_sqrt_ss _mm_srai_epi32 _mm_srli_epi32 _mm_srli_epi64 _mm_store_pd _mm_store_ps _mm_store_si128 _mm_storel_pi _mm_storeu_pd _mm_storeu_ps _mm_storeu_si128 _mm_sub_epi32 _mm_sub_pd _mm_sub_ps _mm_unpackhi_epi16 _mm_unpackhi_epi32 _mm_unpackhi_epi64 _mm_unpackhi_pd _mm_unpackhi_pi16 _mm_unpackhi_pi32 _mm_unpackhi_ps _mm_unpacklo_epi16 _mm_unpacklo_epi32 _mm_unpacklo_epi64 _mm_unpacklo_pd _mm_unpacklo_pi16 _mm_unpacklo_pi32 _mm_unpacklo_ps _mm_xor_pd _mm_xor_ps _mm_xor_si128

54aefcd4-c07d-4252-8441-723563c8826f commented 7 years ago

There is prior art about such intrinsics in the D language, which exposes the __ctfe intrinsic, quoting their docs:

The __ctfe boolean pseudo-variable, which evaluates to true at compile time, but false at run time, can be used to provide an alternative execution path to avoid operations which are forbidden at compile time. Every usage of __ctfe is evaluated before code generation and therefore has no run-time cost, even if no optimizer is used.

http://dlang.org/spec/function.html
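
For comparison, C++20 later gained an analogous facility, std::is_constant_evaluated() (backed by Clang's __builtin_is_constant_evaluated()). A minimal sketch of how a library could use it to branch between a scalar constant-evaluation path and an intrinsic run-time path; the horizontal-sum routine is illustrative, not Eigen code:

#include <emmintrin.h>   // SSE2 intrinsics
#include <type_traits>   // std::is_constant_evaluated (C++20)

constexpr int hsum4(const int (&x)[4]) {
    if (std::is_constant_evaluated()) {
        // Constant-evaluation path: plain scalar code the evaluator understands.
        return x[0] + x[1] + x[2] + x[3];
    }
    // Run-time path: free to use intrinsics that are not constexpr.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(x));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 0x4E));  // add upper half to lower
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 0xB1));  // add remaining pair
    return _mm_cvtsi128_si32(v);
}

The same function can then back a static_assert at compile time while still emitting SSE code at run time.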

54aefcd4-c07d-4252-8441-723563c8826f commented 7 years ago

That does sound like a titanic task indeed. I've posted this issue to Eigen's bugzilla asking for help with prioritizing the intrinsics.

I am however worried that constant evaluation support for the intrinsics is the wrong approach for tackling this.

I would be grateful if some of you could ping Richard Smith and Faisal Vali on this issue.

Would it be possible to:

This would allow clang to provide constexpr overloads of the intrinsics in plain C++ in the intrinsics headers, by providing implementations specific to the constant expression evaluator.

While not everyone can hack on the compiler, this would allow everybody to work around limitations like this.

topperc commented 7 years ago

It's not as simple as just marking them constexpr. A constant evaluator for each intrinsic would need to be implemented in clang, in lib/AST/ExprConstant.cpp. The backend support for the intrinsics cannot be reused for this.

This would be a huge amount of effort to support all of the SSE/AVX intrinsics. So if we were to do this we would need guidance on which ones are most important in order to prioritize.

54aefcd4-c07d-4252-8441-723563c8826f commented 7 years ago

Not really, this bug is only about the intrinsic builtin functions not being marked with the C++ constexpr keyword.

rotateright commented 7 years ago

Relating bug 21713, which requested that these be equivalent:

__m128i
emul84(__m128i a, int64_t b, const int ndx) {
    if (ndx)
        a = _mm_insert_epi32(a, b, 1);
    else
        a = _mm_insert_epi32(a, b, 0);
    return a;
}

__m128i
emul85(__m128i a, int64_t b, const int ndx) {
    a = _mm_insert_epi32(a, b, !!ndx);
    return a;
}

See also bug 27177 (is this a duplicate of that bug?).

54aefcd4-c07d-4252-8441-723563c8826f commented 7 years ago

Do you have examples of intrinsics that are causing performance issues because they don't constant fold?

No, but this issue is not about that.

If they are not marked constexpr, they cannot be used in constexpr functions, so a library that uses them cannot provide a constexpr API.

That is, because Eigen3 uses intrinsics internally for run-time behavior, it cannot provide a constexpr API that works at compile-time.

RKSimon commented 7 years ago

Do you have examples of intrinsics that are causing performance issues because they don't constant fold?

RKSimon commented 7 years ago

FYI: X86 has been making progress with converting intrinsics (or certain use cases of them) to generic IR: within the headers, via CGBuiltin.cpp, or during InstCombine. We also have constant folding for some target opcodes during DAGCombine. Basic arithmetic, shifts, conversions and shuffles in particular have been converted.