loongson-community / discussions

Cross-community issue tracker & discussions

Proposal: Support `__attribute__((target_clones(XXX)))` #4

Open cthbleachbit opened 10 months ago

cthbleachbit commented 10 months ago

Many computationally intensive software packages could make use of the LSX and LASX SIMD instruction sets on appropriate hardware. As of GCC 13.2.0, the LoongArch target supports neither the `target` nor the `target_clones` attribute.

test.c:1:51: warning: target attribute is not supported on this machine [-Wattributes]
    1 | __attribute__((target_clones("default,lsx"))) int function() {
      |                                            

Target cloning lets the compiler emit copies of the same function optimized for different instruction sets, plus a stub with the original symbol name. At runtime, the stub chooses the most appropriate code path for the current hardware. This means later LA64 processors can take advantage of SIMD without breaking binary compatibility with older processors.
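To make the "copies plus a stub" idea concrete, here is a minimal hand-written sketch in portable C of roughly what target cloning automates. Everything in it is an assumption for illustration: the feature bits `FEAT_LSX` / `FEAT_LASX`, the function names, and the lazy function-pointer binding are all hypothetical, and the "SIMD" clones simply reuse the baseline body. A real compiler would emit the clones itself and would most likely bind the symbol through a loader-level mechanism rather than a check on each call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; a real resolver would obtain them from the OS. */
enum { FEAT_LSX = 1u << 0, FEAT_LASX = 1u << 1 };

typedef int (*sum_fn)(const int *, size_t);

/* Baseline clone: runs on every processor. */
static int sum_default(const int *v, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-ins for the SIMD clones. With compiler support these would be the
 * same source compiled with LSX or LASX enabled; here they only forward to
 * the baseline so that the sketch stays portable. */
static int sum_lsx(const int *v, size_t n)  { return sum_default(v, n); }
static int sum_lasx(const int *v, size_t n) { return sum_default(v, n); }

/* Resolver: picks one clone for the given feature set. */
static sum_fn resolve_sum(unsigned features)
{
    if (features & FEAT_LASX)
        return sum_lasx;
    if (features & FEAT_LSX)
        return sum_lsx;
    return sum_default;
}

/* Feature set seen by the stub; fixed at 0 (no SIMD) in this sketch. */
static unsigned detected_features = 0;

/* Stub: keeps the original symbol name and binds on the first call.
 * Not thread-safe; a real implementation would bind before first use. */
int sum(const int *v, size_t n)
{
    static sum_fn impl = NULL;
    if (impl == NULL)
        impl = resolve_sum(detected_features);
    assert(impl != NULL);
    return impl(v, n);
}
```

Callers only ever see `sum`, which is what keeps the binary usable on older processors.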

xen0n commented 10 months ago

cc @xry111

xen0n commented 10 months ago

Reading materials:

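Work to do:

  • design and decide on an ABI for the resolver
  • GCC side:
    • teach libgcc to do something similar to libgcc/config/i386/cpuinfo.c, but using HWCAP like how AArch64 probes LSE.
    • wire up codegen
  • LLVM side:
    • implement the said libgcc interface in compiler-rt
    • wire up clang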

We'd be better off inviting Loongson people to also participate, and we'd have to hurry up if we want this in GCC 14/LLVM 18.

xen0n commented 10 months ago

cc Loongson Toolchain folks: @ChenghuaXu @scylaac @SixWeining

xry111 commented 10 months ago

> Reading materials:
>
> Work to do:
>
>   • design and decide on an ABI for the resolver
>   • GCC side:
>     • teach libgcc to do something similar to libgcc/config/i386/cpuinfo.c, but using HWCAP like how AArch64 probes LSE.
>     • wire up codegen
>   • LLVM side:
>     • implement the said libgcc interface in compiler-rt
>     • wire up clang
>
> We'd be better off inviting Loongson people to also participate, and we'd have to hurry up if we want this in GCC 14/LLVM 18.

The GCC development branch which will become GCC 14 is in general development mode (Stage 1) and will transition to general bugfixing mode (Stage 3) at the start of Nov. 19th and from there to regression and documentation fixing mode (Stage 4) at the start of Jan. 8th.
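For the HWCAP-based probing mentioned in the work list above, a rough sketch of what the runtime side might look like on Linux follows. It is not the libgcc design, only an illustration. The names `DEMO_HWCAP_LSX`, `DEMO_HWCAP_LASX`, `decode_hwcap` and `probe_simd` are hypothetical, and the bit positions are assumed to mirror `HWCAP_LOONGARCH_LSX` and `HWCAP_LOONGARCH_LASX` from the kernel's `<asm/hwcap.h>`; check them against the actual header before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__linux__) && defined(__loongarch__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#endif

/* Assumption: same bit positions as HWCAP_LOONGARCH_LSX / _LASX in the
 * Linux uapi header <asm/hwcap.h>. Verify before use. */
#define DEMO_HWCAP_LSX  (1ul << 4)
#define DEMO_HWCAP_LASX (1ul << 5)

struct simd_features {
    bool lsx;
    bool lasx;
};

/* Pure decoding step, kept separate so it can be exercised on any host. */
static struct simd_features decode_hwcap(unsigned long hwcap)
{
    struct simd_features f;
    f.lsx  = (hwcap & DEMO_HWCAP_LSX) != 0;
    f.lasx = (hwcap & DEMO_HWCAP_LASX) != 0;
    return f;
}

/* Reads the auxiliary vector on LoongArch Linux; reports "no SIMD" on any
 * other platform so that callers fall back to the default clone. */
static struct simd_features probe_simd(void)
{
#if defined(__linux__) && defined(__loongarch__)
    struct simd_features f = decode_hwcap(getauxval(AT_HWCAP));
#else
    struct simd_features f = decode_hwcap(0);
#endif
    /* LASX is expected to imply LSX, as noted later in this thread. */
    assert(!f.lasx || f.lsx);
    return f;
}
```

Keeping the decode step free of system calls also makes it easy to unit-test the resolver logic on a build machine that is not LoongArch.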

xry111 commented 10 months ago

One issue: suppose we have three target clones: generic, feature A, and feature B. Should we resolve to the clone for A or for B if both A and B are detected at runtime? We could force the compiler to emit another clone for "A + B" in this case, but in general doing so makes the number of target clones grow exponentially with the number of independent features.

xen0n commented 10 months ago

> One issue: suppose we have three target clones: generic, feature A, and feature B. Should we resolve to the clone for A or for B if both A and B are detected at runtime? We could force the compiler to emit another clone for "A + B" in this case, but in general doing so makes the number of target clones grow exponentially with the number of independent features.

I think there should be some awareness of feature dependencies: for example LASX implies LSX, so if both are present the LASX version should be enough. At least this is the most popular use case right now...
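One possible way to handle both concerns is to give every clone a required-feature mask and a priority, and let the resolver pick the highest-priority clone whose requirements are all met. The sketch below is only an illustration of that idea, not a proposed ABI: `FEAT_A`, `FEAT_B`, `struct clone`, `pick_clone` and the priority values are all hypothetical. `full_clone_count` shows the exponential growth that emitting one clone per feature combination would cause.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits for two independent features A and B. */
#define FEAT_A (1u << 0)
#define FEAT_B (1u << 1)

struct clone {
    const char *name;      /* label for the clone */
    unsigned    required;  /* every bit set here must be available */
    int         priority;  /* higher wins among usable clones */
};

/* Returns the usable clone with the highest priority,
 * or NULL if no clone is usable. */
static const struct clone *pick_clone(const struct clone *c, size_t n,
                                      unsigned available)
{
    const struct clone *best = NULL;
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((c[i].required & ~available) != 0)
            continue;                   /* needs a feature that is missing */
        if (best == NULL || c[i].priority > best->priority)
            best = &c[i];
    }
    return best;
}

/* Clones needed to cover every combination of n independent features.
 * Only valid while n is smaller than the bit width of unsigned long. */
static unsigned long full_clone_count(unsigned n)
{
    return 1ul << n;
}

/* Three clones, no "A + B" clone. The priorities are an assumed policy. */
static const struct clone clones[] = {
    { "default", 0,      0 },
    { "A",       FEAT_A, 1 },
    { "B",       FEAT_B, 2 },
};
```

With this table, a machine that has both A and B resolves to the B clone. The LSX/LASX case fits the same scheme: a LASX clone would require both bits and carry the higher priority, so no separate "LSX + LASX" clone is needed.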

xry111 commented 10 months ago

> > One issue: suppose we have three target clones: generic, feature A, and feature B. Should we resolve to the clone for A or for B if both A and B are detected at runtime? We could force the compiler to emit another clone for "A + B" in this case, but in general doing so makes the number of target clones grow exponentially with the number of independent features.
>
> I think there should be some awareness of feature dependencies: for example LASX implies LSX, so if both are present the LASX version should be enough. At least this is the most popular use case right now...

Yes, that's true for now, so maybe we can support only LSX and LASX in target_clones for GCC 14.

Is the resolver interface a part of the psABI? If so, we'll need to make a future-proof design from the start. But perhaps we don't need a stable ABI at all...

xry111 commented 10 months ago

To me, we don't need a stable ABI, and GCC and Clang can even implement it in different ways.

Should we just create RFEs in GCC and Clang issue trackers?

xen0n commented 10 months ago

> To me, we don't need a stable ABI, and GCC and Clang can even implement it in different ways.

But if we want any interoperability between compiled objects targeting such an ABI, which can frequently matter in the "vendor hands out closed-source .a SDKs that may or may not be compiled with the same compiler flavor/version your project's using" scenario, agreement and stability are definitely wanted.

> Should we just create RFEs in GCC and Clang issue trackers?

Sure!

xry111 commented 10 months ago

> > To me, we don't need a stable ABI, and GCC and Clang can even implement it in different ways.
>
> But if we want any interoperability between compiled objects targeting such an ABI, which can frequently matter in the "vendor hands out closed-source .a SDKs that may or may not be compiled with the same compiler flavor/version your project's using" scenario, agreement and stability are definitely wanted.

I just took a look at the x86_64 implementation:

double x __attribute__((vector_size(32)));
double y __attribute__((vector_size(32)));

__attribute__((target_clones("default,avx")))
void test(void)
{
    x += y;
}

With GCC, the symbols are:

0000000000000000 l     F .text  000000000000002f test.default
0000000000000030 l     F .text  000000000000001c test.avx
0000000000000000  w    F .text.test.resolver    000000000000002b test.resolver
0000000000000000 g   i   .text.test.resolver    000000000000002b test

With Clang, the symbols are:

0000000000000000 g     F .text  0000000000000031 test.default.1
0000000000000040 g     F .text  000000000000001c test.avx.0
0000000000000000  w    F .text.test.resolver    0000000000000021 test.resolver
0000000000000000  w  i   .text.test.resolver    0000000000000021 test.ifunc

So there are already some differences... And it seems that with the Clang implementation I cannot call "test" from another TU.

xry111 commented 10 months ago

> So there are already some differences... And it seems that with the Clang implementation I cannot call "test" from another TU.

Correction: with Clang, I must also declare the test function with the target_clones attribute in the other TU.

I'm not sure which of the x86_64 GCC and x86_64 Clang models is better.
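One way to cope with the Clang behavior described above is to put the attribute on the prototype in a shared header, so every TU sees the same declaration. The sketch below folds the header, the defining TU and a caller into one file for brevity. The macro `TEST_CLONES`, the header name `test.h` and the helper `call_test_and_read` are hypothetical. The guard is an assumption about where x86_64 target cloning is usable (glibc with ifunc support); elsewhere the macro expands to nothing and `test` is an ordinary function.

```c
#include <assert.h>
#include <stdio.h>   /* any libc header; makes __GLIBC__ visible on glibc */

/* ---- would live in a shared header, e.g. a hypothetical "test.h" ---- */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__has_attribute)
#  if __has_attribute(target_clones)
#    define TEST_CLONES __attribute__((target_clones("default,avx")))
#  endif
#endif
#ifndef TEST_CLONES
#  define TEST_CLONES   /* no multiversioning here: plain function */
#endif

extern double x[4], y[4];
TEST_CLONES void test(void);   /* the attribute travels with the prototype */

/* ---- defining TU ---- */
double x[4] = {1, 2, 3, 4};
double y[4] = {10, 20, 30, 40};

TEST_CLONES void test(void)
{
    for (int i = 0; i < 4; i++)
        x[i] += y[i];
}

/* ---- another TU would include the header and simply call test() ---- */
double call_test_and_read(int i)
{
    assert(i >= 0 && i < 4);
    test();
    return x[i];
}
```

This does not settle which model is preferable; it only keeps the declaration consistent across TUs for both compilers.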