NEON: more fp16 using intrinsics supported by architecture v7

yyctw commented 11 months ago

Hi all, this is Eric from Andes Technology Corporation. This PR includes the following changes:

Add the types simde_float16x4x{3/4}_t and simde_float16x8x{3/4}_t.
Add 351 initial implementations and corresponding test cases in 63 families, which are listed below:

abal, abal_high, cale, calt, create, cvt, cvt_n, cvtn, dup_lane, ext
fma, fma_lane, fma_n, fms, fms_lane, fms_n, get_lane
ld1_dup, ld1_lane, ld1_x2, ld1_x3, ld1_x4, ld1q_x2, ld1q_x3, ld1q_x4
ld2, ld2_dup, ld2_lane, ld3, ld3_dup, ld3_lane, ld4, ld4_dup, ld4_lane
mla_lane, mlal_high_lane, mls_lane, mlsl_high_lane, mul_lane, neg
qdmlal, qdmlal_high, qdmlal_high_lane, qdmlal_high_n, qdmlal_lane, qdmlal_n
qdmlsl, qdmlsl_high, qdmlsl_high_lane, qdmlsl_high_n, qdmlsl_lane, qdmlsl_n
qdmull, qdmull_high, qdmull_high_lane, qdmull_high_n, qdmull_lane, qdmull_n
qdmulh, qdmulh_lane, qshl, reinterpret, sqrt
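
For readers unfamiliar with the x3/x4 types, here is a minimal sketch of what an ld3-style load into a three-vector struct does. The names demo_f32x4_t, demo_f32x4x3_t and demo_ld3 are hypothetical and are not SIMDe identifiers; float lanes stand in for simde_float16 so the sketch builds with any C99 compiler. It only illustrates the de-interleaving semantics, not how this PR implements them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a 4-lane vector and a struct of three vectors.
 * NEON's xN types expose their members through an array named val. */
typedef struct { float values[4]; } demo_f32x4_t;
typedef struct { demo_f32x4_t val[3]; } demo_f32x4x3_t;

/* ld3-style load: read 12 consecutive elements and de-interleave them,
 * so val[k] receives source elements k, k+3, k+6 and k+9. */
static demo_f32x4x3_t demo_ld3(const float *ptr) {
  demo_f32x4x3_t r;
  assert(ptr != NULL);
  for (size_t lane = 0; lane < 4; lane++) {
    for (size_t k = 0; k < 3; k++) {
      r.val[k].values[lane] = ptr[lane * 3 + k];
    }
  }
  return r;
}
```

With a source array holding 0 through 11, val[0] would hold 0, 3, 6, 9, val[1] would hold 1, 4, 7, 10, and val[2] would hold 2, 5, 8, 11.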

"macOS (version 14.2, macos-13)" was the only test that failed on my fork, and it occurred during the "Install Homebrew Dependencies" stage, but all the other CI tests passed smoothly. Thanks for reading and any recommendations are welcome!

mr-c commented 11 months ago

Thank you @yyctw !

Please review https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

mr-c commented 11 months ago

Looks like the MSVC build also has complaints: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

mr-c commented 11 months ago

Test errors on Fedora i386 (ignore the avx512 failures) https://download.copr.fedorainfracloud.org/results/packit/simd-everywhere-simde-1075/fedora-rawhide-i386/06522193-simde/builder-live.log.gz (source)

CircleCI finished the 32-bit x86 build, but the tests failed: https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

yyctw commented 11 months ago

Thank you @yyctw !

Please review https://app.circleci.com/pipelines/github/simd-everywhere/simde/1139/workflows/b6f035be-5458-4865-b49c-fb22d4d49335/jobs/3138/parallel-runs/0/steps/0-112

I attempted to build these two failing test cases using the "aarch64-linux-gnu-g++" toolchain with the same compile options that "circleci: i686-gcc11-O2" uses. However, I observed that these two test cases passed successfully on my x86 machine.

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

mr-c commented 11 months ago

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

yyctw commented 11 months ago

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

Sure, I will report it as soon as possible.

Looks like the MSVC build also has complaints: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSTIFY and simde_mla_lane_*. I've manually expanded SIMDE_CONSTIFY and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

mr-c commented 11 months ago

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSTIFY and simde_mla_lane_*. I've manually expanded SIMDE_CONSTIFY and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

Yeah, the SIMDE_CONSTIFY_ macros work in the headers, but for MSVC they cause problems in the tests.
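
To make the "manual expansion" concrete, here is a rough sketch of the pattern. DEMO_CONSTIFY_4, DEMO_GET_LANE and the two wrapper functions are hypothetical names that only mirror the idea behind the SIMDE_CONSTIFY_ helpers; they are not copied from SIMDe. One behaviour that could explain the MSVC trouble is that its traditional preprocessor is known to forward `__VA_ARGS__` as a single argument when it is handed to another function-like macro, but that is an assumption here, not something confirmed in this thread.

```c
#include <assert.h>

typedef struct { int values[4]; } demo_i32x4_t;

/* Stand-in for an intrinsic whose lane argument must be a literal constant. */
#define DEMO_GET_LANE(v, lane) ((v).values[(lane)])

/* Constify-style helper: dispatch a runtime index to call sites that each
 * pass a literal, so the wrapped macro always sees a constant lane. */
#define DEMO_CONSTIFY_4(func, result, default_case, imm, ...) \
  do { \
    switch (imm) { \
      case 0: (result) = func(__VA_ARGS__, 0); break; \
      case 1: (result) = func(__VA_ARGS__, 1); break; \
      case 2: (result) = func(__VA_ARGS__, 2); break; \
      case 3: (result) = func(__VA_ARGS__, 3); break; \
      default: (result) = (default_case); break; \
    } \
  } while (0)

/* Version that relies on nested macro expansion. */
static int get_lane_via_macro(demo_i32x4_t v, int lane) {
  int r;
  DEMO_CONSTIFY_4(DEMO_GET_LANE, r, 0, lane, v);
  return r;
}

/* The same logic with the helper expanded by hand, which avoids passing
 * __VA_ARGS__ through a second function-like macro. */
static int get_lane_expanded(demo_i32x4_t v, int lane) {
  int r;
  switch (lane) {
    case 0: r = DEMO_GET_LANE(v, 0); break;
    case 1: r = DEMO_GET_LANE(v, 1); break;
    case 2: r = DEMO_GET_LANE(v, 2); break;
    case 3: r = DEMO_GET_LANE(v, 3); break;
    default: r = 0; break;
  }
  return r;
}

/* Both forms should agree for in-range and out-of-range indices. */
static void demo_check_equivalence(demo_i32x4_t v) {
  for (int lane = -1; lane <= 4; lane++) {
    assert(get_lane_via_macro(v, lane) == get_lane_expanded(v, lane));
  }
}
```

The hand-expanded form is more verbose, but it keeps the test code independent of how a given preprocessor rescans variadic arguments.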

mr-c commented 11 months ago

FYI, see https://github.com/simd-everywhere/implementation-status/commit/916de72538b89743016a3c411063e616fc33cf30 for the status of f16 type NEON intrinsics prior to this PR

(I updated the script that generates the implementation status, as it was ignoring functions that use 16-bit floating point types)

yyctw commented 11 months ago

Upon further investigation, I noticed that the test cases only fail when built using the "i686-linux-gnu-g++-11" toolchain, while they pass when compiled with "i686-linux-gnu-gcc-11". I guess that there might be some issues or bugs in the "i686-linux-gnu-g++-11" toolchain when using the O2 optimization option.

Yeah, this project often finds new compiler bugs. Can you report this bug to GCC? We'll need a workaround for the affected functions in SIMDe

Sure, I will report it as soon as possible.

I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.
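
As a rough illustration of the effect, and not the actual failing test, the sketch below shows how intermediate results can differ when double arithmetic is evaluated in a wider x87 register and only rounded once it is stored. All function names are hypothetical. The volatile store plays roughly the role that -ffloat-store plays for a whole translation unit, and FLT_EVAL_METHOD is the standard macro from <float.h> that reports how the compiler evaluates floating-point expressions.

```c
#include <assert.h>
#include <float.h>

/* Force a value through memory so it is rounded to a true 64-bit double. */
static double demo_round_to_double(double x) {
  volatile double stored = x;
  return stored;
}

/* Nonzero when the compiler reports that double expressions may be
 * evaluated in a wider format (2: long double, negative: indeterminable). */
static int demo_has_excess_precision(void) {
#if defined(FLT_EVAL_METHOD)
  return (FLT_EVAL_METHOD == 2) || (FLT_EVAL_METHOD < 0);
#else
  return 0;
#endif
}

/* Compares a product that may still live in a wide register with the same
 * product after a forced store. On SSE2-based targets these should match;
 * on x87 the outcome can depend on the optimization level. */
static int demo_product_is_stable(double a, double b) {
  double direct = a * b;
  double stored = demo_round_to_double(a * b);
  return direct == stored;
}

/* Rounding is idempotent for non-NaN values: storing an already rounded
 * double a second time changes nothing. */
static void demo_self_check(double a, double b) {
  double once = demo_round_to_double(a * b);
  assert(once == demo_round_to_double(once));
}
```

If demo_has_excess_precision() reports nonzero on the i686 toolchain, comparing demo_product_is_stable() across -O0 and -O2 builds might be a starting point for a reduced test case.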

Looks like the MSVC build also has complaints: https://ci.appveyor.com/project/nemequ/simde/builds/48267547/job/vgv72gurd4e0s202#L1856

It appears that there are some bugs when expanding nested macros, such as SIMDE_CONSTIFY and simde_mla_lane_*. I've manually expanded SIMDE_CONSTIFY and resolved the issue. Many other implementations, like {qd}mls{l}_lane, have similar problems, and I will fix all of them as soon as possible.

Solved.

yyctw commented 11 months ago

I found that this problem may be caused by variations in the precision of double across different processors [ref]. I resolved it by adding the -ffloat-store flag in the i686-gcc-11-qemu.cross file.

This is a good workaround to document in the README for x86 (32-bit) users, but it is still a compiler bug if different -O optimization levels produce different math. So we'll need to get a minimal reproducer and file a bug with GCC. Hopefully the failing test cases will make developing a minimal reproducer easier. Let me know if you need help with that.

As for a workaround, perhaps one of the following, applied only for the problematic GCC versions, will help:

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-sseregparm-function-attribute_002c-x86
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 with one or more of no-mmx, no-fancy-math-387, fpmath=sse
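
A minimal sketch of how such a per-function workaround could be gated, under several assumptions: DEMO_X87_WORKAROUND and demo_mul_add are hypothetical names, GCC 11 on 32-bit x86 is taken as the affected configuration because that is the toolchain named in this thread, and lowering the optimization level to O1 is only a guess at what avoids the problem. Swapping in the target attribute with fpmath=sse would follow the same shape.

```c
#include <assert.h>

/* Apply the attribute only for the assumed problematic configuration:
 * real GCC (not clang), major version 11, 32-bit x86. Everywhere else the
 * macro expands to nothing. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__) && (__GNUC__ == 11)
  #define DEMO_X87_WORKAROUND __attribute__((__optimize__("O1")))
#else
  #define DEMO_X87_WORKAROUND
#endif

/* Example of a function that would carry the workaround. */
DEMO_X87_WORKAROUND
static double demo_mul_add(double a, double b, double c) {
  return a * b + c;
}

/* Small exact-arithmetic check that holds with or without the attribute. */
static void demo_workaround_smoke_test(void) {
  assert(demo_mul_add(2.0, 3.0, 1.0) == 7.0);
}
```

Keeping the condition in one macro means the workaround can be dropped in a single place once a fixed compiler version is known.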

If you want to get the bulk of this merged first, feel free to open a new PR that skips the functions that trigger the compiler bug. Then this PR can be rebased and kept until we implement a workaround.

Sure, I'll start by opening a new PR without the functions that trigger the compile errors. After that, I'll report this compilation bug to GCC and look for a workaround for SIMDe.

mr-c commented 11 months ago

@yyctw Now that https://github.com/simd-everywhere/simde/pull/1081 is merged, do you want to keep this PR to develop the workaround, or will you open a new one?

yyctw commented 11 months ago

@yyctw Now that #1081 is merged, do you want to keep this PR to develop the workaround, or will you open a new one?

I'll open a new one; this PR can be closed.

simd-everywhere / simde

NEON: more fp16 using intrinsics supported by architecture v7 #1075