simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License
2.32k stars 239 forks source link

The int128 implementation of simde_mm_testz_si128() is incorrect #1119

Open markos opened 8 months ago

markos commented 8 months ago

The int128 implementation of simde_mm_testz_si128() is incorrect, caught in vectoscan CI with SIMDe backend:

https://buildbot-ci.vectorcamp.gr/#/builders/119/builds/14

After investigation I found that for x86 the int128 implementation was used and only these tests were failing. The other architectures used the u64 method and they had correct results.

https://github.com/simd-everywhere/simde/blob/5405bbdcc7e9045b0901c847c8868594deb511a6/simde/x86/sse4.1.h#L2344

The following fix seems to work and it makes sense as it seems to be consistent with the u64 implementation below.

diff --git a/simde/x86/sse4.1.h b/simde/x86/sse4.1.h
index 15a197b9..ad583bd1 100644
--- a/simde/x86/sse4.1.h
+++ b/simde/x86/sse4.1.h
@@ -2341,10 +2341,10 @@ simde_mm_testz_si128 (simde__m128i a, simde__m128i b) {
       v128_t m = wasm_v128_and(a_.wasm_v128, b_.wasm_v128);
       return (wasm_i64x2_extract_lane(m, 0) | wasm_i64x2_extract_lane(m, 1)) == 0;
     #elif defined(SIMDE_HAVE_INT128_)
-      if ((a_.u128[0] & b_.u128[0]) == 0) {
-        return 1;
+      if ((a_.u128[0] & b_.u128[0]) > 0) {
+        return 0;
       }
-      return 0;
+      return 1;
     #else
       for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) {
         if ((a_.u64[i] & b_.u64[i]) > 0)
mr-c commented 8 months ago

To document our conversation from another place:

The int128 implementation was introduced in https://github.com/simd-everywhere/simde/commit/f132275f85ab1c1cb1e890538ee552c11ca09c38#diff-c34749104c5a84a98b2e0af443fc6df77276e235abd1cff17a74d285c4fddd93R2344

From what I see, it still matches the guidance from Intel for _mm_testz_si128: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_testz_si128&ig_expand=6857

Compute the bitwise AND of 128 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. [Something about CF, but that is not used in this function] Return the ZF value.

The other pathway has reversed structure to enable fast failing, as it compares 64 bits at a time; so if the bitwise AND of the first 64 bits is not zero, we return 0 immediately