intel / ARM_NEON_2_x86_SSE

The platform independent header allowing to compile any C/C++ code containing ARM NEON intrinsic functions for x86 target systems using SIMD up to AVX2 intrinsic functions
Other
430 stars 149 forks source link

vcltq_u8 invalid results #20

Closed mikaheli closed 5 years ago

mikaheli commented 5 years ago

After commit a332e45af06be51d2c34059d80e792a6db4643b5 vcltq_u8 (and probably other unsigned comparisons) produce invalid results.

Example code to reproduce:

#include <stdio.h>
#include <stdint.h>

#define NEON2SSE_DISABLE_PERFORMANCE_WARNING
#include "ARM_NEON_2_x86_SSE/NEON_2_SSE.h"

// gcc -Wall -Wno-unused-function -Wno-sequence-point -msse4 -o test test.c

void print_u8x16(const char *msg, uint8x16_t vec)
{
    printf("%s vector\n", msg);
    printf("  [0-3]   %3u %3u %3u %3u\n",
           vgetq_lane_u8(vec, 0),
           vgetq_lane_u8(vec, 1),
           vgetq_lane_u8(vec, 2),
           vgetq_lane_u8(vec, 3));
    printf("  [4-7]   %3u %3u %3u %3u\n",
           vgetq_lane_u8(vec, 4),
           vgetq_lane_u8(vec, 5),
           vgetq_lane_u8(vec, 6),
           vgetq_lane_u8(vec, 7));
    printf("  [8-11]  %3u %3u %3u %3u\n",
           vgetq_lane_u8(vec, 8),
           vgetq_lane_u8(vec, 9),
           vgetq_lane_u8(vec, 10),
           vgetq_lane_u8(vec, 11));
    printf("  [12-15] %3u %3u %3u %3u\n",
           vgetq_lane_u8(vec, 12),
           vgetq_lane_u8(vec, 13),
           vgetq_lane_u8(vec, 14),
           vgetq_lane_u8(vec, 15));
    printf("\n");
}

int main(void)
{   
    uint8_t testdata[16] =
    {
        120, 121, 122, 123, 124, 125, 126, 127,
        128, 129, 130, 131, 132, 249, 250, 251
    };

    uint8x16_t threshold_u8x16 = vdupq_n_u8(250);

    uint8x16_t testdata_u8x16 = vld1q_u8(testdata);

    // From ARM documentation:
    //
    // If the comparison is true for a lane, the result in that lane is all bits
    // set to one. If the comparison is false for a lane, all bits are set to
    // zero.
    //
    // After commit a332e45af06be51d2c34059d80e792a6db4643b5 we get invalid
    // result.
    // 
    // Expected result:
    //
    // Comparison result vector
    //   [0-3]   255 255 255 255
    //   [4-7]   255 255 255 255
    //   [8-11]  255 255 255 255
    //   [12-15] 255 255   0   0
    // 
    // Current result:
    //
    // Comparison result vector
    //   [0-3]     0   0   0 255
    //   [4-7]   255 255 255 255
    //   [8-11]  255 255 255 255
    //   [12-15] 255 255   0   0
    uint8x16_t belowmask_u8x16 = vcltq_u8(testdata_u8x16, threshold_u8x16);

    print_u8x16("Input", testdata_u8x16);
    print_u8x16("Threshold", threshold_u8x16);
    print_u8x16("Comparison result", belowmask_u8x16);

    return 0;
}
zihanhuang commented 5 years ago

Same here. And vcgtq_u8(a, b) give wrong answer if a >= b + 128. Output should not be 0.

hakanardo commented 5 years ago

Found the same issue with vcltq_u16. The trick used to implement these ops wont work if the difference is too big.

Zvictoria commented 5 years ago

Thanks a mega-sorry for delay with fixing it. Will do my best to fix it ASAP.

Zvictoria commented 5 years ago

Please get the latest commit, it has the corresponding functions fixed. Thanks again

hakanardo commented 5 years ago

Thanx! It seems to be working fine. Great work with this lib!