
simd #4

neverchanje opened 6 years ago

SIMD

Single Instruction, Multiple Data

tutorial: http://www.cs.uu.nl/docs/vakken/magr/2017-2018/files/SIMD%20Tutorial.pdf
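To make "single instruction, multiple data" concrete, here is a minimal sketch that assumes an x86-64 compiler, where SSE2 is always available. The scalar function performs 16 separate byte additions. The SIMD function loads both inputs into 128-bit registers and adds all 16 byte lanes with one _mm_add_epi8 (PADDB) instruction. The function names are illustrative and do not come from any library.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_si128, _mm_add_epi8, _mm_storeu_si128
#include <cassert>      // used by the usage checks

// Scalar version: one addition per byte, 16 iterations.
inline void AddBytesScalar(const unsigned char* a, const unsigned char* b, unsigned char* out) {
    for (int i = 0; i < 16; ++i)
        out[i] = static_cast<unsigned char>(a[i] + b[i]);  // wraps modulo 256
}

// SIMD version: the 16 additions happen in a single instruction.
inline void AddBytesSIMD(const unsigned char* a, const unsigned char* b, unsigned char* out) {
    // Unaligned loads, so a and b may point anywhere.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i sum = _mm_add_epi8(va, vb);  // 16 independent byte additions, wrapping modulo 256
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

Both functions produce the same bytes. _mm_add_epi8 wraps around instead of saturating, which matches the scalar cast back to unsigned char.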

SIMD optimization in rapidjson

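Use SIMD to skip whitespace characters (' ', '\n', '\t', '\r') in JSON. Some workloads contain a lot of whitespace. For example, deeply nested JSON that is pretty-printed rather than written compactly has many spaces before each {: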

{
  ...{
       ...{
            ...{
            ...}
       ...}
  ...}
}

The usual way to skip whitespace is:

// Advance p to the first non-whitespace character.
// '\0' is not whitespace, so the loop also stops at the end of the string.
while (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t')
    ++p;

With SIMD, however, a single instruction can compare 16 characters at once instead of one at a time. Reference: https://zhuanlan.zhihu.com/p/20037058

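#include <nmmintrin.h>  // SSE4.2 intrinsics (_mm_cmpistri); compile with -msse4.2 on GCC/Clang
#include <cstddef>      // size_t

// Scalar fallback for the last (< 16) bytes of a length-delimited buffer
inline const char *SkipWhitespace(const char* p, const char* end) {
    while (p != end && (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t'))
        ++p;
    return p;
}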

//! Skip whitespace with the SSE 4.2 pcmpistri instruction, testing 16 8-bit characters at once.
//! The input must be a '\0'-terminated string.
inline const char *SkipWhitespace_SIMD(const char* p) {
    // Fast return for single non-whitespace
    if (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t')
        ++p;
    else
        return p;

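    // Advance to the next 16-byte boundary. An aligned 16-byte load never crosses a page
    // boundary, so the loop below cannot fault when it reads a few bytes past the '\0'.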
    const char* nextAligned = reinterpret_cast<const char*>((reinterpret_cast<size_t>(p) + 15) & static_cast<size_t>(~15));
    while (p != nextAligned)
        if (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t')
            ++p;
        else
            return p;

    // Scan the rest of the string 16 bytes at a time
    static const char whitespace[16] = " \n\r\t";
    const __m128i w = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&whitespace[0]));

    for (;; p += 16) {
        const __m128i s = _mm_load_si128(reinterpret_cast<const __m128i *>(p));
        const int r = _mm_cmpistri(w, s, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT | _SIDD_NEGATIVE_POLARITY);
        if (r != 16)    // at least one byte is non-whitespace (or the terminating '\0')
            return p + r;
    }
}

//! Bounded variant for a buffer [p, end) that need not be '\0'-terminated.
inline const char *SkipWhitespace_SIMD(const char* p, const char* end) {
    // Fast return for single non-whitespace
    if (p != end && (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t'))
        ++p;
    else
        return p;

    // Process full 16-byte chunks with SIMD; _mm_loadu_si128 does not require alignment
    static const char whitespace[16] = " \n\r\t";
    const __m128i w = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&whitespace[0]));

    // Compare end - p instead of forming end - 16, which may point before the buffer
    for (; end - p >= 16; p += 16) {
        const __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
        const int r = _mm_cmpistri(w, s, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT | _SIDD_NEGATIVE_POLARITY);
        if (r != 16)    // at least one byte is non-whitespace
            return p + r;
    }

    return SkipWhitespace(p, end);
}
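The functions above need SSE4.2 (on GCC/Clang, compile with -msse4.2). The following is a hedged sketch for CPUs that only have SSE2. It keeps the 16-bytes-at-a-time idea but uses plain byte comparisons: the chunk is compared against each of the four whitespace characters with _mm_cmpeq_epi8, the results are ORed, and _mm_movemask_epi8 collapses them into a 16-bit mask. SkipWhitespace_SSE2 is a hypothetical name, and __builtin_ctz assumes GCC or Clang.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>      // used by the usage checks

// Hypothetical SSE2-only variant of the bounded SkipWhitespace_SIMD(p, end).
inline const char* SkipWhitespace_SSE2(const char* p, const char* end) {
    const __m128i sp = _mm_set1_epi8(' ');
    const __m128i nl = _mm_set1_epi8('\n');
    const __m128i cr = _mm_set1_epi8('\r');
    const __m128i tb = _mm_set1_epi8('\t');

    while (end - p >= 16) {
        const __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        // Each byte of ws is 0xFF if the corresponding byte of s is whitespace, else 0x00.
        __m128i ws = _mm_or_si128(_mm_cmpeq_epi8(s, sp), _mm_cmpeq_epi8(s, nl));
        ws = _mm_or_si128(ws, _mm_or_si128(_mm_cmpeq_epi8(s, cr), _mm_cmpeq_epi8(s, tb)));
        // Bit i of mask is 1 if byte i is whitespace.
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(ws));
        if (mask != 0xFFFF) {
            // The lowest 0 bit of mask is the first non-whitespace byte.
            // ~mask is never 0 (its upper 16 bits are set), so ctz is well defined.
            return p + __builtin_ctz(~mask);
        }
        p += 16;
    }
    // Fewer than 16 bytes left: finish one byte at a time.
    while (p != end && (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t'))
        ++p;
    return p;
}
```

This trades the single pcmpistri for four compares and three ORs, all of which are simple instructions. Which variant is faster depends on the CPU and should be measured.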

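The SSE4.2 version relies on two kinds of intrinsics. _mm_load_si128 and _mm_loadu_si128 load 16 bytes into a 128-bit register: _mm_load_si128 requires a 16-byte-aligned address (hence the alignment loop in the first function), while _mm_loadu_si128 accepts any address. _mm_cmpistri (the SSE4.2 pcmpistri instruction) compares the two registers. With _SIDD_CMP_EQUAL_ANY, each byte of s is checked against every byte of w; _SIDD_NEGATIVE_POLARITY inverts the result so that a set bit means "not whitespace"; and _SIDD_LEAST_SIGNIFICANT returns the index of the first such byte, or 16 if all 16 bytes are whitespace.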

Performance test

Test case 1 shows that processing 16 characters at a time instead of one gives roughly a 10x speedup, which matches our expectations.