lhmouse / mcfgthread

Cornerstone of the MOST efficient std::thread on Windows for mingw-w64
https://gcc-mcf.lhmouse.com/

Please reconsider implementing the standard memory functions yourself #58

Closed by CarterLi 2 years ago

CarterLi commented 2 years ago

rep movsb/stosb is fast (with large and aligned memory) because of the ERMSB optimization.

But that is not the case for repz cmpsb; it is not optimized.

RtlCompareMemory is slow because it uses repz cmpsb too. We can use memcmp in ntdll, which results in both less code and better performance.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>  // declares memcmp, needed for decltype(memcmp) in main()
#include <Windows.h>

inline __attribute__((always_inline))
uint8_t __MCF_mequal(const void *__src, const void *__cmp, size_t __size) noexcept
{
  uint8_t __result;
#if defined(__i386__) || defined(__amd64__)
  intptr_t __si, __di, __cx;
  __asm__(
      "xorl %%eax, %%eax;"
      "repz cmpsb;"
#ifdef __GCC_ASM_FLAG_OUTPUTS__
      : "=@ccz"(__result), "=S"(__si), "=D"(__di), "=c"(__cx)
      : "m"(*(const char(*)[])__src), "m"(*(const char(*)[])__cmp),
        "S"(__src), "D"(__cmp), "c"(__size)
      : "ax"
#else  /* __GCC_ASM_FLAG_OUTPUTS__  */
      "setzb %%al;"
      : "=a"(__result), "=S"(__si), "=D"(__di), "=c"(__cx)
      : "m"(*(const char(*)[])__src), "m"(*(const char(*)[])__cmp),
        "S"(__src), "D"(__cmp), "c"(__size)
      : "cc"
#endif /* __GCC_ASM_FLAG_OUTPUTS__  */
  );
#else
  /* Call the generic but slower version in NTDLL.  */
  SIZE_T __n = RtlCompareMemory(__src, __cmp, __size);
  __result = __n == __size;
#endif
  return __result;
}

// https://github.com/google/benchmark/blob/eacce0b503a81a2910cc1ea0439cf7bc39e3377d/include/benchmark/benchmark.h#L445

template <class Tp>
inline __attribute__((always_inline)) void DoNotOptimize(Tp const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

template <class Tp>
inline __attribute__((always_inline)) void DoNotOptimize(Tp& value) {
#if defined(__clang__)
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}

int main() {
  using namespace std::chrono;

  HMODULE ntdll = LoadLibraryA("ntdll.dll");
  auto ntdll_memcmp = (decltype(memcmp)*)GetProcAddress(ntdll, "memcmp");

  char buf1[2048] = "";
  char buf2[2048] = "";

  {
    auto start = high_resolution_clock::now();
    for (auto i = 0; i < 100000000; i++) {
      DoNotOptimize(buf1);
      DoNotOptimize(buf2);
      DoNotOptimize(__MCF_mequal(buf1, buf2, 32));
    }
    std::printf("__MCF_mequal small: %llu\n", (uint64_t)(high_resolution_clock::now() - start).count());
  }

  {
    auto start = high_resolution_clock::now();
    for (auto i = 0; i < 100000000; i++) {
      DoNotOptimize(buf1);
      DoNotOptimize(buf2);
      DoNotOptimize(ntdll_memcmp(buf1, buf2, 32));
    }
    std::printf("ntdll_memcmp small: %llu\n", (uint64_t)(high_resolution_clock::now() - start).count());
  }

  {
    auto start = high_resolution_clock::now();
    for (auto i = 0; i < 10000000; i++) {
      DoNotOptimize(buf1);
      DoNotOptimize(buf2);
      DoNotOptimize(__MCF_mequal(buf1, buf2, 2048));
    }
    std::printf("__MCF_mequal large: %llu\n", (uint64_t)(high_resolution_clock::now() - start).count());
  }

  {
    auto start = high_resolution_clock::now();
    for (auto i = 0; i < 10000000; i++) {
      DoNotOptimize(buf1);
      DoNotOptimize(buf2);
      DoNotOptimize(ntdll_memcmp(buf1, buf2, 2048));
    }
    std::printf("ntdll_memcmp large: %llu\n", (uint64_t)(high_resolution_clock::now() - start).count());
  }
}
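
To illustrate why a byte-at-a-time repz cmpsb loses to a typical memcmp, here is a minimal portable sketch of an equality check that compares eight bytes per iteration. The name `mequal_wordwise` is hypothetical and is not part of mcfgthread or ntdll; real memcmp implementations are usually more elaborate (wider vector loads, alignment handling), so treat this only as a sketch of the idea.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from mcfgthread): returns true when both blocks
// contain identical bytes.
inline bool mequal_wordwise(const void* a, const void* b, std::size_t size) noexcept
{
  const unsigned char* pa = static_cast<const unsigned char*>(a);
  const unsigned char* pb = static_cast<const unsigned char*>(b);

  // Compare eight bytes per iteration. Copying into a local with memcpy is
  // the portable way to express an unaligned load.
  while (size >= sizeof(std::uint64_t)) {
    std::uint64_t wa, wb;
    std::memcpy(&wa, pa, sizeof wa);
    std::memcpy(&wb, pb, sizeof wb);
    if (wa != wb)
      return false;
    pa += sizeof wa;
    pb += sizeof wb;
    size -= sizeof wa;
  }

  // Compare the remaining zero to seven bytes one at a time.
  while (size != 0) {
    if (*pa != *pb)
      return false;
    ++pa;
    ++pb;
    --size;
  }
  return true;
}
```

It could be dropped into the benchmark above as a third candidate next to `__MCF_mequal` and `ntdll_memcmp`; how it ranks will depend on the compiler, flags, and CPU.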

lhmouse commented 2 years ago

> rep movsb/stosb is fast (with large and aligned memory) because of the ERMSB optimization.

Yes; and for small blocks of memory it has a huge startup overhead. Those functions have been designed to be small and inlineable. They are not presumed to be fast.

> But that is not the case for repz cmpsb; it is not optimized.

That does not matter. It is there only because GCC requires it. At the moment we make no use of it at all.

> RtlCompareMemory is slow because it uses repz cmpsb too. We can use memcmp in ntdll, which results in both less code and better performance.

memcmp() does not exist in NTDLL.DEF from mingw-w64 and would cause undefined references if it was called.
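
For context, the benchmark in the first comment avoids that link-time problem by looking the symbol up at run time. A minimal sketch of that approach with a fallback follows; `resolve_memcmp` and `memcmp_fn` are hypothetical names, and the non-Windows branch exists only so the sketch compiles elsewhere. Note that a call through such a pointer cannot be inlined, which works against the "small and inlineable" design goal stated above.

```cpp
#include <cstddef>
#include <cstring>
#ifdef _WIN32
#  include <windows.h>
#endif

using memcmp_fn = int (*)(const void*, const void*, std::size_t);

// Hypothetical helper: resolve memcmp from ntdll.dll at run time instead of
// importing it at link time, so the import library need not list the symbol.
inline memcmp_fn resolve_memcmp() noexcept
{
#ifdef _WIN32
  // ntdll.dll is already loaded in a Win32 process, so a module handle lookup
  // is enough; no LoadLibrary call is needed.
  if (HMODULE ntdll = GetModuleHandleW(L"ntdll.dll"))
    if (FARPROC proc = GetProcAddress(ntdll, "memcmp"))
      return reinterpret_cast<memcmp_fn>(proc);
#endif
  // Fallback (and the only path on non-Windows builds): the C runtime's memcmp.
  return [](const void* a, const void* b, std::size_t n) -> int {
    return std::memcmp(a, b, n);
  };
}
```

A caller would resolve the pointer once and reuse it, for example `static const memcmp_fn cmp = resolve_memcmp();`.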

CarterLi commented 2 years ago

We don't use cmpsb, but we do use memmove:

https://github.com/lhmouse/mcfgthread/blob/ac3cf22f602a15796992d72e52c49e91eb0b38d0/src/dtor_queue.c

And GCC uses these functions internally; that's why GCC requires them.

https://gcc.godbolt.org/z/f889cYc6e
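
As a rough illustration of that point (a sketch, not the contents of the godbolt link): the functions below never name memcpy or memset, yet GCC may lower them to calls to those functions. The GCC manual states that GCC requires even a freestanding environment to provide memcpy, memmove, memset and memcmp. The type `Big` and both function names are made up for this example, and the exact code generated depends on the GCC version, target, and flags.

```cpp
#include <cstddef>

// A type large enough that copying it member by member is unattractive.
struct Big {
  unsigned char bytes[4096];
};

// Plain struct assignment: no library call in the source, but the compiler
// may emit a call to memcpy for it.
void copy_big(Big* dst, const Big* src)
{
  *dst = *src;
}

// Value-initializing a large object: the compiler may emit a call to memset.
void clear_big(Big* p)
{
  *p = Big();
}
```

Compiling this with something like `g++ -O2 -S` and searching the assembly for `memcpy` and `memset` shows whether a given toolchain emits the calls.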

And

They are exported from the DLL for external use

> memcmp() does not exist in NTDLL.DEF from mingw-w64 and would cause undefined references if it was called.

What about kernel*.dll? No

lhmouse commented 2 years ago

The fact that 'we make no use of it' does not mean that we use none of them.