lhmouse / mcfgthread

Cornerstone of the MOST efficient std::thread on Windows for mingw-w64
https://gcc-mcf.lhmouse.com/

You can never defeat builtins in special cases. #23

Closed CarterLi closed 7 years ago

CarterLi commented 7 years ago

For example, memcpy for a small buffer: https://godbolt.org/g/AkekLW (and buffers passed to memcpy with a constant size are usually small and aligned).

So don't do that.

lhmouse commented 7 years ago
  1. We don't have anything newer than SSE2 enabled.
  2. We have -ffreestanding enabled, so a function call is a function call: a call to memcpy() or __builtin_memcpy() results in an external call to that function.

Hardcoding inline versions of these functions might sound like a bad idea, but it is an effective way to prevent GCC from generating such external function calls altogether.

CarterLi commented 7 years ago
lhmouse commented 7 years ago

I have been carefully checking the use of these functions. We do have better solutions (indeed SSE2 is a lot better), but SSE2 can't be enabled on x86 for historical reasons.

CarterLi commented 7 years ago

We already dropped XP/Vista. I see no reason to support CPUs that don't support SSE2 (which means Pentium III and older). For personal use, I'd rather recompile the whole library with -march=native, which means AVX2.

CarterLi commented 7 years ago

And don't align memory manually when the given pointer is not aligned. Just use the *loadu* functions directly. They have the same performance on modern CPUs. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#cats=Load,Store&techs=SSE2&expand=3144,3083

lhmouse commented 7 years ago

Why do we use -ffreestanding in a multi-threading library like this? Why can we call external Win32 APIs but not a standard, widely used C API?

That is because MSVCRT is unavailable. NTDLL does export many C library functions, but I just want to keep them out of my sight. This is true even for MCFCRT, which has basically three versions of memcpy() (an inline version using rep movs, plus inline and external versions using SSE2).

XP support was dropped, but Vista support was not despite being completely untested.

MSYS2 doesn't enable SSE2 for i686 and I have to follow that. Note that Intel is still selling super tiny x86 CPUs that don't even have SSE.

lhmouse commented 7 years ago

And don't align memory manually when the given pointer is not aligned. Just use the loadu functions directly. They have the same performance on modern CPUs.

This is only true if the number of instructions is bounded.

CarterLi commented 7 years ago

That is because MSVCRT is unavailable.

That is not the case. 99% of people that use pthread (mcfgthread in this case) will use standard C APIs. With MinGW, MSVCRT is always linked.

MSYS2 doesn't enable SSE2 for i686

OMG... So let's #ifndef

lhmouse commented 7 years ago

That is not the case. 99% of people that use pthread (mcfgthread in this case) will use standard C APIs. With MinGW, MSVCRT is always linked.

I disabled it.

OMG... So let's #ifndef

No, just keep the rep movs version. It is generic and efficient enough for small chunks of bytes.

CarterLi commented 7 years ago

Summary:

  1. You add another two functions, which means larger code base, larger binary, and maybe more bugs.
  2. You don't use builtins but write these functions yourself, which denies the compiler the chance to optimize the code. In addition, the compiler can't assume that buffers are zero-filled or copied, which may prevent other optimizations.
  3. You use inline asm, which makes your code harder to port.

I see no reason for such a performance-oriented project to do such a bad thing.

lhmouse commented 7 years ago

You add another two functions, which means larger code base, larger binary, and maybe more bugs.

There may be. There ain't going to be. There is a maintainer.

You don't use builtins but write these functions yourself, which denies the compiler the chance to optimize the code.

I don't want it to.

In addition, compiler can't assume that buffers are zero-cleaned or copied, which may prevent other optimizations.

At the moment, GCC doesn't produce better code given the assumption. I don't think it is going to, either.

You use inline asm, which makes your code harder to port.

Basically: I don't care. The code is designed for x64 and x86, and will work on x64 and x86, period. If someone wants to port code to other platforms, just do it. I appreciate it.

I see no reason for such a performance-oriented project to do such a bad thing.

Is it? We are being semantically correct by adding -ffreestanding just because MSVCRT isn't there, despite the fact that NTDLL exports memcpy() (which is a synonym of RtlCopyMemory()) and memset(), aren't we? The result is that __builtin_memcpy() and __builtin_memset() always create function calls that can't be inlined at all, so we have to implement inline ones ourselves. In reality, the inline versions in mcfgthread take up less room and don't contain any branches at all, surpassing the inline ones generated by GCC.

CarterLi commented 7 years ago

Well... Are those functions provided by NTDLL that awful? So why do they provide such a function instead of using the highly optimised one provided by MSVCRT? For stability or something? Instead of providing a new function, can we NTR NTDLL by using the same function name?

lhmouse commented 7 years ago

Such functions provided by NTDLL are usually stable and not as optimized as those provided by MSVCR???.

I pushed a more optimized version of inline_mem.h. See https://godbolt.org/g/4mJ2P0.

lhmouse commented 7 years ago

Well... Are those functions provided by NTDLL that awful? So why do they provide such a function instead of using the highly optimised one provided by MSVCRT? For stability or something?

Microsoft people do have their own consideration. On modern Intel CPUs you can do memcpy with a brute rep movsb and gain maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.

Instead of providing a new function, can we NTR NTDLL by using the same function name?

No comments.

CarterLi commented 7 years ago

Microsoft people do have their own consideration. On modern Intel CPUs you can do memcpy with a brute rep movsb and gain maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.

I have never looked at the disassembly of the memcpy provided by NTDLL because I don't know how to call it directly in my code. I'm just guessing. But I have looked at the disassembly of lstrlen(A/W): a plain loop is used, without even basic loop-unrolling optimization.

The question is: if the functions provided by NTDLL are fully optimized, why do you refuse to use them?

PS: I don't think you have to use rep movsb if you need only one mov or two.

lhmouse commented 7 years ago

I have never looked at the disassembly of the memcpy provided by NTDLL because I don't know how to call it directly in my code.

Add -lntdll into the command line before other libraries, or use LoadLibrary() + GetProcAddress().

I'm just guessing. But I have looked at the disassembly of lstrlen(A/W): a plain loop is used, without even basic loop-unrolling optimization.

On modern processors manual loop unrolling helps little. Nevertheless you might want to have a look at this. strlen(s) is effectively (size_t)(rawmemchr(s, 0) - s). There is a variant for wchar_t too.

The question is: if the functions provided by NTDLL are fully optimized, why do you refuse to use them?

They aren't. At least they can't be inlined.

PS: I don't think you have to use rep movsb if you need only one mov or two.

As a compiled function it just can't know how many bytes to copy/fill. There is no silver bullet.

CarterLi commented 7 years ago

The result is that __builtin_memcpy() and __builtin_memset() always create function calls that can't be inlined at all, so we have to implement inline ones ourselves. In reality, the inline versions in mcfgthread take up less room and don't contain any branches at all, surpassing the inline ones generated by GCC.

https://godbolt.org/g/gb1Nmb

It seems that GCC tries to align the pointer to a QWORD boundary for unknown memory addresses. Is that required by stosq?

EDIT: GCC won't align pointers when compiling with -Os

GCC suggests a clever solution. If an unpadded buffer size is given, fewer than one DWORD/QWORD remains after stosq/stosd.

e.g.

| 0 0 0 0, 0 0 0 0, 0 0 0 0, x x x |

Instead of stosb-ing the last 3 bytes, we can store the last DWORD in one go:

| 0 0 0 0, 0 0 0 0, 0 0 0 _, _ _ _ |

https://godbolt.org/g/JNbLDu

The result is that __builtin_memcpy() and __builtin_memset() always create function calls that can't be inlined at all, so we have to implement inline ones ourselves.

Which version of GCC are you using?

CarterLi commented 7 years ago

Nevertheless you might want to have a look at this.

Yes I did it several years ago. But you can't do it since

MSYS2 doesn't enable SSE2 for i686

-_-

This is only true if the number of instructions is bounded.

Citation needed

CarterLi commented 7 years ago

On modern Intel CPUs you can do memcpy with a brute rep movsb and gain maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.

There is one: https://godbolt.org/g/lKXWoF

lhmouse commented 7 years ago

It seems that GCC tries to align the pointer to a QWORD boundary for unknown memory addresses. Is that required by stosq?

No: x86 handles unaligned memory access transparently. But unaligned access might be slow. That could well be the case on some AMD and some older Intel processors, but apparently it isn't the case on mine (aligned and unaligned stosq are equally fast).

Which version of GCC are you using?

gcc-6-branch HEAD with -O3 -ffreestanding.

This is only true if the number of instructions is bounded. Citation needed

If there are an indeterminate number of instructions, you would have to create a loop to do that, and then movdqu would be a bit slower than movdqa (at any rate it would not be any faster). If you are sure you only need two moves of DQWORDs, then two movdqu would probably be faster than aligning the pointer + copying the unaligned bytes + copying a DQWORD with movdqa.

CarterLi commented 7 years ago

gcc-6-branch HEAD with -O3 -ffreestanding.

Tested with -ffreestanding; GCC doesn't generate a memset call. There must be something wrong with your compiler.

CarterLi commented 7 years ago

If there are an indeterminate number of instructions you would have to create a loop to do that, then movdqu would be a bit slower than movdqa

Again: it's the Intel guys who say they have the same latency and throughput, not me.

[screenshots: Intel Intrinsics Guide latency/throughput tables for the load/store intrinsics]

For older CPUs, I don't know.

lhmouse commented 7 years ago

If these two instructions were equally fast, Intel people wouldn't have bothered to create an additional one at the cost of more transistors and more power consumption. The movdqu instruction isn't necessarily slower, but it might be. There are a handful of articles about unaligned memory access with SIMD, for example, this one on software.intel.com.

lhmouse commented 7 years ago

gcc-6-branch HEAD with -O3 -ffreestanding.

Tested with -ffreestanding; GCC doesn't generate a memset call. There must be something wrong with your compiler.

GCC will not, if the number of bytes is known at compile time and is not very large. It will generate a function call if the number can't be predicted, for example, when it comes from a function parameter.

CarterLi commented 7 years ago

Yes, maybe Core 2 or P4, maybe AMDs. But the problem is: for buffers on the stack, pointers are usually aligned, so it's OK; buffers allocated on the heap are usually unaligned (because of new, and few people are aware of aligned_malloc) but padded. In that case, if you must align the pointer, extra ugly movs are needed, which results in ugly code. You don't like them, do you?

For buffer sizes that are known at compile time, we won't bother optimising them. Just call memset/memcpy.

lhmouse commented 7 years ago

buffers allocated on the heap are usually unaligned (because of new, and few people are aware of aligned_malloc) but padded.

The pointer returned by a successful malloc(), calloc() or realloc() function call must be aligned such that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated) (See ISO/IEC WG14 N1570, 7.22.3 Memory management functions).

CarterLi commented 7 years ago

It will generate a function call if the number can't be predicted, for example, when it comes from a function parameter.

If there is no more info at compile time, no more optimization can be done after inlining, except a larger binary. So why do you insist on inlining them?

CarterLi commented 7 years ago

The pointer returned by a successful malloc(), calloc() or realloc() function call must be aligned

Aligned to what? 8 on x86? That doesn't help.

lhmouse commented 7 years ago

If there is no more info at compile time, no more optimization can be done after inlining, except a larger binary. So why do you insist on inlining them?

In the case of C11 threads and gthreads there are only a few bytes to copy, so a call to an external function is overkill.

Aligned to what? 8 on x86? That doesn't help.

We just assume everything is properly aligned.

CarterLi commented 7 years ago

Let's stop this meaningless arguing.

Just see the rep* families: http://www.felixcloutier.com/x86/REP:REPE:REPZ:REPNE:REPNZ.html

It seems that they were born for the mem* functions. Besides rep movs and rep stos:

  1. repne scas: memchr, strlen
  2. repe cmps: memcmp

It doesn't seem that GCC can generate those instructions from __builtins, so it would be much more meaningful to implement them ourselves.

And rep lods: I don't know its use case...

lhmouse commented 7 years ago

And rep lods: I don't know its use case...

It might be useful without rep.

CarterLi commented 7 years ago

On modern Intel CPUs you can do memcpy with a brute rep movsb and gain maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.

[screenshot: disassembly of the MSVCRT memset implementation using rep stosb]

Oh god! M$ does this in their memset implementation and calls it Enhanced Fast Strings...

But the problem is: it's REALLY fast according to my quick test....

lhmouse commented 7 years ago

@CarterLi

See 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) in http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf.

CarterLi commented 7 years ago

Yes @lhmouse

Just googled this: http://stackoverflow.com/questions/33480999/how-can-the-rep-stosb-instruction-execute-faster-than-the-equivalent-loop#answer-33485055

It's a CPU (hardware) level optimisation, OMG


Another quick test shows that stosb has almost the same speed as stosq. In addition, it has no alignment problems.