CarterLi closed this issue 7 years ago.
We have -ffreestanding enabled, so a function call is a function call. If we have a call to memcpy() or __builtin_memcpy(), it results in an external call to that function. Hardcoding inline versions of these functions might sound like a bad idea, but it is an effective way to prevent GCC from generating such external function calls completely.
Why do we use -ffreestanding in such a lib for multi-threading?

I have been carefully checking the use of these functions. We do have better solutions (indeed SSE2 is a lot better) but it can't be enabled on x86 for historical reasons.
We already dropped XP/Vista. I see no reason to support CPUs that don't support SSE2 (which means Pentium III and older).
For personal use, I'd rather recompile the whole library with -march=native, which means AVX2.
And don't align memory manually when the given pointer is not aligned. Just use the loadu functions directly. They have the same performance on modern CPUs.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#cats=Load,Store&techs=SSE2&expand=3144,3083
Why do we use -ffreestanding in such a lib for multi-threading? Why can we call external Win32 APIs but not a standard, generally used C API?
That is because MSVCRT is unavailable. NTDLL does have many C library functions exported; however, I just want to get them out of my sight. This is true even for MCFCRT, which has basically three versions of memcpy() (an inline version using rep movs, and inline and external versions using SSE2).
XP support was dropped, but Vista support was not despite being completely untested.
MSYS2 doesn't enable SSE2 for i686 and I have to follow that. Note that Intel is still selling super tiny x86 CPUs that don't even have SSE.
And don't align memory manually when given pointer is not aligned. Just use loadu functions directly. They do have the same performance in modern CPUs.
This is only true if the number of instructions is bounded.
That is because MSVCRT is unavailable.
That is not the case. 99% of people that use pthread (mcfgthread in this case) will use standard C APIs. For MinGW, MSVCRT is always linked.
MSYS2 doesn't enable SSE2 for i686
OMG... So let's #ifndef
That is not the case. 99% of people that use pthread (mcfgthread in this case) will use standard C APIs. For MinGW, MSVCRT is always linked.
OMG... So let's #ifndef
No, just keep the rep movs version. It is generic and efficient enough for small chunks of bytes.
Summary:
I see no reason for such a performance-oriented project to do such a bad thing.
You add another two functions, which means larger code base, larger binary, and maybe more bugs.
There may be. There ain't going to be. There is a maintainer.
You don't use builtins but write it yourself, which means you deny the compiler the chance to optimize the code.
I don't want it to.
In addition, the compiler can't assume that buffers are zeroed or copied, which may prevent other optimizations.
At the moment, GCC doesn't produce better code given the assumption. I don't think it is going to, either.
You use inline asm, which means your code is harder to migrate.
Basically: I don't care. The code is designed for x64 and x86, and will work on x64 and x86, period. If someone wants to port code to other platforms, just do it. I appreciate it.
I see no reason for such a performance-oriented project to do such a bad thing.
Is it? We are being semantically correct by adding -ffreestanding just because MSVCRT isn't there, despite the fact that NTDLL exports memcpy() (which is a synonym of RtlCopyMemory()) and memset(), aren't we? The result is that __builtin_memcpy() and __builtin_memset() always create function calls that can't be inlined at all, so we have to implement inline ones ourselves. In reality, the inline versions in mcfgthread take up less room and don't contain any branches at all, surpassing the inline ones generated by GCC.
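For illustration, a branch-free copy along these lines can be written with rep movsb in GNU inline asm (x86/x64 only; this is a generic sketch of the idea, not mcfgthread's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic sketch: copy n bytes with a single `rep movsb`.
 * No size checks, no branches; rcx = n drives the repeat. */
static void *memcpy_movsb(void *dst, const void *src, size_t n) {
    void *d = dst;
    const void *s = src;
    __asm__ ("rep movsb"
             : "+D"(d), "+S"(s), "+c"(n)  /* rdi, rsi, rcx updated by the insn */
             :
             : "cc", "memory");
    return dst;
}
```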
Well... Are those functions provided by NTDLL that awful? Then why do they provide such functions instead of using the highly optimised ones provided by MSVCRT? For stability or something? Instead of providing a new function, can we NTR NTDLL by using the same function name?
Such functions provided by NTDLL are usually stable and not as optimized as those provided by MSVCR???.
I pushed a more optimized version of inline_mem.h. See https://godbolt.org/g/4mJ2P0.
Well... Are those functions provided by NTDLL that awful? So why do they provide such a function instead of using the highly optimised one provided by MSVCRT? For stability or something?
Microsoft people do have their own considerations. On modern Intel CPUs you can do memcpy with a brute rep movsb and gain the maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.
Instead of providing a new function, can we NTR NTDLL by using the same function name?
No comments.
Microsoft people do have their own considerations. On modern Intel CPUs you can do memcpy with a brute rep movsb and gain the maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.
I have never looked at the disassembly of the memcpy provided by NTDLL because I don't know how to call it directly in my code. I'm just guessing. But I have looked at the disassembly of lstrlen(A/W): a plain loop is used, even without basic loop unrolling.

The question is: if the functions provided by NTDLL are fully optimized, why do you refuse to use them?

PS: I don't think you have to rep movsb if you need only one mov or two.
I have never looked at the disassembly of the memcpy provided by NTDLL because I don't know how to call it directly in my code.
Add -lntdll to the command line before other libraries, or use LoadLibrary() + GetProcAddress().
I'm just guessing. But I have looked at the disassembly of lstrlen(A/W): a plain loop is used, even without basic loop unrolling.
On modern processors manual loop unrolling helps little.
Nevertheless you might want to have a look at this. strlen(s) is effectively (size_t)(rawmemchr(s, 0) - s). There is a variant for wchar_t too.
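For reference, a minimal sketch of that identity using glibc's rawmemchr() (a GNU extension, so this is glibc-specific; the string must be NUL-terminated since rawmemchr() has no length limit):

```c
#define _GNU_SOURCE     /* rawmemchr() is a GNU extension */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strlen(s) expressed as (size_t)(rawmemchr(s, 0) - s). */
static size_t strlen_via_rawmemchr(const char *s) {
    return (size_t)((const char *)rawmemchr(s, '\0') - s);
}
```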
The question is: if the functions provided by NTDLL are fully optimized, why do you refuse to use them?
They aren't. At least they can't be inlined.
PS: I don't think you have to rep movsb if you need only one mov or two.
As a compiled function it just can't know, until run time, how many bytes to copy or fill. There is no silver bullet.
The result is that __builtin_memcpy() and __builtin_memset() always create function calls that can't be inlined at all, so we have to implement inline ones ourselves. In reality, the inline versions in mcfgthread take up less room and don't contain any branches at all, surpassing the inline ones generated by GCC.
stosq and friends: it seems that GCC tries to align the pointer to a QWORD boundary for unknown memory addresses. Is that required by stosq?

EDIT: GCC won't align pointers when compiling with -Os.
GCC suggested a clever solution. If an unpadded buffer size is given, fewer than one DWORD/QWORD will remain after stosq/stosd, e.g.:

| 0 0 0 0, 0 0 0 0, 0 0 0 0, x x x |

Instead of using stosb for the last 3 bytes, we can set the last DWORD once with a single overlapping mov, unlike what you did:

| 0 0 0 0, 0 0 0 0, 0 0 0 _, _ _ _ |

The result is that __builtin_memcpy() and __builtin_memset() always create function calls that can't be inlined at all, so we have to implement inline ones ourselves.
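The overlapping-store idea above can be sketched in portable C; fill_small() is a hypothetical helper for illustration, not code from this project:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overlapping-store trick: for 4 <= n <= 8, fill n bytes with exactly
 * two 4-byte stores, the second overlapping the first, instead of one
 * 4-byte store plus a byte-wise tail loop. */
static void fill_small(unsigned char *p, unsigned char c, size_t n) {
    assert(n >= 4 && n <= 8);
    uint32_t v = 0x01010101u * c;   /* broadcast the byte into a DWORD */
    memcpy(p, &v, 4);               /* head: bytes [0, 4)              */
    memcpy(p + n - 4, &v, 4);       /* tail: bytes [n-4, n), may overlap */
}
```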
Which version of GCC are you using?
Nevertheless you might want to have a look at this.
Yes, I did that several years ago. But you can't do it, since
MSYS2 doesn't enable SSE2 for i686
-_-
This is only true if the number of instructions is bounded.
Citation needed
On modern Intel CPUs you can do memcpy with a brute rep movsb and gain maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.
There is one: https://godbolt.org/g/lKXWoF
It seems that GCC tries to align the pointer to QWORD for unknown memory addresses. Is it required by stosq?
As required by x86, which handles unaligned memory access transparently: no. But unaligned access might be slow. That could well be the case on some AMD and some old Intel processors, but apparently it isn't the case on mine (aligned and unaligned stosq are equally fast).
Which version of GCC are you using?
gcc-6-branch HEAD with -O3 -ffreestanding.
This is only true if the number of instructions is bounded. Citation needed
If there is an indeterminate number of instructions, you would have to create a loop to do that; then movdqu would be a bit slower than movdqa (though it would not be any faster either). If you are sure you only need two moves of DQWORDs, then two movdqu would probably be faster than aligning the pointer + copying the unaligned bytes + copying a DQWORD with movdqa.
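The "two movdqu, no alignment prologue" case can be sketched with SSE2 intrinsics; copy32_unaligned() is a hypothetical helper for illustration:

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <string.h>

/* Copy exactly 32 bytes with two unaligned loads and stores.
 * No pointer-alignment prologue is needed: movdqu tolerates any
 * alignment, which is the point being argued above. */
static void copy32_unaligned(void *dst, const void *src) {
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)((const char *)src + 16));
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)((char *)dst + 16), hi);
}
```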
gcc-6-branch HEAD with -O3 -ffreestanding.
Tested with -ffreestanding; GCC won't generate a memset call. There must be something wrong with your compiler.
If there are an indeterminate number of instructions you would have to create a loop to do that, then movdqu would be a bit slower than movdqa
Again: it's the Intel guys who say those have the same latency and throughput, not me.
For older CPUs, I don't know.
If these two instructions were equally fast, Intel people wouldn't have bothered to create an additional one at the cost of more transistors and more power consumption. The movdqu instruction isn't necessarily slower, but it might be. There are a handful of articles about unaligned memory access with SIMD, for example this one on software.intel.com.
gcc-6-branch HEAD with -O3 -ffreestanding.
Tested with -ffreestanding; GCC won't generate memset call. There must be something wrong with your compiler.
GCC will not if the number of bytes is known at compile time and is not very large. It will generate a function call if the number can't be predicted, for example when it comes from a function parameter.
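A minimal illustration of the two cases; what GCC actually emits for each is best inspected with gcc -O2 -S (the function names here are placeholders):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size known at compile time: GCC typically expands this inline. */
void clear_fixed(char *buf) {
    memset(buf, 0, 16);
}

/* Size only known at run time: under -ffreestanding GCC emits an
 * external call to memset here, which is the case discussed above. */
void clear_dynamic(char *buf, size_t n) {
    memset(buf, 0, n);
}
```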
Yes, maybe Core 2 or P4, maybe AMDs. But the problem is: for buffers on the stack, pointers are usually aligned, so it's OK; buffers allocated on the heap are usually unaligned (because of new, and few people are aware of aligned_malloc) but padded. In this case, if you must align the pointer, extra ugly movs are needed, which results in ugly code. You don't like them, do you?

For buffer sizes which are known at compile time, we won't bother optimising them. Just call memset/memcpy.
buffers allocated on the heap are usually unaligned (because of new, and few people are aware of aligned_malloc) but padded.
The pointer returned by a successful malloc(), calloc(), or realloc() call must be aligned such that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated). (See ISO/IEC WG14 N1570, 7.22.3 Memory management functions.)
It will generate a function call if the number can't be predicted, for example, when it comes from a function parameter.
If there is no more info at compile time, no more optimization can be done after inlining, except a larger binary. So why do you insist on inlining them?
The pointer returned by a successful malloc(), calloc() or realloc() function call must be aligned
Aligned to what? 8 for x86? That doesn't help.
If no more info at compile time, no more opimization can be done after inlining, except larger binary. So why do you insist inlining them?
In the case of C11 threads and gthreads there are only a few bytes to copy, so a call to an external function is overkill.
Aligned to what? 8 for x86? Doesn't help
We just assume everything is properly aligned.
Let's stop this meaningless arguing.
Just see the rep* families: http://www.felixcloutier.com/x86/REP:REPE:REPZ:REPNE:REPNZ.html

It seems that they are born for the mem* functions. Besides rep movs and rep stos:

repne scas: memchr, strlen
repe cmps: memcmp

It doesn't seem that GCC can generate those instructions with __builtins, so it would be much more meaningful to implement them ourselves.

And rep lods, I don't know its use case...
And rep lods, I don't know its use case...

It might be useful without rep.
On modern Intel CPUs you can do memcpy with a brute rep movsb and gain maximum throughput that your hardware allows - this is not the case for movsw whatsoever. Is anyone willing to do that? I don't think so.
Oh god! M$ does this in their memset implementation and calls it Enhanced Fast Strings...
But the problem is: it's REALLY fast according to my quick test....
@CarterLi See 3.7.6 "Enhanced REP MOVSB and STOSB operation (ERMSB)" in http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf.
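For completeness, a rep stosb memset in GNU inline asm (x86/x64 only; a sketch of the ERMSB approach discussed here, not Microsoft's implementation):

```c
#include <assert.h>
#include <stddef.h>

/* memset via a single `rep stosb`: al holds the fill byte, rcx the
 * count. On CPUs with ERMSB this reaches the throughput of much
 * wider stores, with no alignment handling at all. */
static void *memset_stosb(void *dst, int c, size_t n) {
    void *d = dst;
    __asm__ ("rep stosb"
             : "+D"(d), "+c"(n)
             : "a"(c)
             : "cc", "memory");
    return dst;
}
```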
Yes @lhmouse
Just googled this: http://stackoverflow.com/questions/33480999/how-can-the-rep-stosb-instruction-execute-faster-than-the-equivalent-loop#answer-33485055
It's CPU (hardware) level optimisation, OMG
Another quick test shows that stosb has almost the same speed as stosq. In addition, it has no alignment problems.

For example, memcpy for a small buffer: https://godbolt.org/g/AkekLW
And buffers for memcpy with constant size are usually small and aligned.
So don't do that.