Closed skeeto closed 5 months ago
I was initially looking at the diff without reading the description and wondered why use x86 asm as that limits the usefulness of this to a single arch. But after reading the motivation (faster debug builds) and considering the context (x86 is pretty much the only arch worth considering on windows) - the choice makes sense. Though, I can't comment on the actual implementation since I'm not familiar with those instructions.
As for the builtins, I don't see myself using them, at least not directly, to avoid unnecessary compiler requirements. But perhaps with some ifdefs it could work out. Though debug build performance hasn't been an issue for me thus far.
I don't see anything wrong with it unless there's some convention of which I'm unaware.
Me neither. But given how generic "memory" is, maybe a short suffix would be safer to avoid collision.
Other than that, nothing else particularly stands out.
I like this idea, and I think it would be great if mingw-w64 could provide the implementations instead, like visual studio seems to do, as you mentioned. I don't see myself using the w64devkit provided lib much, it looks like the sort of thing I'd prefer to copy into my own project, since I still value cross-compiling on my projects using "regular" mingw-w64 Linux cross toolchains.
Thanks for the responses!
@skeeto I was looking at the gcc 14 changes:
https://gcc.gnu.org/gcc-14/changes.html
And I noticed an intriguing option:
New option -finline-stringops, to force inline expansion of memcmp, memcpy, memmove and memset, even when that is not an optimization, to avoid relying on library implementations.
I did some testing in compiler explorer, and it may have made libmemory irrelevant: https://gcc.godbolt.org/z/o83do6W4r
Even the "invisible" generated calls to the string.h functions are replaced with inline implementations! At -Os it uses string instructions, and at -O2 it's using movaps to implement memset.
Definitely worth some investigation.
Doesn't seem very reliable. You still need -fno-builtin / -ffreestanding to avoid it "optimizing" your string functions into library calls: https://godbolt.org/z/G8vToKEjv
Also, I'm assuming -finline-stringops is basically a wrapper around -mstringop-strategy which selects all the available algorithms except libcall. Which is not bad, but I would've preferred if they just let the user choose multiple algos with -mstringop-strategy instead.
Thanks for the heads up, @Peter0x44! On the surface that sounds like what I want, and I like how -Os uses string instructions, but I see @N-R-K has counter-examples. The most thorough option continues to be undocumented behavior from -fno-builtin, which Clang doesn't have at all. Even so, GCC seems to prefer mem{set,cpy} for large initializations/assignments despite -fno-builtin. Though -finline-stringops seems to deal with those cases, as your example shows.
https://gcc.godbolt.org/z/jhcsxbx4K
This is an old issue, and all three major compilers are on the same side of it, so I've accepted that external definitions are simply part of the territory. -fno-builtin smooths over some of it, but it ultimately needs a fallback, which is what libmemory.a is about. I've used it with Clang and MSVC, and on Linux as well. If that's too inconvenient on Linux, another trick:
$ musl-gcc -static -nostartfiles ...
That plucks the necessary definitions out of musl without any baggage or interference. It's similar to what I've done on Windows before / without -lmemory. Without musl-clang, you can do the same with Clang by linking musl's libc.a explicitly.
Had another idea today: libchkstk.a, providing ___chkstk_ms (plus __chkstk for MSVC as a bonus). A call to this function is inserted in every function with a stack frame larger than 4KiB in order to incrementally grow the stack. Mine is a lot better than the one in libgcc, and perhaps even something I could tolerate.
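To make the trigger concrete, here is a minimal sketch of the kind of function that gets a stack probe. The name `big_frame_sum` and the 64KiB buffer are hypothetical, chosen only to exceed the 4KiB threshold mentioned above. On Windows targets GCC would be expected to call ___chkstk_ms in this function's prologue; on other targets the code still runs, just without that probe call.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical example: 64KiB of locals, well past the 4KiB threshold,
// so a Windows-targeting GCC is expected to probe the stack on entry.
static unsigned big_frame_sum(unsigned char seed, size_t n)
{
    volatile unsigned char buf[1 << 16];  // volatile keeps the frame alive
    assert(n <= sizeof(buf));
    for (size_t i = 0; i < n; i++) {
        buf[i] = (unsigned char)(seed + i);
    }
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}
```

Compiling this with -S and searching the output for chkstk is one way to check whether a given toolchain inserts the call.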
Also, it turns out the __chkstk in x86-64 libgcc is totally busted. Clearly nobody has ever used it (i.e. linked libgcc with x64 MSVC objects that need it), because it would be immediately obvious that it didn't work. The i686 definition is fine, though, if still suboptimal.
I also learned that the x86 __chkstk is bonkers: Unlike ___chkstk_ms, the actual frame allocation occurs inside __chkstk, making it the most difficult to write. For x64, Microsoft chose GCC's semantics instead. (Deliberate copying, or just coincidence?)
(Also, that's three underscores in ___chkstk_ms, which I never noticed before!)
I'm satisfied with the name, but I'm going to dogfood it awhile before making a decision about inclusion.
This static library provides memset, memcpy, memmove, memcmp, and strlen. The implementations are minimal wrappers around x86 string instructions, making them both tiny (~25 bytes apiece) and reasonably performant, though not nearly as fast as CRT implementations. I selected these particular functions because GCC fabricates calls to each out of thin air at some optimization levels. The first four are also genuinely useful as built-ins (__builtin_memset, etc.), keeping in mind arbitrary limitations with null.
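As an illustration of what "minimal wrappers around x86 string instructions" can look like, here is a sketch in GCC/Clang extended asm. It is not the library's actual source: the `sketch_` names are hypothetical (to avoid colliding with the real functions), and the byte-loop fallback for non-x86 targets is an addition for portability of the example.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical memset: "rep stosb" stores AL to [rdi], rcx times.
// Assumes the direction flag is clear, as the ABI requires at entry.
static void *sketch_memset(void *dst, int c, size_t len)
{
    void *r = dst;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile (
        "rep stosb"
        : "+D"(dst), "+c"(len)
        : "a"(c)
        : "memory"
    );
#else
    for (unsigned char *p = dst; len--;) *p++ = (unsigned char)c;
#endif
    return r;
}

// Hypothetical memcpy: "rep movsb" copies rcx bytes from [rsi] to [rdi].
static void *sketch_memcpy(void *dst, const void *src, size_t len)
{
    void *r = dst;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile (
        "rep movsb"
        : "+D"(dst), "+S"(src), "+c"(len)
        :
        : "memory"
    );
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len--) *d++ = *s++;
#endif
    return r;
}

// Usage check: fill one buffer, copy it, and verify every byte.
static int sketch_selftest(void)
{
    char a[8], b[8];
    sketch_memset(a, 'x', sizeof(a));
    sketch_memcpy(b, a, sizeof(a));
    for (size_t i = 0; i < sizeof(b); i++) {
        assert(b[i] == 'x');
    }
    return 1;
}
```

Each wrapper compiles to a handful of instructions, which is consistent with the small per-function size described above.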
Mingw-w64 does not implement them but instead offers imports from system DLLs: -lmsvcrt, -lmsvcr120, -lucrt, -lntdllcrt. That creates a dependency on a system DLL: practical, though suboptimal, and no licensing woes. With -nostartfiles instead of -nostdlib, this happens implicitly.
MSVC provides static definitions in libvcruntime.lib, useful when cl or clang-cl fabricates calls. They weigh several KB each, and require the application or installer to present a EULA ("external end users to agree to terms"). Not a great trade-off just for some basic memory functions.
In w64devkit, libmemory fills this role. It's a public domain library so there are no license terms. Just add -lmemory to the build command. It also allows liberal use of the associated built-ins, especially in debug builds, which can benefit from fast memory operations despite -O0. MSVC toolchains can also freely use libmemory.a, though it won't know how to find it without help of course.
I do not plan to add more functions except for cases of GCC fabricating calls (e.g. strlen). Certainly no null-terminated string functions.
I'm interested in feedback before settling the details (@N-R-K, @Peter0x44). (Responses in my public inbox are welcome, too.) I'm not particularly attached to the name (-lmemory), but I don't see anything wrong with it unless there's some convention of which I'm unaware. I've been dogfooding this branch, and this little library has already proven useful.
First, as noted, this basically eliminates problems with GCC inserting calls to these functions, at least within w64devkit. That includes software built for w64devkit (u-config, etc.), as it's built during bootstrap, though nothing uses it yet.
Second, as I later discovered, the matching GCC built-ins are now as reliable as intrinsics. They Just Work, even without a CRT, exactly the way I want, and I don't have to worry about supplying implementations later. GCC will do smart things with the built-ins even at low optimization levels. For example, -O1 isn't enough for GCC to unroll/vectorize copy/zero loops, but with __builtin_* it's as smart as -O2. At -O0 it's a call into libmemory, which is fast without hurting debugging. With hand-written loops, memory copying/zeroing/etc. are bottlenecks at -O0 (lots of iterations with lots of instructions per byte), but with built-ins my debug builds are practically as fast as my "release" builds.
Here are some samples. A quick allocator, fast even in release builds:
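A minimal sketch of such an allocator, not the original sample: the `Arena` type and `arena_alloc` function are hypothetical names, and the bump-allocation design is an assumption. The point it illustrates is that zeroing goes through `__builtin_memset` rather than a hand-written loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bump allocator over a caller-supplied region.
typedef struct {
    char *beg;
    char *end;
} Arena;

// Allocate count zero-initialized objects of the given size and alignment.
// Returns a null pointer when the arena cannot satisfy the request.
static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    assert(size > 0 && count >= 0);
    assert(align > 0 && (align & (align - 1)) == 0);  // power of two
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    if (avail < 0 || count > avail / size) {
        return 0;  // out of memory
    }
    char *r = a->beg + pad;
    a->beg = r + size * count;
    // The built-in becomes inline code or a memset call, depending on
    // optimization level; either way there is no per-byte loop at -O0.
    return __builtin_memset(r, 0, (size_t)(size * count));
}
```

Dividing `avail` by `size` before comparing against `count` avoids overflow in the size computation for hostile counts.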
String comparison (deals with the stupid null thing, too):
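A sketch of what such a comparison might look like, assuming a hypothetical pointer-plus-length `Str` type (not taken from the original sample). The "null thing" is that memcmp may not be passed a null pointer even with a zero length, so the empty case is handled before the call.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical counted string; data may be null when len is zero.
typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Returns nonzero when both strings hold the same bytes.
static int str_equals(Str a, Str b)
{
    assert(a.len >= 0 && b.len >= 0);
    if (a.len != b.len) {
        return 0;
    }
    // Skip the call entirely for empty strings so that a null data
    // pointer never reaches memcmp.
    return !a.len || !__builtin_memcmp(a.data, b.data, (size_t)a.len);
}
```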
String cloning:
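A sketch of cloning under the same assumptions: the `Str` and `Arena` types and the `str_clone` name are hypothetical, and returning an empty string on allocation failure is a simplification chosen for this example, not necessarily what the original sample did.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical counted string and bump arena, as in a CRT-free program.
typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    char *beg;
    char *end;
} Arena;

// Copy s into the arena. Returns an empty Str if s is empty or if the
// arena lacks room; callers can compare the result's len with s.len.
static Str str_clone(Arena *a, Str s)
{
    assert(s.len >= 0);
    Str r = {0, 0};
    if (s.len == 0 || s.len > a->end - a->beg) {
        return r;
    }
    // Never reached with len == 0, so a null s.data is not passed along.
    r.data = __builtin_memcpy(a->beg, s.data, (size_t)s.len);
    r.len  = s.len;
    a->beg += s.len;
    return r;
}
```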
Putting it all together, this string table is practically as fast at -O0 as -O3!
So I'm quite happy with it. Just want to make sure I'm not missing anything.