skeeto / w64devkit

Portable C and C++ Development Kit for x64 (and x86) Windows
The Unlicense

New runtime library: libmemory #102

Closed skeeto closed 5 months ago

skeeto commented 6 months ago

This static library provides memset, memcpy, memmove, memcmp, and strlen. The implementations are minimal wrappers around x86 string instructions, making them both tiny (~25 bytes apiece) and reasonably performant, though not nearly as fast as CRT implementations. I selected these particular functions because GCC fabricates calls to each out of thin air at some optimization levels. The first four are also genuinely useful as built-ins (__builtin_memset, etc.), keeping in mind arbitrary limitations with null.
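To make "minimal wrappers around x86 string instructions" concrete, here is a rough sketch of what such wrappers could look like using GCC/Clang extended inline assembly. The names `sketch_memset` and `sketch_memcpy` are hypothetical, and this is not the source shipped in libmemory, only an illustration of the technique. It assumes the direction flag is clear on entry, as the usual calling conventions require, and falls back to a byte loop on non-x86 targets.

```c
#include <stddef.h>

// Hypothetical names: an illustration only, not libmemory's actual code.
static void *sketch_memset(void *dst, int c, size_t len)
{
    void *ret = dst;
#if defined(__x86_64__) || defined(__i386__)
    // rep stosb: store AL at [DI], CX times, advancing DI each step.
    __asm__ volatile (
        "rep stosb"
        : "+D"(dst), "+c"(len)
        : "a"(c)
        : "memory"
    );
#else
    unsigned char *p = dst;
    while (len--) *p++ = (unsigned char)c;
#endif
    return ret;
}

static void *sketch_memcpy(void *dst, const void *src, size_t len)
{
    void *ret = dst;
#if defined(__x86_64__) || defined(__i386__)
    // rep movsb: copy a byte from [SI] to [DI], CX times.
    __asm__ volatile (
        "rep movsb"
        : "+D"(dst), "+S"(src), "+c"(len)
        :
        : "memory"
    );
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len--) *d++ = *s++;
#endif
    return ret;
}
```

Each function body boils down to a single string instruction plus a return, which is consistent with the tiny per-function size mentioned above.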

Mingw-w64 does not implement them but instead offers imports from system DLLs: -lmsvcrt, -lmsvcr120, -lucrt, -lntdllcrt. Linking any of these creates a dependency on a system DLL: practical, though suboptimal, and free of licensing woes. With -nostartfiles instead of -nostdlib, this happens implicitly.

MSVC provides static definitions in libvcruntime.lib, useful when cl or clang-cl fabricates calls. They weigh several KB each, and require the application or installer to present a EULA ("external end users to agree to terms"). Not a great trade-off just for some basic memory functions.

In w64devkit, libmemory fills this role. It's a public domain library so there are no license terms. Just add -lmemory to the build command. It also allows liberal use of the associated built-ins, especially in debug builds, which can benefit from fast memory operations despite -O0. MSVC toolchains can also freely use libmemory.a, though they won't know how to find it without help of course.

I do not plan to add more functions except for cases of GCC fabricating calls (e.g. strlen). Certainly no null-terminated string functions.


I'm interested in feedback before settling the details (@N-R-K, @Peter0x44). (Responses in my public inbox are welcome, too.) I'm not particularly attached to the name (-lmemory), but I don't see anything wrong with it unless there's some convention of which I'm unaware. I've been dogfooding this branch, and this little library has already proven useful.

First, as noted, this basically eliminates problems with GCC inserting calls to these functions, at least within w64devkit. That includes software built for w64devkit (u-config, etc.), as it's built during bootstrap, though nothing uses it yet.

Second, as I later discovered, the matching GCC built-ins are now as reliable as intrinsics. They Just Work, even without a CRT, exactly the way I want, and I don't have to worry about supplying implementations later. GCC will do smart things with the built-ins even at low optimization levels. For example, -O1 isn't enough for GCC to unroll/vectorize copy/zero loops, but with __builtin_* it's as smart as -O2. At -O0 it's a call into libmemory, which is fast without hurting debugging. With hand-written loops, memory copying/zeroing/etc. are bottlenecks at -O0 (lots of iterations with lots of instructions per byte), but with built-ins my debug builds are practically as fast as my "release" builds.

Here are some samples. A quick allocator, fast even in debug builds:

static char *alloc(arena *a, int size, int count)
{
    int align = -((unsigned)size*count) & (sizeof(void *) - 1);
    if (count > (a->end - a->beg - align)/size) {
        __builtin_trap();
    }
    a->end -= align + size*count;
    return __builtin_memset(a->end, 0, size*count);
}

String comparison (deals with the stupid null thing, too):

static int strequals(str a, str b)
{
    return a.len==b.len && !(a.len && __builtin_memcmp(a.data, b.data, a.len));
}

String cloning:

static str strclone(str s, arena *perm)
{
    str r = {0};
    r.data = new(perm, char, s.len);
    r.len = s.len;
    if (s.len) __builtin_memcpy(r.data, s.data, s.len);
    return r;
}

Putting it all together, this string table is practically as fast at -O0 as -O3!

typedef struct tab tab;
struct tab {
    tab *child[2];
    str  key;
};

static str intern(tab **t, str key, arena *perm)
{
    for (unsigned h = strhash(key); *t; h <<= 1) {
        if (strequals((*t)->key, key)) {
            return (*t)->key;
        }
        t = &(*t)->child[h>>31];
    }
    return (*t = new(perm, tab, 1))->key = strclone(key, perm);
}
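The samples above lean on a few definitions that the comment does not show: `arena`, `str`, the `new` macro, and `strhash`. The following is one plausible set, inferred only from how the samples use them. Every definition here is an assumption, not the author's code: in particular `len` is taken to be an `int` because `alloc` takes `int` arguments, `strhash` is filled in with 32-bit FNV-1a purely for illustration, and `unsigned` is assumed to be 32 bits wide since `intern` indexes with `h>>31`. `alloc` is repeated from the sample so the block compiles on its own with GCC or Clang.

```c
#include <stddef.h>

// Assumed shapes, inferred from usage in the samples.
typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct {
    char *data;
    int   len;
} str;

// Repeated from the sample above so this block is self-contained.
static char *alloc(arena *a, int size, int count)
{
    int align = -((unsigned)size*count) & (sizeof(void *) - 1);
    if (count > (a->end - a->beg - align)/size) {
        __builtin_trap();
    }
    a->end -= align + size*count;
    return __builtin_memset(a->end, 0, size*count);
}

// Hypothetical helper macro: allocate n zeroed objects of type t.
#define new(a, t, n) (t *)alloc(a, sizeof(t), n)

// Hypothetical hash (32-bit FNV-1a); any hash with well-mixed
// high bits would serve the same purpose in intern().
static unsigned strhash(str s)
{
    unsigned h = 0x811c9dc5u;
    for (int i = 0; i < s.len; i++) {
        h ^= (unsigned char)s.data[i];
        h *= 0x01000193u;
    }
    return h;
}
```

An arena set up over any pointer-aligned buffer, with `beg` at its start and `end` one past its last byte, is enough to drive `strclone` and `intern` as written.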

So I'm quite happy with it. Just want to make sure I'm not missing anything.

N-R-K commented 6 months ago

I was initially looking at the diff without reading the description and wondered why use x86 asm as that limits the usefulness of this to a single arch. But after reading the motivation (faster debug builds) and considering the context (x86 is pretty much the only arch worth considering on Windows) - the choice makes sense. Though, I can't comment on the actual implementation since I'm not familiar with those instructions.

As for the builtins, I don't see myself using them, at least not directly, to avoid unnecessary compiler requirements. But perhaps with some ifdefs it could work out. Though debug build performance hasn't been an issue for me thus far.
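The "some ifdefs" idea could look roughly like the sketch below: use the built-ins when the compiler is known to provide them, and fall back to `<string.h>` otherwise. The macro names (`mem_set`, `mem_copy`, `mem_cmp`) and the helper `bytes_equal` are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
   // GCC and Clang: the built-ins need no header and no CRT declarations.
#  define mem_set(d, c, n)  __builtin_memset(d, c, n)
#  define mem_copy(d, s, n) __builtin_memcpy(d, s, n)
#  define mem_cmp(a, b, n)  __builtin_memcmp(a, b, n)
#else
   // Other compilers: fall back to the standard declarations.
#  include <string.h>
#  define mem_set(d, c, n)  memset(d, c, n)
#  define mem_copy(d, s, n) memcpy(d, s, n)
#  define mem_cmp(a, b, n)  memcmp(a, b, n)
#endif

// Compare n bytes, skipping the call entirely when n is zero so that
// null pointers with zero length never reach memcmp.
static int bytes_equal(const void *a, const void *b, size_t n)
{
    return !n || !mem_cmp(a, b, n);
}
```

The rest of the program then only ever names the macros, so the compiler requirement stays confined to one spot.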

> I don't see anything wrong with it unless there's some convention of which I'm unaware.

Me neither. But given how generic "memory" is, maybe a short suffix would be safer to avoid collision.

Other than that, nothing else particularly stands out.

Peter0x44 commented 6 months ago

I like this idea, and I think it would be great if mingw-w64 could provide the implementations instead, like Visual Studio seems to do, as you mentioned. I don't see myself using the w64devkit-provided lib much. It looks like the sort of thing I'd prefer to copy into my own project, since I still value cross-compiling my projects using "regular" mingw-w64 Linux cross toolchains.

skeeto commented 6 months ago

Thanks for the responses!

Peter0x44 commented 5 months ago

@skeeto I was looking at the GCC 14 changes (https://gcc.gnu.org/gcc-14/changes.html) and noticed an intriguing option: "New option -finline-stringops, to force inline expansion of memcmp, memcpy, memmove and memset, even when that is not an optimization, to avoid relying on library implementations."

I did some testing in compiler explorer, and it may have made libmemory irrelevant: https://gcc.godbolt.org/z/o83do6W4r

Even the "invisible" generated calls to the string.h functions are replaced with inline implementations! At -Os it uses string instructions, and at -O2 it's using movaps to implement memset.

Definitely worth some investigation.
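For readers who want to try this locally rather than in Compiler Explorer, the kind of code that produces "invisible" calls is large aggregate initialization and assignment. The sketch below is a hypothetical test file (the names `big`, `clear_big`, and `copy_big` are made up here), and the build line in the comment is an assumption about how one might compare output with and without the flag. What a given GCC version actually emits has not been verified here.

```c
// Possible comparison, assuming GCC 14 or later:
//   gcc -O2 -S stringops.c                     (may call memset/memcpy)
//   gcc -O2 -S -finline-stringops stringops.c  (should expand inline)

struct big {
    char bytes[4096];
};

// No memset appears in the source, yet the compiler may emit a call
// to it in order to zero the whole object.
void clear_big(struct big *b)
{
    struct big zero = {0};
    *b = zero;
}

// Likewise, a large struct assignment may turn into a memcpy call.
void copy_big(struct big *dst, const struct big *src)
{
    *dst = *src;
}
```

Searching the generated assembly for `call` instructions targeting `memset` or `memcpy` shows whether the flag made a difference for a particular target and optimization level.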

N-R-K commented 5 months ago

Doesn't seem very reliable. You still need -fno-builtin / -ffreestanding to avoid it "optimizing" your string functions into library calls: https://godbolt.org/z/G8vToKEjv
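The situation described here arises with a hand-written definition like the sketch below. The name `my_memset` is hypothetical. At -O2, GCC's loop pattern recognition (documented as -ftree-loop-distribute-patterns) may notice that the loop is a byte fill and replace it with a call to the library `memset`, which defeats the purpose when no library is linked. Whether that happens for a given compiler version and flag set is something to check in the generated assembly, not something this sketch guarantees.

```c
#include <stddef.h>

// Hypothetical freestanding-style fill routine. The loop is exactly
// the pattern an optimizer may recognize and turn back into a
// memset library call.
void *my_memset(void *dst, int c, size_t len)
{
    unsigned char *p = dst;
    for (size_t i = 0; i < len; i++) {
        p[i] = (unsigned char)c;
    }
    return dst;
}
```

Behavior is the same either way. Only the link-time dependency changes, which is why the thread treats this as a reliability question and not a correctness one.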

Also, I'm assuming -finline-stringops is basically a wrapper around -mstringop-strategy which selects all the available algorithms except libcall. Which is not bad, but I would've preferred if they just let the user choose multiple algos with -mstringop-strategy instead.

skeeto commented 5 months ago

Thanks for the heads up, @Peter0x44! On the surface that sounds like what I want, and I like how -Os uses string instructions, but I see @N-R-K has counter-examples. The most thorough option continues to be undocumented behavior from -fno-builtin, which Clang doesn't have at all. Even so, GCC seems to prefer mem{set,cpy} for large initializations/assignments despite -fno-builtin. Though -finline-stringops seems to deal with those cases, as your example shows.

https://gcc.godbolt.org/z/jhcsxbx4K

This is an old issue, and all three major compilers are on the same side of it, so I've accepted that external definitions are simply part of the territory. -fno-builtin smooths over some of it, but it ultimately needs a fallback, which is what libmemory.a is about. I've used it with Clang and MSVC, and on Linux as well. If that's too inconvenient on Linux, another trick:

$ musl-gcc -static -nostartfiles ...

That plucks the necessary definitions out of musl without any baggage or interference. It's similar to what I've done on Windows before / without -lmemory. Without musl-clang, you can do the same with Clang by linking musl's libc.a explicitly.

skeeto commented 5 months ago

Had another idea today: libchkstk.a, providing ___chkstk_ms (plus __chkstk for MSVC as a bonus). This function is inserted in all functions with a stack frame larger than 4KiB in order to incrementally grow the stack. Mine is a lot better than the one in libgcc, and perhaps even something I could tolerate.
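As a small illustration of when the probe comes into play, the hypothetical function below has a 64 KiB local array, well past the 4 KiB threshold. When compiled for Windows with GCC, a function like this would be expected to call ___chkstk_ms in its prologue so that each page of the new frame is touched in order. On other platforms it simply compiles and runs without any such call. The function name and frame size are assumptions chosen for the example.

```c
// Hypothetical example of a frame large enough to need a stack probe
// on Windows targets. The volatile qualifier keeps the compiler from
// optimizing the array away.
int touch_big_frame(int seed)
{
    volatile char frame[64 * 1024];
    frame[0] = (char)seed;
    frame[sizeof(frame) - 1] = (char)(seed + 1);
    return frame[0] + frame[sizeof(frame) - 1];
}
```

Building this with -nostdlib for a Windows target is a quick way to see the unresolved ___chkstk_ms reference that a library like libchkstk.a is meant to satisfy.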

Also, it turns out the __chkstk in x86-64 libgcc is totally busted. Clearly nobody has ever used it (i.e. linked libgcc with x64 MSVC objects that need it), because it would be immediately obvious that it didn't work. The i686 definition is fine, though, if still suboptimal.

I also learned that the x86 __chkstk is bonkers: Unlike ___chkstk_ms, the actual frame allocation occurs inside __chkstk, making it the most difficult to write. For x64, Microsoft chose GCC's semantics instead. (Deliberate copying, or just coincidence?)

(Also, that's three underscores in ___chkstk_ms, which I never noticed before!)

I'm satisfied with the name, but I'm going to dogfood it awhile before making a decision about inclusion.