rui314 / mold

Mold: A Modern Linker 🦠

SIMD #92

Closed kirawi closed 3 years ago

kirawi commented 3 years ago

Full disclosure, I'm completely unfamiliar with the kind of work you're doing and especially the problem you're trying to tackle. I've made the assumption that there is code where SIMD is applicable, but I have not assumed that it will improve performance. That out of the way, I hope my ignorance isn't more than mildly offensive 😅

Would it be possible to use SIMD to improve performance further? I notice that, aside from everything being done on the concurrency side, there's also a lot of I/O work. I'm not sure whether SIMD would even help there or whether it would be bottlenecked by drive speed, or whether the work is even worth SIMD's overhead, in which case SIMD might actually hurt performance.

Alcaro commented 3 years ago

SIMD can speed up large loops where each iteration does the same thing (no input-dependent branches), and each iteration does not depend on the previous one. For example, SIMD can speed up most kinds of non-cryptographic checksum calculation.

SIMD cannot speed up much else, including branchy code, recursion, loops that are too short, non-loop code, and most data structure operations (arrays are fine; few other things are). Sometimes it's possible to rewrite a function into a SIMD-friendlier form, but that's rare.
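For illustration, here is roughly what that distinction looks like in C++ (function names are made up; with `-O3` most compilers can auto-vectorize the first loop but not the second):

```cpp
#include <cstddef>
#include <cstdint>

// SIMD-friendly: every iteration does the same arithmetic, and the only
// cross-iteration state is an associative sum, which the compiler can split
// across vector lanes.
uint32_t byte_sum(const uint8_t *data, size_t len) {
  uint32_t sum = 0;
  for (size_t i = 0; i < len; i++)
    sum += data[i];
  return sum;
}

// Not SIMD-friendly: which branch is taken depends on the data itself, and
// the running state feeds serially into the next iteration, so the lanes
// cannot be processed independently.
uint32_t branchy_state_machine(const uint8_t *data, size_t len) {
  uint32_t state = 0;
  for (size_t i = 0; i < len; i++) {
    if (data[i] == 0x5c)                // input-dependent branch
      state = state * 31 + 7;
    else
      state = (state >> 1) ^ data[i];   // serial dependency on `state`
  }
  return state;
}
```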

I think mold spends most of its time in TBB hashmaps. The hash calculation may be SIMDable, unless TBB already does that; the rest of the hashmap is, as far as I know, not SIMDable.

SIMD can also only speed up your own code. Disk access belongs to the kernel, not to mold.

kirawi commented 3 years ago

I agree with you; I was referring more to the symbol resolution step. simdjson uses some clever SIMD tricks, discussed in their paper, to avoid branching while still classifying its input, but I'm not certain how applicable that would be here.
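For anyone curious, the core trick in simdjson's paper is branchless byte classification: compare a whole block of input at once and get a bitmask back. A minimal SSE2 sketch of that idea, not tied to anything mold actually does:

```cpp
#include <immintrin.h>
#include <cstdint>

// Scan 16 bytes at once for a target byte with no per-byte branches;
// the result is a bitmask with one bit set per matching position.
uint32_t match_mask16(const char *p, char target) {
  __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
  __m128i needle = _mm_set1_epi8(target);
  __m128i eq = _mm_cmpeq_epi8(chunk, needle);            // 0xFF where equal
  return static_cast<uint32_t>(_mm_movemask_epi8(eq));   // one bit per byte
}
```

A caller then walks the set bits (e.g. with `__builtin_ctz`) instead of branching on every byte.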

As for hashing, https://github.com/BLAKE3-team/BLAKE3 (6.8 GiB/s versus SHA-1's 1 GiB/s in their benchmark) would be a good candidate, but I imagine @rui314 isn't keen on adding more dependencies unless they're absolutely necessary.
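For reference, the official BLAKE3 repo ships a small C API (`blake3.h`), so wiring it in would look roughly like the sketch below. This is just the upstream API with a made-up input string, not anything mold does today:

```cpp
#include "blake3.h"   // from the official BLAKE3 C implementation
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  const char *msg = "hello, linker";   // arbitrary example input
  uint8_t out[BLAKE3_OUT_LEN];         // 32-byte default digest

  blake3_hasher hasher;
  blake3_hasher_init(&hasher);
  blake3_hasher_update(&hasher, msg, strlen(msg));
  blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);

  for (size_t i = 0; i < BLAKE3_OUT_LEN; i++)
    printf("%02x", out[i]);
  printf("\n");
}
```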

rui314 commented 3 years ago

mold does not use SIMD instructions explicitly, but SIMD is used in a lot of places in mold, because many library functions are implemented using SIMD. For example, glibc's strlen is implemented with SSE 4.2 instructions, I believe. Another example is xxhash3. There might be other places where I could use SIMD to improve mold's performance, but I can't come up with anything right now.
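(For reference, the xxHash library selects an SSE2/AVX2/NEON code path internally, so the caller never writes intrinsics. A minimal sketch of the XXH3 entry point, with a made-up input, not mold's actual call site:)

```cpp
#include "xxhash.h"   // https://github.com/Cyan4973/xxHash
#include <cstdio>
#include <cstring>

int main() {
  const char *sym = "some_symbol_name";             // hypothetical input
  XXH64_hash_t h = XXH3_64bits(sym, strlen(sym));   // SIMD happens inside the library
  printf("%016llx\n", static_cast<unsigned long long>(h));
}
```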

As for cryptographic hashing, I actually tried BLAKE3. We are currently using SHA-256 for Identical Comdat Folding and build-id computation. For the former use case, we need to compute a cryptographic hash of small data (typically less than 100 bytes). For the latter, we compute a SHA-256 digest over the entire output file, which can be multiple gigabytes.
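For concreteness, the ICF-sized case is essentially a one-shot digest over a tiny buffer. The sketch below uses OpenSSL's SHA256 helper purely as a stand-in; whether mold links OpenSSL or bundles its own SHA-256 isn't shown here:

```cpp
#include <openssl/sha.h>   // stand-in SHA-256 implementation for this sketch
#include <cstdio>
#include <cstring>

int main() {
  // Roughly the ICF-sized case: a digest over <100 bytes of section data.
  const char *section = "contents of a small comdat section";
  unsigned char digest[SHA256_DIGEST_LENGTH];   // 32 bytes

  SHA256(reinterpret_cast<const unsigned char *>(section),
         strlen(section), digest);

  for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
    printf("%02x", digest[i]);
  printf("\n");
}
```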

It looks like BLAKE3 is slower than SHA-256 for small data, at least on my machine, perhaps due to high initialization and finalization costs. For large data, BLAKE3 is indeed faster than SHA-256 by a factor of two. With enough cores, though, build-id computation is bounded by memory bandwidth even with SHA-256, so I don't see an immediate need to switch to BLAKE3.
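For anyone wondering how whole-file hashing can become memory-bandwidth-bound, one common scheme is to hash fixed-size shards in parallel and then hash the concatenated per-shard digests. The sketch below uses `tbb::parallel_for` and OpenSSL's SHA256 as stand-ins and is not necessarily mold's exact approach:

```cpp
#include <openssl/sha.h>
#include <tbb/parallel_for.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hash a large in-memory output image by sharding it: each shard's digest is
// computed independently, which lets every core stream its own slice of
// memory, then a final digest is taken over the concatenated shard digests.
std::vector<uint8_t> hash_output_file(const uint8_t *buf, size_t size) {
  constexpr size_t shard_size = 4 * 1024 * 1024;   // 4 MiB per shard
  size_t num_shards = (size + shard_size - 1) / shard_size;

  std::vector<uint8_t> digests(num_shards * SHA256_DIGEST_LENGTH);
  tbb::parallel_for(size_t(0), num_shards, [&](size_t i) {
    size_t off = i * shard_size;
    size_t len = std::min(shard_size, size - off);
    SHA256(buf + off, len, digests.data() + i * SHA256_DIGEST_LENGTH);
  });

  // The final, cheap step: hash the small list of per-shard digests.
  std::vector<uint8_t> out(SHA256_DIGEST_LENGTH);
  SHA256(digests.data(), digests.size(), out.data());
  return out;
}
```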