CPUID feature probing - Githubissues

moon-chilled commented 2 years ago

Hi,

Currently, it seems sb-simd will do featureset probing with cpuid and disable intrinsics which are not available on the platform being compiled on. I think that it shouldn't do this, because it does not allow for certain valid use cases. Somebody might want to make a binary artifact tuned for a specific platform which is not the same as the one being compiled on. Or somebody might want to do runtime feature detection, in the case when the operation being performed is expensive enough for that to be worthwhile. I think that, in order to support these use cases, sb-simd should act primarily as an encoder of instructions, leaving it to the application to decide when those instructions should run.

(The issue of CPU targeting is still one that needs to be solved, and I am planning to write a library to deal with it. The library will permit either run-time or compile-time feature detection on the part of libraries, at their discretion; it will default to probing but will support customisation on the part of applications; and it will cover more than just feature sets: e.g. uarch name, cache sizes, maybe even cycle timings.)

marcoheisig commented 2 years ago

Hi moon-chilled, thanks for your feedback!

Assuming a host system where you use sb-simd to produce some code, fasl, or binary, and a target system where you run it, there are basically two interesting cases:

1. Instruction sets that are available on the host, but not on the target.
2. Instruction sets that are available on the target, but not on the host.

I decided that 2. is not worth going after. With that decision in place, I know that a function that is not available on the host is statically known to be not available on the target. The question is how to deal with case 1. And I hope you'll be delighted to hear that I already have a solution for that. The instruction-set-case macro allows dispatching on the available instruction sets. I use some load-time-value magic to ensure that each use of that macro translates into a single unconditional jump with an offset that is recomputed whenever the SBCL image is (re-)launched.
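The idea of resolving the dispatch once per launch, rather than at every call, can be sketched outside Lisp. The following Python is only an illustration under stated assumptions: `select_variant`, `VARIANTS`, `dot_wide`, and `HOST_FEATURES` are hypothetical names, not part of sb-simd, and the "wide" variant is plain Python standing in for a vectorised kernel.

```python
# Hypothetical sketch: none of these names come from sb-simd.

def dot_scalar(xs, ys):
    """Portable fallback that needs no special instruction set."""
    return sum(x * y for x, y in zip(xs, ys))

def dot_wide(xs, ys):
    """Stand-in for a variant that would use a wider instruction set."""
    total = 0.0
    n = len(xs) - len(xs) % 4
    # Main loop handles four elements per step, mimicking a 4-lane vector.
    for i in range(0, n, 4):
        total += (xs[i] * ys[i] + xs[i + 1] * ys[i + 1]
                  + xs[i + 2] * ys[i + 2] + xs[i + 3] * ys[i + 3])
    # Remainder loop handles the last len(xs) % 4 elements.
    for i in range(n, len(xs)):
        total += xs[i] * ys[i]
    return total

# Ordered from most to least demanding; the last entry requires nothing.
VARIANTS = [
    (frozenset({"avx2", "fma"}), dot_wide),
    (frozenset(), dot_scalar),
]

def select_variant(available, variants=VARIANTS):
    """Return the first variant whose required features are all available."""
    for required, fn in variants:
        if required <= available:
            return fn
    raise LookupError("no variant matches the available features")

# Assumption: an empty set is used as a placeholder for the probed features.
# A real program would fill this in from a CPU probe at startup.
HOST_FEATURES = frozenset()

# Resolved once when the module is loaded; callers just call dot(...).
dot = select_variant(HOST_FEATURES)
```

Because `dot` is bound at load time, call sites pay no per-call feature test; rerunning the program on a different machine simply rebinds it.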

I hope that addresses your concerns.

About that feature detection and timing library of yours: I already have some amount of /proc/... parsing in Petalisp (https://github.com/marcoheisig/Petalisp/blob/master/code/ir/device.lisp) but I don't use it, yet. For the CPU performance counters and benchmarking, I would love to simply have a Lisp interface to likwid (https://github.com/RRZE-HPC/likwid). Disclaimer: I work at the same place as the likwid developers.
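For readers unfamiliar with /proc-based probing, here is a minimal Python sketch of the general approach on Linux. It is not the Petalisp code linked above; the function names are hypothetical, and it assumes the usual `flags` (x86) or `Features` (ARM) line in `/proc/cpuinfo`.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags found in /proc/cpuinfo-style text.

    Uses the first 'flags' line (x86) or 'Features' line (ARM);
    returns an empty set if neither is present.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("flags", "Features"):
            return frozenset(value.split())
    return frozenset()

def read_host_flags(path="/proc/cpuinfo"):
    """Read and parse the host's flags; empty set if the file is unreadable."""
    try:
        with open(path, encoding="ascii", errors="replace") as f:
            return parse_cpu_flags(f.read())
    except OSError:
        # Not Linux, or /proc is not mounted: report no known features.
        return frozenset()
```

Keeping the parser separate from the file read makes it possible to feed it the text of a different machine's `/proc/cpuinfo`, which is one way to describe a target that differs from the host.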

moon-chilled commented 2 years ago

Hi, sorry about the delayed response.

> I decided that 2. is not worth going after

I think that's reasonable.

> load-time-value magic to ensure that each use of that macro translates into a single unconditional jump with an offset that is recomputed whenever the SBCL image is (re-)launched

Ah, I saw the macro, but I missed that it did that. That is clever! But it doesn't really address my concerns, no.

The issues are manifold, but the biggest one is that feature levels aren't everything. It's common to also specialise for a particular uarch, for instance; see the various blas implementations. I can also imagine statically specialising for things like cache sizes: eg new intel cpus have 48kb I$, and maybe that means I can unroll my loops while still keeping my working set hot.

The issue of static vs dynamic specialisation is a related one. Or, rather, the issue is exactly how highly hoisted the specialisation is. Suppose I implement FMA as a multiply followed by an add on platforms without hardware support. I'm not going to want even a direct jump in the middle of my tight loop; I'm going to want to hoist the jump at the very least outside of the loop, and maybe even further for locality reasons. phoe's with-branching provides one template for specifying this information exactly, but it might be desirable to go even further, e.g. applying it interprocedurally.
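To make the hoisting point concrete, here is a small Python sketch that performs the transformation by hand. All names are hypothetical, and both FMA functions compute the same thing here; on real hardware a fused multiply-add rounds once, whereas a multiply followed by an add rounds twice.

```python
def fma_fallback(a, b, c):
    """Multiply followed by add, for platforms without hardware FMA."""
    return a * b + c

def fma_native(a, b, c):
    """Stand-in for a hardware fused multiply-add (same arithmetic here)."""
    return a * b + c

def axpy_branch_inside(has_fma, a, xs, ys):
    """Feature test repeated on every iteration of the hot loop."""
    out = []
    for x, y in zip(xs, ys):
        if has_fma:
            out.append(fma_native(a, x, y))
        else:
            out.append(fma_fallback(a, x, y))
    return out

def axpy_branch_hoisted(has_fma, a, xs, ys):
    """Feature test made once; each arm is a loop with no feature test."""
    if has_fma:
        return [fma_native(a, x, y) for x, y in zip(xs, ys)]
    return [fma_fallback(a, x, y) for x, y in zip(xs, ys)]
```

The hoisted form duplicates the loop body once per branch, trading code size for a test-free inner loop. Hoisting further, up to the caller or to program start, repeats the same trade at a larger granularity.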

All of these are going to be somewhat application-specific and require some application-specific machinery, but there are some facets that are common and that a library would help with. My concern is that instruction-set-case, while useful, addresses only one facet of these highly interconnected specialisation-related concerns, and I think a unified approach would be superior. If sb-simd wants to provide such a unified suite of functionality, great, but if it does not, I think it should not take half measures.

> For the CPU performance counters and benchmarking, I would love to simply have a Lisp interface to likwid

Likwid is cool, I didn't know about it. I definitely think benchmarking and performance counters are important, and would like to see a solution for them. But that's not quite what I meant by 'cycle timings'; I meant a static database, like the instruction tables you might see from Agner Fog or uops.info, which can be used for finer-grained specialisation.

marcoheisig commented 2 years ago

sb-simd has now been merged into sbcl, so I'm archiving this repository.

Note that I haven't forgotten your remarks about the specialization macros. But developing such functionality is outside of the scope of sb-simd. Maybe this is something for our Loopus project.

marcoheisig / sb-simd

CPUID feature probing #18