Direct Driver HAL - Githubissues

makslevental commented 1 month ago

quote-you-know-crazy-straws-they-go-all-over-the-place-these-straws-are-sane-they-never-lost-mitch-hedberg-129-5-0527

Let's try this again: this is a HAL that skips/avoids/elides XRT completely. It does so by (essentially) going straight to the XDNA driver's (kmq) shim which talks directly to the kernel module. Note that XDNA's shim actually couples tightly to XRT itself so the bulk of the actual "work" here is/was decoupling. Thus, we vendor the decoupled shim. Just to be explicit/clear: after this PR, on Linux, the only thing needed in the environment is the kernel module itself - not XRT, nor XDNA's shim.

Call outs:

Only Linux for now (Windows soon);
On Linux, we no longer even need to clone XRT (completely really does mean completely);
- We no longer use .xclbin as the artifact and now just load the .pdi directly;
This uses the so-called kernel-managed queue (kmq) path; user-managed queues (umq) aren't supported until aie4;
The current implementation here is still sync and still uses direct (rather than indirect) command buffers; this PR is already large/complex enough without changing the paradigms. The next PR will add/implement those;
Nothing about how drivers themselves are bumped needs to change - you/anyone can continue to install the XDNA drivers using their full instructions there and everything will work the same, i.e., the XRT pieces will still be skipped even if you have them installed. But I will probably put together a script that builds/installs only their kernel module for those that want to opt-out of XRT in their environment;
For now, the XRT HAL will still work on Linux but it should be considered deprecated (hence no longer tested in CI).

TODOs before landing:

[x] remove instructional comments leftover from the null HAL;
[x] undo "OO"-ness of the HAL
[x] add missing trace zones
[x] remove smart pointers where possible
[ ] ~~move hwq into kernel~~
[x] remove exceptions from the shim.
[x] parameterize number of cols/rows (currently hardcoded 4x4 for Phoenix);

shout-out to @benvanik for putting together the null HAL lickety-split for me to study while working on this PR.

makslevental commented 1 month ago

Nice! I am wondering though, one of the issues with XRT HAL was the heavy dependence on C++ which was leaking into the HAL C structs that cant have C++ objects so we had pointers to them and ran into ownership issues of those pointers.

Interesting - do you remember which objects specifically? I have to admit that one thing that isn't clear to me (in HAL land) is what is allowed-to-be/should-be an "inlined struct" and what needs to be a pointer. Currently I believe I'm managing lifetimes correctly but I haven't actually rigorously checked (I keep failing to successfully get an ASAN build working for us).

nirvedhmeshram commented 1 month ago

Nice! I am wondering though, one of the issues with XRT HAL was the heavy dependence on C++ which was leaking into the HAL C structs that cant have C++ objects so we had pointers to them and ran into ownership issues of those pointers.

Interesting - do you remember which objects specifically? I have to admit that one thing that isn't clear to me (in HAL land) is what is allowed-to-be/should-be an "inlined struct" and what needs to be a pointer. Currently I believe I'm managing lifetimes correctly but I haven't actually rigorously checked (I keep failing to successfully get an ASAN build working for us).

I think all the smart pointer patterns you see in the XRT HAL are a symptom of that issue. We essentially used the .release() on smart pointers to take ownership of the objects as you couldnt just new/delete them in the XRT API, and an example of the ownership issues I am referring to are shown by the need of PRs like this https://github.com/nod-ai/iree-amd-aie/pull/641 although looks like that was later fixed (as I don't see the global device in the xrt_device.h in the code anymore) We also had trouble with the compute kernel being deleted because we were deleting the xclbin, but we technically had ownership of the kernel.

nod-ai / iree-amd-aie

Direct Driver HAL #816