[Mini-RFC] Tracking issue for single source cross-compilation

As suggested, this issue is about the Rust single source cross compilation story. I've had some success with this while I was working on mir-hsa and the AMDGPU Stuff. In my case, the HSA API is used, which accepts GPU native ELF files directly, and therefore enables me to not have to write a new codegen backend. I have done this with the following changes to Rust:

[ ] Plugin provided rustc-intrinsics, which run post monomorphization to have access to only concrete types and usage. They are limited to inserting MIR statements just before the original TerminatorKind::Call. This is useful because all Rust functions have a unique type. Thus a call the intrinsic like this one (note the upvar stuff is nonfunctional, however. I need to put more thought into that part) will “return” the DefId (rather, equivalent info) of the function passed in. After expansion, trans rewrites the terminator to be a direct goto. This is the plugin that expands the kernel_info_for intrinsic mentioned/linked earlier.

This expansion originally occurred during trans, like when traditional LLVM intrinsics are handled. However, I think it could be made to happen before collection and partitioning. Either way, this implementation as is allows crate authors to use plugin intrinsics and not force downstream crates to also load that plugin (downstream crates can't be able to call the intrinsics generically).

Here is the trait for the plugin registry:

pub trait MirPluginIntrinsicTrans {
    fn trans_simple_intrinsic<'a, 'tcx>(&self, tcx: TyCtxt<'a, 'tcx, 'tcx>,
                                        name: &str,
                                        source_info: SourceInfo,
                                        sig: &FnSig<'tcx>,
                                        parent_mir: &mir::Mir<'tcx>,
                                        parent_param_substs: &'tcx Substs<'tcx>,
                                        args: &Vec<Operand<'tcx>>,
                                        return_place: Place<'tcx>,
                                        extra_stmts: &mut Vec<StatementKind<'tcx>>)
        where 'tcx: 'a;
}

The extra statements needed to store the return values are put into extra_stmts, which are translated just after the function returns. The other parameters are so one has the needed things to monomorphize types, and for debugging info.

My impl had Rust just trusting that the "return" value provided by the plugin matched the type specified by the intrinsic declaration. This is probably not what we want long term.

[ ] A codegen/”debugging” flag to always save/keep Rust’s cstore metadata in outputs. Combined with -Z always-encode-mir, we now have every function’s MIR available at runtime. My original impl made this hard coded ‘cause I’m lazy, so I’ll need to fix this before issuing a PR.
[ ] An extra codegen/”debugging” flag to force librustc_mir::monomorphize to translate everything, without relying on upstream definitions and a linker. I made this inaccessible to everyone except specialized rustc drivers, ie can’t use rustc -Z retrans-all-deps .., to prevent misuse.
[ ] Make librustc_metadata::schema public, so one can use CrateRoot and friends. Here is how the metadata was loaded (finding every dylib, including dylibs not actually mapped into the current process, is done elsewhere).
[ ] Make Rust accept a mono item collector root override. This is used by a special rustc driver at compiled program runtime to rerun trans for specific functions.

Issues (as implemented, so mostly issues related to my runtime impl):

[ ] This method is limited to Rust code only.
[ ] It also doesn’t allow performing codegen at compile-time (ie generating for the cross when you’re compiling for the host; the cross must be “compiled” at runtime).
[ ] Globals referenced will not be shared. Or, globals are defined in the output. Host/GPU ABI differences.
[ ] Closure upvars are ignored (IIRC, they didn’t work at all when I tried with my prototype, but I decided to ignore this while I got other things working so..). The runtime (eg my mir-hsa::runtime crate) will need to know about the upvars of a closure so that they can get mapped on to the GPU.
[ ] New Send like trait: NumaSend. This is needed so that types can do appropriate serialization of inner memory regions when they are sent to other memories. Doesn’t seem like there is a whole lotta thought into this area, yet.

Has anyone else worked on any single source prototype? I see this topic on discuss.rust-lang.org, but nothing recent.

RFC of sorts, so please criticize. I'm going to submit patches for the proposed changes above, and I would like to at least get something functionally equivalent accepted. @eddyb

Edit log:

Fix missing link

(Nitpick: please note that we've moved from trans to codegen for referring to MIR -> LLVM IR/etc.)

It's not super clear what the motivations and goals here are. I see a lot of interesting technical details wrt prerequisites, but not much about the actual MIR-based cross-compilation backend. I'm guessing mir-hsa should link to https://github.com/DiamondLovesYou/rust-mir-hsa? (Btw, are you aware of https://github.com/MaikKlein/rlsl? Seems like there's a potential overlap)

Today, a Rust crate, once compiled, is locked into a target (via e.g. #[cfg], memory layout, ABIs), and source-less recompilation to a different target is not supported. Does "single source cross-compilation" imply "multi-target crates"? Could the crate be partitioned syntactically, or is the choice of target done late? If the partitioning is syntactical, we could pursue "mixed compilation", which aligns with "proc macros in the same crate", although there would likely be too many limitations for you. Otherwise, my only suggestion is compiling for the final target (SPIR-V?) and turning that into either host code or target code, as-needed.

cc @rust-lang/compiler

It's not super clear what the motivations and goals here are. I see a lot of interesting technical details wrt prerequisites, but not much about the actual MIR-based cross-compilation backend.

It would be anything that can codegen MIR. My goal is to use LLVM codegen, but there's no reason that's required.

I was not aware of https://github.com/MaikKlein/rlsl. It looks interesting! Sadly, the optimizations done by LLVM are too nice to forego.

Today, a Rust crate, once compiled, is locked into a target (via e.g. #[cfg], memory layout, ABIs), and source-less recompilation to a different target is not supported.

Btw, I shouldn't have said ABI, at least for my use case. There are no ABI issues to be had, as there are no system libraries to call. All "function" call arguments are serialized/deserialized explicitly by the runtime/gpu-kernel into/from a single byte array. The gpu side deserialization is handled by the runtime, by inserting a wrapping function around the desired function-to-codegen.

Generally, one wants to avoid all serialization/deserialization, other than simple copies, when passing data between processors. Based on what I've seen in CUDA/C++AMP/HIP, the coprocessor (GPU, in my case) is forced to use whatever struct layout the host uses. Thus, having target specific Stuff baked into the MIR poses no issues. I say this w.r.t. host<->gpu. I'm currently ignoring linking other GPU libraries (ie gpu code only).

Does "single source cross-compilation" imply "multi-target crates"? Could the crate be partitioned syntactically, or is the choice of target done late?

Kinda? At compile time, the crate will always single target. It's up to the runtime crate to load all of metadata all crates, and codegen <desired unit of code> for <desired target>. In my case, it is done function by function, which uses a plugin intrinsic to figure out what the DefId of the function is. It also allows tailoring codegen for what hardware is present (codegen for the Vega FE, as Vega FE offers a fully universal host/GPU address space, which is nice, if one is present, for example). But one could also use something like #[cross_compile(target_machine = "amdgpu", target_cpu = "gfx903")] on functions and with another (or the same) plugin, and have the plugin drive a separate codegen instance similar to what my crate does at runtime (just as a general idea, one would have to get the ELF(?) output into a constant etc).

So to answer your second question, the plugin intrinsic is used to grab the DefId, which is used to discover what is used for that particular function (in my case, but there's no reason something fancy couldn't work using structures and trait implementations). Everything else would be discarded. The desired function is still compiled for the host as normal too.

I should have mentioned that it is possible to change the MIR before it is sent to codegen. So, for example, if I wanted to run most of a function (think all the way down the call tree) on my GPU, but the function had cold path calls to system libraries, the runtime library could (basically, there's more to this) write those calls to post to the host's AQL queue (see this) from the GPU and then have the host call the required function.

Here is a "simple" dispatch example, taken from the HSA Foundation documation:

void simple_dispatch() {
// Initialize the runtime
hsa_init();

// Retrieve the kernel agent
hsa_agent_t kernel_agent;
hsa_iterate_agents(get_kernel_agent,   &kernel_agent);

// Create a queue in the kernel agent. The queue can hold 4 packets, 
// and has no callback or service queue associated with it
hsa_queue_t *queue;

hsa_queue_create(kernel_agent, 4, HSA_QUEUE_TYPE_SINGLE, NULL, NULL, 0, 0, &queue);

// Since no packets have been enqueued yet, we use zero as 
// the packet ID and bump the write index accordingly
hsa_queue_add_write_index_relaxed(queue, 1);

uint64_t packet_id =0;
// Calculate the virtual address where to place the packet
hsa_kernel_dispatch_packet_t* packet = 
  (hsa_kernel_dispatch_packet_t*) queue->base_address + packet_id;

// Populate fields in kernel dispatch packet, except for the header, 
// the setup, and the completion signal fields
initialize_packet(packet);

// Create a signal with an initial value of one to monitor the task completion
hsa_signal_create(1, 0, NULL, &packet->completion_signal);

// Notify the queue that the packet is ready to be processed
packet_store_release((uint32_t*) packet, header(HSA_PACKET_TYPE_KERNEL_DISPATCH), 
                                    kernel_dispatch_setup());

hsa_signal_store_screlease(queue->doorbell_signal,    packet_id);

// Wait for the task to finish, which is the same as waiting for
// the value of the completion signal to be zero
while (hsa_signal_wait_scacquire(packet->completion_signal, HSA_SIGNAL_CONDITION_EQ,
                                 0, UINT64_MAX, HSA_WAIT_STATE_ACTIVE) != 0);

// Done! The kernel has completed. Time to cleanup resources and leave
hsa_signal_destroy(packet->completion_signal);
hsa_queue_destroy(queue);
hsa_shut_down();
}

It's very similar to how IPC code is written.

If the partitioning is syntactical, we could pursue "mixed compilation", which aligns with "proc macros in the same crate", although there would likely be too many limitations for you.

I'm not sure I understand what you mean. This method is directly applicable to grabbing the MIR of existing functions. No need for any syntactic changes. The runtime has access to all of the MIR, let it decide.

We can also override the default providers when the runtime codegens. So we can use a custom collect_and_partition_mono_items provider and do everything ourselves. Honestly, this should probably (ugh, lazy) be the future, instead of using -Z retrans-all-deps and a collector root override in my case.

Otherwise, my only suggestion is compiling for the final target (SPIR-V?) and turning that into either host code or target code, as-needed.

One can use any codegen module one wants, including hand rolled. My use case is directed toward native targets so I don't have to write my own codegen module (and write optimizations (!!!!!!!!) which is no small feat. LLVM is quite good) :smile:

Sorry, that was a bit long winded!

This method is directly applicable to grabbing the MIR of existing functions. No need for any syntactic changes. The runtime has access to all of the MIR, let it decide.

The problem is that that MIR is tainted with target-specific information, either via #[cfg] or through (partial) const-evaluation, of intrinsics like size_of, or just turning Rust values into bytes.

If you're using serialization, why not compile the crate twice, from source, for two different targets?

The problem is that that MIR is tainted with target-specific information, either via #[cfg] or through (partial) const-evaluation, of intrinsics like size_of, or just turning Rust values into bytes.

I suppose I assumed that. Such properties are desirable for my use case (GPUs on big(-ish) iron Linux). These devices will operate on the same memory, and so almost require such specialization anyway.

Regarding #[cfg]: I actually had a different plan for “solving” that, which has the runtime replacing function definitions by overriding the optimized_mir provider.

Sure, it's not perfect (whatever that means, anyway), but it is convenient.

If you're using serialization, why not compile the crate twice, from source, for two different targets?

That would be awful. Think of all the #[cfg()]s required. The full Rust lang can't run on GPUs anyway: no (native) virtual function calls (impossible to avoid due to panic!(), unless one is a masochist and abandons even core), exceptions, thread local. Which would then require two sources: the application code and the extremely un-Rustean kernel code that runs on the GPU, plus possibly another crate to just hold data structure definitions and maybe a few functions. Game over.

The work-arounds I've mentioned are possible because I have something that can rewrite those things by using HSA runtime features (which require Linux kernel support). No way such rewriting code would get accepted into Rust, nor should it, imo.

These days this would need to go through a major change proposal at least if it's truly compiler-internal, or a full RFC if not: https://github.com/nikomatsakis/rfcs/blob/major-change-proposal/text/0000-compiler-major-change-process.md -- closing in favor of that. Thanks!

rust-lang / rust

[Mini-RFC] Tracking issue for single source cross-compilation #51623

Issues (as implemented, so mostly issues related to my runtime impl):