Closed yorickpeterse closed 3 months ago
One issue I encountered last time, and which still seems present, is that it's not clear how one would produce basic debug information (enough to produce a stacktrace). I asked about this in https://github.com/bytecodealliance/wasmtime/issues/5141, but this didn't yield a simple/clear answer.
We use structure returns in a few places, mainly to interface with the Rust runtime. It seems Cranelift supports this, though it's not clear to what degree.
Tail calls are currently not supported: https://github.com/bytecodealliance/wasmtime/issues/1065. This doesn't matter much though as we don't make use of them, and LLVM's support for them is spotty as well.
LLVM has the `getelementptr` instruction to automatically calculate field offsets. In Cranelift you need to calculate these manually and pass them to the `load()` instruction. I'm not sure this is much of an issue though, as we can calculate these at compile-time and probably just pass them as a `u32`/whatever literal to `load()`.
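As an illustration of what this manual bookkeeping involves, here is a minimal, self-contained sketch of field layout with alignment. The `align_up`/`layout` helpers are hypothetical names for our own compiler code, not part of Cranelift's API:

```rust
// Minimal sketch of manual field layout (what `getelementptr` hides in LLVM):
// each field is placed at the next offset aligned to its own alignment, and
// the total size is rounded up to the largest field alignment.
fn align_up(offset: u32, align: u32) -> u32 {
    (offset + align - 1) & !(align - 1)
}

/// Returns (field offsets, total size) for fields given as (size, align) pairs.
fn layout(fields: &[(u32, u32)]) -> (Vec<u32>, u32) {
    let mut offset = 0;
    let mut max_align = 1;
    let mut offsets = Vec::new();
    for &(size, align) in fields {
        offset = align_up(offset, align);
        offsets.push(offset);
        offset += size;
        max_align = max_align.max(align);
    }
    (offsets, align_up(offset, max_align))
}

fn main() {
    // struct { i8, i64, i32 } -> offsets 0, 8, 16; total size 24
    let (offsets, size) = layout(&[(1, 1), (8, 8), (4, 4)]);
    println!("{:?} {}", offsets, size);
}
```

The resulting offsets would then be passed as literal immediates to `load()`/`store()`.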
For stack slots, Cranelift has the following methods: `declare_var` defines a variable and its type, `def_var` assigns a value to a variable, and `use_var` marks it as used. `use_var` also loads the value of the variable, so no extra `load()` is needed for that. Cranelift then turns this into SSA for you.
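A rough sketch of that flow, assuming a `cranelift_frontend::FunctionBuilder` already positioned inside a block (this won't compile without the Cranelift crates; it only illustrates the API shape):

```rust
use cranelift_codegen::ir::{types, InstBuilder};
use cranelift_frontend::{FunctionBuilder, Variable};

// Sketch only: `builder` is assumed to be switched to a block already.
fn var_example(builder: &mut FunctionBuilder) {
    let x = Variable::new(0);
    builder.declare_var(x, types::I64);           // declare the variable and its type
    let c = builder.ins().iconst(types::I64, 42);
    builder.def_var(x, c);                         // assign a value to it
    let v = builder.use_var(x);                    // read it back as an SSA Value; no load() needed
    let one = builder.ins().iconst(types::I64, 1);
    builder.ins().iadd(v, one);
}
```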
Digging through the Cranelift code and associated projects, it seems producing debug info is still painful, requiring knowledge/generating of raw DWARF data through the gimli crate. I don't have such knowledge at the moment, so it will likely take a while before we can swap out LLVM with Cranelift.
Per https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/trait.InstBuilder.html#method.atomic_load and related methods, it seems Cranelift enforces sequential ordering for atomics. This could have a rather significant impact as we use atomics in a few places, but we neither need nor want sequential consistency there.
https://github.com/bytecodealliance/wasmtime/blob/b583c54fda13b53dea362861125dd1e2ced1381d/cranelift/codegen/src/isa/x64/lower.isle#L3213-L3223 states that at least on x86, atomics are just regular loads and stores (optionally with a fence for stores). That's not too bad, but it would be nice to have precise control over this.
Cranelift doesn't have an equivalent of `llvm.powi`. This means we need to use a runtime function call for that.
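For reference, such a runtime function could implement exponentiation by squaring, matching `f64::powi` semantics. A minimal stdlib sketch (the `powi` name here is ours, not a Cranelift or runtime symbol):

```rust
// Exponentiation by squaring for f64 base and i32 exponent, i.e. what a
// runtime replacement for `llvm.powi` could look like.
fn powi(base: f64, exp: i32) -> f64 {
    let mut result = 1.0;
    let mut b = base;
    let mut e = exp.unsigned_abs();
    while e > 0 {
        if e & 1 == 1 {
            result *= b;   // fold in the current bit of the exponent
        }
        b *= b;            // square the base for the next bit
        e >>= 1;
    }
    if exp < 0 { 1.0 / result } else { result }
}

fn main() {
    println!("{}", powi(2.0, 10)); // 1024
}
```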
To define constant values, it seems you have to use `declare_data_in_data` to define it, then use `DataDescription::define()` to write the value. You then "import" the global into a function using `declare_data_in_func`, and load it using `global_value`, at least by the looks of it.
Calling a function is done as follows: emit the call using `call`, then retrieve the return values using `inst_results`.
Function imports are done using `declare_function` to import the signature, followed by `declare_func_in_func` to reference it.
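Putting the import and call steps together, a hedged sketch assuming a `cranelift_module::Module` implementation and an active `FunctionBuilder` (the function name and signature are illustrative; this won't compile without the Cranelift crates):

```rust
use cranelift_codegen::ir::{types, AbiParam, InstBuilder};
use cranelift_frontend::FunctionBuilder;
use cranelift_module::{Linkage, Module};

// Sketch only: import a function `example: i64 -> i64`, then call it.
fn call_example(module: &mut impl Module, builder: &mut FunctionBuilder) {
    let mut sig = module.make_signature();
    sig.params.push(AbiParam::new(types::I64));
    sig.returns.push(AbiParam::new(types::I64));

    // Import the signature, then reference it inside the current function.
    let func_id = module
        .declare_function("example", Linkage::Import, &sig)
        .unwrap();
    let func_ref = module.declare_func_in_func(func_id, builder.func);

    // Emit the call and fetch its results.
    let arg = builder.ins().iconst(types::I64, 1);
    let call = builder.ins().call(func_ref, &[arg]);
    let _result = builder.inst_results(call)[0];
}
```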
In the LLVM backend, we rely on the mem2reg pass to construct the IR in SSA form. This requires allocating temporaries using `alloca()` at the start of a function. For this we use `llvm::builder::Builder::new_stack_slot()`, which injects the `alloca()` calls into the start of the function, regardless of where we currently are. In Cranelift it seems this isn't necessary, as one can just call `declare_var` at will. This means we could probably get rid of `new_stack_slot` entirely.
Jump tables/switches are done using `br_table`. The jump tables themselves are created using `create_jump_table`. I believe our `Switch` instruction always uses monotonically increasing numbers that start at zero, so we should be able to lower it directly to the `br_table` instruction.
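To illustrate why the dense, zero-based values matter: `br_table` is essentially a bounds check plus an index into a table of targets. A stdlib sketch of that idea, with plain functions standing in for blocks (the `lower_switch` helper is hypothetical):

```rust
// Dense, zero-based switch values map directly onto a table index: in-range
// values jump to table[value], everything else falls through to the default.
fn lower_switch(
    value: usize,
    table: &[fn() -> &'static str],
    default: fn() -> &'static str,
) -> &'static str {
    // This bounds check + index is exactly what `br_table` performs.
    match table.get(value) {
        Some(target) => target(),
        None => default(),
    }
}

fn main() {
    let table: [fn() -> &'static str; 3] = [|| "zero", || "one", || "two"];
    println!("{}", lower_switch(1, &table, || "default"));
    println!("{}", lower_switch(9, &table, || "default"));
}
```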
> I believe our `Switch` instruction always uses monotonically increasing numbers that start at zero
If they don't, you can use `cranelift_frontend::Switch` to generate a combination of jump tables and a decision tree. It's almost certainly not optimal, but it should save some effort, and any optimizations to it would benefit cg_clif too :)
> To define constant values, it seems you have to use `declare_data_in_data` to define it, then use `DataDescription::define()` to write the value.
`declare_data_in_data` is for referencing a global variable from another global variable. To define one, you need to first use `declare_data` and then `define_data` with a `DataDescription` as argument. The `DataDescription` can be filled with `DataDescription::define`, and then you can use `write_data_addr` and `write_function_addr` with the results of `declare_data_in_data` and `declare_func_in_data` respectively, to reference other global variables and functions from this global variable.
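Based on that description, a hedged sketch of the flow, assuming a `cranelift_module::Module` implementation (the name and bytes are illustrative, and this won't compile without the Cranelift crates):

```rust
use cranelift_module::{DataDescription, DataId, Linkage, Module};

// Sketch only: define a non-writable, non-TLS global constant.
fn define_constant(module: &mut impl Module) -> DataId {
    let data_id = module
        .declare_data("my_constant", Linkage::Local, false, false)
        .unwrap();

    let mut desc = DataDescription::new();
    desc.define(Box::new([1, 2, 3, 4])); // the raw bytes of the constant

    // write_data_addr/write_function_addr would go here to embed pointers to
    // other globals/functions, using declare_data_in_data/declare_func_in_data.
    module.define_data(data_id, &desc).unwrap();
    data_id
}
```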
@bjorn3 Thanks for the details! :smiley:
Per Zulip and the documentation: the `Value` type only represents simple types such as integers and floats. For stack-allocated structures, one needs to use `create_sized_stack_slot` and related methods and types. The stack slots registered this way are not reused through some sort of clever mechanism. This means we'd need to come up with a scheme to reuse slots that are no longer in use, otherwise we might end up using more stack space than necessary.
I'm not sure how we'd implement that though, because when generating stack loads/stores it's not clear if the slot is still used after that. This would probably require some form of liveness analysis.
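One possible scheme, assuming we did have last-use information: keep a free list of dead slots keyed by size. A hypothetical stdlib sketch (the types are ours, with plain integers standing in for Cranelift's `StackSlot` values):

```rust
use std::collections::HashMap;

// Hypothetical slot-reuse scheme: slots are handed out per size and returned
// to a free list once liveness information says they're dead.
#[derive(Default)]
struct SlotAllocator {
    next_id: u32,
    free: HashMap<u32, Vec<u32>>, // size -> reusable slot IDs
}

impl SlotAllocator {
    fn acquire(&mut self, size: u32) -> u32 {
        if let Some(id) = self.free.get_mut(&size).and_then(|v| v.pop()) {
            return id; // reuse a dead slot of the same size
        }
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    fn release(&mut self, size: u32, id: u32) {
        self.free.entry(size).or_default().push(id);
    }
}

fn main() {
    let mut alloc = SlotAllocator::default();
    let a = alloc.acquire(16); // new slot
    alloc.release(16, a);      // slot is now dead
    let b = alloc.acquire(16); // reuses the same slot
    println!("{} {}", a, b);
}
```

The hard part is exactly the `release` calls: without liveness analysis we don't know when it's safe to make them.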
Related to that:
We'll need to manually generate the correct signatures for functions that take or return structures. This in turn needs to take the structure sizes into account, as structures of at most 8 bytes are passed directly while larger ones use an extra argument.
We could simplify things by dropping support for structure returns/arguments, but this requires adjusting the runtime library in a whole bunch of places, which I'd like to avoid.
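The size rule above boils down to a small classification step when building signatures. A sketch under the simplifying assumption that only size matters (real ABIs have more cases than this):

```rust
// Simplified model of the rule described above: structs of at most 8 bytes
// travel directly (in a register), larger ones go through a pointer argument.
#[derive(Debug, PartialEq)]
enum StructAbi {
    Direct,    // passed/returned by value
    ByPointer, // passed/returned via an extra pointer argument
}

fn classify_struct(size: u32) -> StructAbi {
    if size <= 8 {
        StructAbi::Direct
    } else {
        StructAbi::ByPointer
    }
}

fn main() {
    println!("{:?}", classify_struct(8));
    println!("{:?}", classify_struct(24));
}
```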
I think my biggest concern with Cranelift so far is that it's perhaps a bit too low-level. For example, we need to calculate structure sizes, field offsets, etc ourselves. Because of this, I suspect the amount of code needed in the backend would roughly equal that of our current LLVM backend, as we'd simplify things in one area but complicate them in another. For example, we have a bunch of code to specify the structure sizes and what not such that LLVM can calculate the right offsets. I had initially hoped that most of this could go away with Cranelift, but it turns out we'd have to keep most of it.
I think Cranelift would really benefit from a slightly more high-level wrapper that abstracts over the various platform differences and what not, basically LLVM--, but such a library doesn't exist unfortunately.
I'm going to experiment with doing LLVM generation in parallel, to see if/to what degree that's even possible with Inkwell. I'll also need to really think hard if Cranelift is worth the trouble in its current state.
Running LLVM/Inkwell in parallel appears tricky. In spite of some of the types being `Sync`, Inkwell/LLVM segfaults even when most of the data needed is created on a per-thread basis. I'm guessing this is because my quick hack shares some `StructType` structures between threads, and those are probably mutated in some way.
Unfortunately, due to how Inkwell/LLVM is set up, we can't really work around sharing that particular data as it defines all the structure sizes and what not, and creating those for every thread is a waste.
If we do create that data on a per-thread basis, and ignore the code potentially being incorrect for a moment, compile times for the standard library tests go down from 3.3 seconds to 1.15 seconds with 8 threads. That's only ~3x faster with 8x the concurrency, which is disappointing. Note that for this experiment my queue is just a `Mutex<Vec<_>>` and we could do better, but I'd be surprised if changing that yields another 5x improvement.
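For context, the queue in question looks roughly like this: worker threads pop work from a shared `Mutex<Vec<_>>`, so every pop contends on a single lock. The `process_parallel` helper below is a stand-in, not actual compiler code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Worker threads repeatedly pop items from one shared Mutex<Vec<_>> until it
// is empty; summing stands in for compiling a module. A channel or
// work-stealing deque would contend less on the lock.
fn process_parallel(items: Vec<u64>, threads: usize) -> u64 {
    let queue = Arc::new(Mutex::new(items));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let queue = Arc::clone(&queue);
        handles.push(thread::spawn(move || {
            let mut local = 0;
            loop {
                // Lock is held only for the pop itself.
                let item = queue.lock().unwrap().pop();
                match item {
                    Some(n) => local += n,
                    None => break,
                }
            }
            local
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("{}", process_parallel((1..=100).collect(), 8)); // 5050
}
```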
To add to the above: Inkwell exposes basically everything as immutable types/methods, even though LLVM mutates things under the hood. This means that even if we were to get things to work in parallel, it may cause problems in the future if LLVM starts to mutate/share data where this isn't expected.
Regarding stack slots: based on how we use structure returns right now, and the fact that Inko values are heap allocated (apart from Int/Float/etc), I think stack slot reuse (or more precisely a lack thereof) isn't a big deal for the time being.
Another challenge: our FFI supports variadic function arguments, something that's needed to interface with certain C libraries (e.g. cURL). Per this issue, Cranelift has no support for this whatsoever, and per this comment, you basically have to emulate it.
I'm starting to lean more towards sticking with LLVM for the time being, and focusing more on making it parallel and incremental where possible, seeing as transitioning to Cranelift is likely going to require a lot of work.
I'm going to close this as there are no plans to do this in the next several years at least, in addition to Cranelift lacking various features that we need (as discussed in this thread).
Description
We currently use LLVM, mainly because Cranelift wasn't quite ready yet at the time we switched to a native-code compiler. I think it has since gained the features we need (e.g. checked integer arithmetic). Switching would significantly simplify distribution and improve compile times. This issue is meant to keep track of what to keep in mind, what we'd have to change, etc.
It's important to keep in mind that we're not going to support both LLVM and Cranelift. Instead, the long-term goal is to write optimizations ourselves and use Cranelift/whatever to just turn that into machine code.
Related work
No response
Related issues
Missing Cranelift features