Replace LLVM with Cranelift

yorickpeterse commented 11 months ago

Description

We currently use LLVM, mainly because at the time of switching to a native code compiler Cranelift wasn't quite ready yet. I think since then it released the features we need (e.g. checked integer arithmetic). Switching would significantly simplify distribution and improve compile times. This issue is meant to keep track of what to keep in mind, what we'd have to change, etc.

It's important to keep in mind that we're not going to support both LLVM and Cranelift. Instead, the long-term goal is to write optimizations ourselves and use Cranelift/whatever to just turn that into machine code.

Related work

No response

Related issues

https://github.com/inko-lang/inko/issues/520: I'm looking into object caching to make up for LLVM being slow as molasses. If we switch to Cranelift, we might be able to put this aside for the foreseeable future, which I'd much prefer over implementing a complicated incremental compilation setup.

Missing Cranelift features

https://github.com/bytecodealliance/wasmtime/issues/7722

yorickpeterse commented 11 months ago

One issue I encountered last time, and which still seems present, is that it's not clear how one would produce basic debug information (enough to produce a stacktrace). I asked about this in https://github.com/bytecodealliance/wasmtime/issues/5141, but this didn't yield a simple/clear answer.

yorickpeterse commented 11 months ago

We use structure returns in a few places, mainly to interface with the Rust runtime. It seems Cranelift supports this, though it's not clear to what degree:

yorickpeterse commented 11 months ago

Tail calls are currently not supported: https://github.com/bytecodealliance/wasmtime/issues/1065. This doesn't matter much though as we don't make use of them, and LLVM's support for them is spotty as well.

yorickpeterse commented 11 months ago

LLVM has the getelementptr instruction to automatically calculate field offsets. In Cranelift you need to manually calculate these and pass them to the load() instruction. I'm not sure this is much of an issue though, as we can calculate these at compile-time and probably just pass them as a u32/whatever literal to load().

yorickpeterse commented 11 months ago

For stack slots, Cranelift has the following methods:

The declare_var defines a variable and its type. def_var assigns a value to a variable, and use_var marks it as used. The method use_var also loads the value of the variable, so no extra load() is needed for that. Cranelift then turns this into SSA for you.

yorickpeterse commented 11 months ago

Digging through the Cranelift code and associated projects, it seems producing debug info is still painful, requiring knowledge/generating of raw DWARF data through the gimli crate. I don't have such knowledge at the moment, so it will likely take a while before we can swap out LLVM with Cranelift.

yorickpeterse commented 11 months ago

Per https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/trait.InstBuilder.html#method.atomic_load and related methods, it seems Cranelift enforces sequential ordering for atomics. This could have a rather significant impact as we use atomics in a few places, but we absolutely don't need nor want sequential consistency there.

yorickpeterse commented 11 months ago

https://github.com/bytecodealliance/wasmtime/blob/b583c54fda13b53dea362861125dd1e2ced1381d/cranelift/codegen/src/isa/x64/lower.isle#L3213-L3223 states that at least on x86, atomics are just regular loads and stores (optionally with a fence for stores). That's not too bad, but it would be nice to have precise control over this.

yorickpeterse commented 11 months ago

Cranelift doesn't have an equivalent of llvm.powi. This means we need to use a runtime function call for that.

yorickpeterse commented 11 months ago

To define constant values, it seems you have to use declare_data_in_data to define it, then use DataDescription::define() to write the value.

You then "import" to the global into a function using declare_data_in_func, and load it using global_value, at least by the looks of it.

yorickpeterse commented 11 months ago

Calling a function is done as follows:

Build the instruction using e.g. call
Obtain the results using ins_results

yorickpeterse commented 11 months ago

Function imports are done using declare_function to import the signature, followed by declare_func_in_func to reference it.

yorickpeterse commented 11 months ago

Relevant crates:

yorickpeterse commented 11 months ago

In the LLVM backend, we rely on the mem2reg pass to construct the IR in SSA form. This requires allocating temporaries using alloca() at the start of a function. For this we use llvm::builder::Builder::new_stack_slot(), which injects the alloca() calls into the start of the function, regardless of where we currently are.

In Cranelift it seems this isn't necessary, as one can just call declare_var at-will. This means we could probably get rid of new_stack_slot entirely.

yorickpeterse commented 11 months ago

Jump tables/switches are done using br_table. The jump tables are created using create_jump_table.

I believe our Switch instruction always uses monotonically increasing numbers that start at zero, so we should be able to lower that directly to the br_table instruction.

bjorn3 commented 11 months ago

I believe our Switch instruction always uses monotonically increasing numbers that start at zero

If they don't you can use cranelift_frontend::Switch to generate a combination of jump tables and a decision tree. It is almost certainly not the most optimal, but it should save some effort and any optimizations to it would benefit cg_clif too :)

bjorn3 commented 11 months ago

To define constant values, it seems you have to use declare_data_in_data to define it, then use DataDescription::define() to write the value.

declare_data_in_data is for referencing a global variable from another global variable. To define one you need to first use declare_data and then define_data with a DataDescription as argument. The DataDescription can be filled with DataDescription::define and then you can use the write_data_addr and write_function_addr with the results of declare_data_in_data and declare_func_in_data respectively to reference other global variables and functions from this global variable.

yorickpeterse commented 11 months ago

@bjorn3 Thanks for the details! :smiley:

yorickpeterse commented 11 months ago

Per Zulip and the documentation: the Value type only represents simple types such as integers and floats. For stack allocated structures, one needs to use create_sized_stack_slot and related methods and types. The stack slots registered for this are not reused through some sort of clever mechanism. This means we'd need to come up with some sort of scheme to reuse slots no longer in use, otherwise we might end up using more stack space than necessary.

I'm not sure how we'd implement that though, because when generating stack loads/stores it's not clear if the slot is still used after that. This would probably require some form of liveness analysis.

yorickpeterse commented 11 months ago

Related to that:

We'll manually need to generate the correct signatures for functions that take or return structures. This in turn needs to take into account the structure sizes, as structures <= 8 bytes are passed directly while larger ones use an extra argument.

We could simplify things by dropping support for structure returns/arguments, but this requires adjusting the runtime library in a whole bunch of places, which I'd like to avoid.

yorickpeterse commented 11 months ago

I think my biggest concern with Cranelift so far is that it's perhaps a bit too low-level. For example, we need to calculate structure sizes, field offsets, etc ourselves. Because of this, I suspect the amount of code needed in the backend would roughly equal that of our current LLVM backend, as we'd simplify things in one area but complicate them in another. For example, we have a bunch of code to specify the structure sizes and what not such that LLVM can calculate the right offsets. I had initially hoped that most of this could go away with Cranelift, but it turns out we'd have to keep most of it.

I think Cranelift would really benefit from a slightly more high-level wrapper that abstracts over the various platform differences and what not, basically LLVM--, but such a library doesn't exist unfortunately.

I'm going to experiment with doing LLVM generation in parallel, to see if/to what degree that's even possible with Inkwell. I'll also need to really think hard if Cranelift is worth the trouble in its current state.

yorickpeterse commented 11 months ago

Running LLVM/Inkwell in parallel appears tricky. In spite of some of the types being Sync, it appears that even if most of the data needed is created on a per-thread basis, Inkwell/LLVM segfaults. I'm guessing this is because we do share some StructType structures between threads in my quick hack, and those are probably mutated in some way.

Unfortunately, due to how Inkwell/LLVM is set up, we can't really work around sharing that particular data as it defines all the structure sizes and what not, and creating those for every thread is a waste.

If we do create that data on a per-thread basis, and ignore the code potentially being incorrect for a moment, compile times for the standard library tests go down from 3.3 seconds to 1.15 seconds with 8 threads. That's only 3x faster but with 8x the concurrency, which is disappointing. Note that for this experiment my queue is just Mutex<Vec<_>> and we could do better, but I'd be surprised if changing that yields another 5x improvement.

yorickpeterse commented 11 months ago

To add to the above: Inkwell exposes basically everything as immutable types/methods, even though LLVM mutates things under the hoods. This means that even if we were to get things to work in parallel, it may cause problems in the future if LLVM starts to mutate/share data where this isn't expected.

yorickpeterse commented 11 months ago

Regarding stack slots: based on how we use structure returns right now, and the fact that Inko values are heap allocated (apart from Int/Float/etc), I think stack slot reuse (or more precisely a lack thereof) isn't a big deal for the time being.

yorickpeterse commented 11 months ago

Another challenge: our FFI supports variadic function arguments, something that's needed to interface with certain C libraries (e.g. cURL). Per this issue Cranelift has no support for this whatsoever, and per this comment you basically have to emulate it.

I'm starting to lean more towards sticking with LLVM for the time being, and focusing more on making it parallel and incremental where possible, seeing as transitioning to Cranelift is likely going to require a lot of work.

yorickpeterse commented 3 months ago

I'm going to close this as there are no plans to do this in the next several years at least, in addition to Cranelift lacking various features that we need (as discussed in this thread).

inko-lang / inko