riscv-non-isa / riscv-elf-psabi-doc

A RISC-V ELF psABI Document
https://jira.riscv.org/browse/RVG-4
Creative Commons Attribution 4.0 International

Range extension thunks #425

Open sorear opened 7 months ago

sorear commented 7 months ago

We are considering three use cases here.

  1. A true large code model needs to support more than 2 GiB of text; data accesses are out of scope for this change but jumps and calls across a range of more than 2 GiB are needed. Most users of a large model will have more than 2 GiB of data but small text, or text with a highly local call pattern, so we want most calls to be able to use the auipc+jalr sequence. This would normally call for relaxation, but relaxation requires object files to contain the longest possible sequence, of which several are possible. Instead, keep the sequences the same and allow thunk insertion.

  2. For executables and shared objects in a Unix environment, most of the code size benefits of relaxation come from call->jal relaxation, not data or TLS relaxation. If the compiler is modified to generate jal instructions instead of call instructions, the code size benefits can be achieved without relaxation at all, but this requires JAL_THUNK to avoid relocation errors at a 1 MiB limit.

  3. If a function has many static call sites in a large binary but is known to be dynamically cold, due to a function attribute or PGO, the call sites can be replaced with jal instructions, sharing a single thunk between all call sites within a 2 MiB text region. This saves code size at small runtime cost.
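The two call forms discussed in these use cases can be sketched as follows (symbol and label names are placeholders, not part of the proposal):

```asm
# Medium-range call: the sequence behind the `call` pseudo-instruction.
# R_RISCV_CALL_PLT relocation, 8 bytes, ±2 GiB PC-relative reach.
.Lcall_site:
    auipc ra, %pcrel_hi(func)
    jalr  ra, %pcrel_lo(.Lcall_site)(ra)

# Short call: a single jal instruction.
# R_RISCV_JAL relocation, 4 bytes, ±1 MiB PC-relative reach.
    jal   ra, func
```

Use case 2 generates the second form directly in the compiler, trading the ±2 GiB reach for half the code size per call site, with thunks making up the difference when ±1 MiB is not enough.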

Restricting the register usage of the thunks is an intentional feature copied from the Go 1.15 toolchain, where every non-leaf function requires a conditional call to runtime.morestack in the prologue; since ra cannot be saved before the stack frame is allocated, the call is performed using t0 as the return register.
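A range extension thunk compatible with that constraint might look like the following sketch (register and label choices are illustrative, matching the Go-style t0-return convention described above):

```asm
# Call site: short-range call with t0 as the return address register,
# because ra cannot yet be saved (no stack frame exists at this point).
caller:
    jal   t0, .Lthunk                # ±1 MiB reach

# Linker-inserted thunk: extends the reach to ±2 GiB. It must preserve
# t0 (the live return address), so it clobbers only another temporary.
.Lthunk:
    auipc t1, %pcrel_hi(target)
    jalr  zero, %pcrel_lo(.Lthunk)(t1)   # tail jump; t0 still holds return
```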


There's an argument to be made that we can use thunks for JAL, CALL, and CALL_PLT because linking would fail otherwise. I'd rather not accept the risk of breakage with that. There's also an argument that CALL_PLT is always allowed to use a PLT, and PLT stubs clobber t1 and t2 even with STO_RISCV_VARIANT_CC, so substituting CALL_THUNK for CALL_PLT is always fine. I'm more sympathetic to that, but we need JAL_PLT for 2/3 use cases so why not add both.

I don't have a working PoC or a schedule for preparing a working PoC. As this is my first time on this side of the current process, what level of functional completeness are we looking for? Do I need to cover all three use cases or would it be enough to have a working clang -mshort-calls and lld combination and results showing a binary size decrease?

Recommend using CALL_THUNK for #388, FDPIC, and any other new code models or sub-ABIs.

aswaterman commented 7 months ago

For executables and shared objects in a Unix environment, most of the code size benefits of relaxation come from call->jal relaxation, not data or TLS relaxation. If the compiler is modified to generate jal instructions instead of call instructions, the code size benefits can be achieved without relaxation at all, but this requires JAL_THUNK to avoid relocation errors at a 1 MiB limit.

This seems out of scope to me, as it is a solution in search of a problem. Also, we need to support functions that are >1 MiB in size--something GCC does poorly currently but will need to get better at. That would require JAL thunks to be interspersed within a function, which sounds undesirably messy.

So, I'd focus on the large code model motivation. (The dynamically cold call site optimization is also interesting, and though it also has problems with >1 MiB functions, that's OK, because it's purely opportunistic.)

jrtc27 commented 7 months ago

PLT stubs clobber t1 and t2 even with STO_RISCV_VARIANT_CC

PLT stubs clobber t1 and t3 unconditionally (I don't know why that was picked in the first place, but hey, that's what we've got). Without STO_RISCV_VARIANT_CC and -z now / LD_BIND_NOW the header also clobbers t0 and t2, but of course the resolver then clobbers some arbitrary subset of the caller-saved register set.
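For reference, the standard PLT entry defined by the psABI shows where the t1/t3 clobbers come from (sketch for RV64; the GOT symbol name is illustrative):

```asm
# Standard RISC-V PLT entry: t3 receives the target (or resolver) address,
# t1 records the address after the jalr so the PLT header can identify
# which entry was taken during lazy binding.
1:  auipc t3, %pcrel_hi(.got.plt.func)   # address of the GOT slot
    ld    t3, %pcrel_lo(1b)(t3)          # load the resolved address
    jalr  t1, t3                         # jump; t1 = return point into PLT
    nop
```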

but we need JAL_PLT for 2/3 use cases so why not add both.

Because that eats precious relocation encoding space for something that does nothing new. We've already got the wastage from CALL+CALL_PLT, we don't need to add yet another relocation that implementations will treat identically to the other two. RV64 has enough space that one should never run out, but RV32 is very constrained.

rui314 commented 7 months ago

This change would significantly diverge the RISC-V psABI from the psABIs of other RISC processors. Other psABIs do not distinguish range-extendable CALL relocation and the usual one. The linker automatically takes care of everything and creates range extension thunks as needed. (2) and (3) can be implemented without making a modification to the psABI, too. So I don't see a reason to add new relocations.

sorear commented 7 months ago

@aswaterman

This seems out of scope to me, as it is a solution in search of a problem.

Are you saying that code size / speed tradeoffs in general are out of scope (what about Zcmt?), or are you making a narrower point that I'm failing to see?

My current thought for the prototype is

-mshort-calls=never   don't use short calls
-mshort-calls=cold    if [[gnu::cold]] is present
-mshort-calls=always  if function/section size is less than 2 MiB

That would require JAL thunks to be interspersed within a function, which sounds undesirably messy.

The compiler knows how large the function/section is and can switch to AUIPC+JALR if external references would struggle to reach outside it; it needs to do this anyway for internal references, and I've done it before (there was a function larger than 1 MiB in the Go test suite).

I don't intend for the linker to ever insert bytes in the middle of an input section.

So, I'd focus on the large code model motivation. (The dynamically cold call site optimization is also interesting, and though it also has problems with >1 MiB functions, that's OK, because it's purely opportunistic.)

Unfortunately, the large code model isn't close to done, so I can't do much with it without taking over a much larger project. Likewise I am not confident in my ability to get PGO working on a nontrivial project. So short calls on the basis of a command line option / attribute are the most likely thing I'll be able to have an end to end demonstration of in a reasonable length of time.

@jrtc27

PLT stubs clobber t1 and t3 unconditionally

Good catch. Unfortunately, we need t2 as one of our registers for future Zicfilp support (range extension thunks are software-guarded branches), whether or not Zicfilp modifies the PLT stub to use t2.

we don't need to add yet another relocation that implementations will treat identically to the other two

I'll drop CALL_THUNK then. It potentially turns link failures into runtime undefined behavior if anyone is using CALL_PLT for DSO-local functions with a custom calling convention that uses t2; this won't affect -msave-restore, but is anyone doing interprocedural register allocation?

@rui314

So I don't see a reason to add new relocations.

To be clear, you want to drop both CALL_THUNK and JAL_THUNK, and add range extension thunk semantics for CALL, CALL_PLT, and JAL?

I think at a minimum we'd need a clear statement that R_RISCV_JAL within a single section over less than a 1 MiB range will never generate a thunk.

aswaterman commented 7 months ago

Are you saying that code size / speed tradeoffs in general are out of scope (what about Zcmt?), or are you making a narrower point that I'm failing to see?

Not at all! My point was that motivation (2) is unconvincing unless you assume a fictitious universe in which linker relaxation does not exist. But, as I wrote, (1) is a sufficient motivation for this idea, and (3) is an interesting direction. I'm not pooh-poohing the idea; I'm just trying to help hone it.

rui314 commented 7 months ago

To be clear, you want to drop both CALL_THUNK and JAL_THUNK, and add range extension thunk semantics for CALL, CALL_PLT, and JAL?

Only CALL_PLT (and CALL as it's a synonym for CALL_PLT) would require a range extension thunk, as that's the only relocation in this ABI to call a function that may be at an arbitrary memory address.

Essentially, I don't think you need to invent something new for RISC-V for range extension thunks. You can instead just do what other RISC psABIs do. As a reference, this is what the AArch64 psABI does for range extension thunks: https://github.com/ARM-software/abi-aa/blob/2a70c42d62e9c3eb5887fa50b71257f20daca6f9/aaelf64/aaelf64.rst#L1281

sorear commented 7 months ago

Take 2, now with a better description of use case 2 and with the requested feature of "surprise ABI breaks for currently working code".

There are at least three ways to get current gcc or clang to expect t2 to be valid across a call:

What is the path forward for these? Do we change them to not use t2 and retroactively declare old compilers buggy? Do we add a new mechanism for compilers to communicate which registers they expect to live across a call? Do we do nothing, and call it user error to create a binary which exceeds 2 GiB of text when using the medlow/medany code models?

rui314 commented 7 months ago

It looks much better, but I think we first need to answer this question: do we need range extension thunks for RISC-V?

Other RISC psABIs required range extension thunks because, without them, they couldn't support a medium code model with binaries up to 2 GiB in size. They use only a single instruction for function calls, and therefore the "reach" of function call instructions is limited (typically to ±128 MiB or less). On the other hand, they use longer code sequences for data access, so range extenders are not needed for data loads and stores.

In other words, other RISC psABIs required range extension thunks to bridge the discrepancy between the reach of code references and that of data references. To build binaries larger than 2 GiB, we usually need to build them with the large code model in the first place so that data references can refer to locations beyond ±2 GiB.

RISC-V's medium code model doesn't have the above-mentioned issue because both code and data references can address ±2 GiB. That means the situation in which range extension thunks are useful is very limited: they're useful only when we have code scattered across more than a 2 GiB address range while all data references stay within 2 GiB of the code location. There might be a program that fits that use case, but I honestly think it would be very rare.
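The symmetry being described can be seen in the medany code model's sequences, where data access and calls use the same PC-relative mechanism (symbol names are placeholders):

```asm
# medany: data references and calls both have ±2 GiB PC-relative reach,
# so there is no code/data reach discrepancy for thunks to bridge.
.Ldata:
    auipc a0, %pcrel_hi(global_var)
    ld    a0, %pcrel_lo(.Ldata)(a0)    # data load: ±2 GiB

.Lcall:
    auipc ra, %pcrel_hi(func)
    jalr  ra, %pcrel_lo(.Lcall)(ra)    # call: also ±2 GiB
```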

sorear commented 7 months ago

do we need range extension thunks for RISC-V?

No.

It's an optimization, albeit an important one that will be painful to retrofit if it isn't considered early.

That means the situation in which range extension thunks are useful is very limited; they're useful only when we have code scattered across more than a 2 GiB address range while all data references are within 2 GiB from the code location.

That's the situation in which range extension thunks are present. They are useful precisely when they are absent: less than 2 GiB of relatively contiguous code, and data scattered across the address space. I expect this to represent the majority of uses of large models.

If we want to close the call performance gap between the medium and large models, we need to either define relaxations to turn a 64-bit-range call into a 32-bit-range call and then use them on every call in every input object, or define range extension thunks and not use them.
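As a purely illustrative sketch of the former option (no large code model is ratified, so the sequence, relocation, and symbol names below are all hypothetical), a 64-bit-range call could go through a constant pool or GOT-like slot, with a relaxation shrinking it back to auipc+jalr when the target turns out to be within ±2 GiB:

```asm
# Hypothetical large-model call: load the full 64-bit target address
# from a nearby slot, then jump indirectly. A linker relaxation could
# rewrite this to the 8-byte auipc+jalr form for in-range targets.
.Lbig_call:
    auipc t1, %pcrel_hi(.Laddr_slot)
    ld    t1, %pcrel_lo(.Lbig_call)(t1)   # full 64-bit target address
    jalr  ra, 0(t1)
```

The relaxation-on-every-call approach pays this longer sequence in every input object; the thunk approach keeps the short sequence and pays only when a thunk is actually needed.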

I think that the latter option is better for several reasons:

rui314 commented 7 months ago

All of the bullet points appear to be hypothetical and not validated by any actual experience or implementation. Utilizing a long code sequence with a full 64-bit offset for a function call and allowing the linker to relax it would be a logical expansion for the RISC-V large code model, as that's what we are doing for the medium code model.

Moreover, linker relaxation and range extension thunks are independent. The linker is permitted to insert range extension thunks into a program even under the current psABI, as long as it doesn't violate the ABI's assumptions (i.e., as long as range extension thunks preserve registers just like PLT entries do). Therefore, I believe defining range extension thunks at this moment is not absolutely necessary. To me, it appears that this proposal is too early to be ratified.

aswaterman commented 7 months ago

I would also favor relaxation from the hypothetical large code model to the current regime as the default approach, given that we already have the relaxation arrow in our quiver.

It’s clear to me that consideration (3) from Stef’s original proposal could bear some fruit, though it’s an optimization beyond the broader topic of large code model support.

sorear commented 7 months ago

All of the bullet points appear to be hypothetical and not validated by any actual experience or implementation.

https://review.gerrithub.io/c/riscv/riscv-go/+/352852/5

The linker is permitted to insert range extension thunks into a program, even with the current psABI, as long as it doesn't violate the ABI's assumptions (i.e., as long as range extension thunks preserve registers just like PLT entries do).

Do you consider it a bug that LLVM treats t3/X28 as preserved by call instructions for the purposes of ipra and fastcc, despite the fact that PLT entries clobber t3?

When Zicfilp happens, "PLT entries" for address-not-taken functions without a landing pad will need to be changed to use t2/X7 for software-guarded branch reasons. Does this mean that t2 will be removed from the list of registers preserved across a CALL_PLT? Does this mean that the register used for LLVM trampoline intrinsics and GCC nested functions will have to change to something else?

(What external constraint is responsible for Zicfilp using X7 instead of X6 in spite of the ABI problems this causes?)
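For context, the software-guarded branch mechanism referenced here comes from the draft Zicfilp spec, under which an indirect jump whose source register is x7 (t2) does not require an lpad instruction at the target. A Zicfilp-compatible thunk would therefore look something like this sketch (label names illustrative, and subject to the draft spec changing):

```asm
# Software-guarded tail jump per the draft Zicfilp spec: using t2 (x7)
# as the jump source register exempts the target from needing a landing
# pad, at the cost of making t2 unavailable across the thunk.
.Lthunk:
    auipc t2, %pcrel_hi(target)
    jalr  zero, %pcrel_lo(.Lthunk)(t2)
```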

rui314 commented 7 months ago

Do you consider it a bug that LLVM treats t3/X28 as preserved by call instructions for the purposes of ipra and fastcc, despite the fact that PLT entries clobber t3?

I guess it depends. ipra and fastcc don't follow the calling convention defined by this psABI, so I'd think they are implementation-defined optimizations.

When Zicfilp happens, "PLT entries" for address-not-taken functions without a landing pad will need to be changed to use t2/X7 for software-guarded branch reasons. Does this mean that t2 will be removed from the list of registers preserved across a CALL_PLT? Does this mean that the register used for LLVM trampoline intrinsics and GCC nested functions will have to change to something else?

That needs to be addressed in the Zicfilp spec, no?

sorear commented 6 months ago

I guess it depends. ipra and fastcc doesn't follow the calling convention defined by this psABI, so I'd think they are implementation-defined optimizations.

ipra and fastcc opt out of the calling convention, but they don't opt out of the relocation scheme, and LLVM generates normal relocatable files with nothing to prevent them from being consumed by a third-party linker that implements the psABI exactly. So I don't think that "it's implementation-defined" is a blanket license to ignore relocation rules.

That needs to be addressed in the Zicfilp spec, no?

Are you advocating changing the draft Zicfilp spec from using t2 to using t1 for compatibility with the ratified ABI, or advocating for Zicfilp to define a new ABI that reserves t2 in PLT stubs?

kito-cheng commented 6 months ago

FYI, I think...we should reconsider Zicfilp should use t1 rather than t2 for the landing pad label register https://github.com/riscv/riscv-cfi/issues/208