Relax GOT indirection into PC-relative addressing

rui314 commented 1 year ago

Some psABIs define the linker optimizations to relax a GOT load into a PC-relative address materialization. AArch64 [1] allows the linker to rewrite ADRP+ADD into ADR. x86-64 does the same thing with the R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations.

In our case, we have lots of AUIPC+LD instruction pairs to load an address from the GOT in our RISC-V programs because la assembly pseudo instruction is expanded to that instruction pair. If the PC-relative address loaded by the instruction pair is a link-time constant, we can rewrite the instructions with AUIPC+ADDI to directly materialize the value into a register, which eliminates one memory load.

We can't rewrite existing AUIPC+ADDI instruction pairs unconditionally because technically there may exist code that jumps to the middle of the instruction pair. That should be extremely rare, but we can't deny the possibility. Therefore, R_RISCV_GOT_HI20 is required to be annotated with R_RISCV_RELAX in this proposal. Currently, neither gas nor LLVM assembler emit R_RISCV_RELAX for R_RISCV_GOT_HI20. I have a WIP patch to emit R_RISCV_RELAX only for the la pseudo instruction.

[1] https://github.com/ARM-software/abi-aa/blob/844a79fd4c77252a11342709e3b27b2c9f590cf1/aaelf64/aaelf64.rst#relocation-optimization

aswaterman commented 1 year ago

The following question is more about the implementation than the ABI, but does this also mean we can sometimes garbage-collect the corresponding GOT entry? That would also serve to slightly improve spatial locality in GOT accesses.

rui314 commented 1 year ago

@aswaterman Yes we can, and the following sentence in this proposal should cover that topic.

If this relaxation eliminates all references to the symbol's GOT slot, the linker may opt not to create a GOT slot for that symbol.

rui314 commented 1 year ago

As a benchmark, lld contains about 100k R_RISCV_GOT_HI20 relocations in total in its input object files. When I tried to build lld with this relaxation, about the half of the relocations were optimized and turned into PC-relative addressing. The remaining R_RISCV_GOT_HI20 relocations referred to symbols defined by external shared objects and couldn't be optimized by this relaxation.

rui314 commented 1 year ago

Patch for GNU assembler: https://sourceware.org/pipermail/binutils/2023-September/129523.html Patch for the mold linker: https://github.com/rui314/mold/commit/a19783d82a5133dd5e2f07762761565a1c306128

MaskRay commented 1 year ago

Seems fine to me. The AArch64 scheme (I participated in the review) has the nice property that adrp+ldr costs just 2 relocations while ours costs 4. AArch64 did not want to introduce new relocation types for compatibility with existing toolchains.

Technically, we can allow GOT to PC-relative optimization even in the absence of relaxation. With R_RISCV_RELAX, we can allow removing one instruction. Also, I am not sure we need 2 R_RISCV_RELAX.

The AArch64 ABI has explicitly called out properties when an optimization can be made: non-preemptible & not-ifunc (https://maskray.me/blog/2021-01-18-gnu-indirect-function#non-preemptible-ifunc). We shall also disallow relaxation to an absolute symbol for -shared/-pie links.

rui314 commented 1 year ago

Just like AArch64, I didn't want to introduce a new relocation for backward compatibility. In hindsight, the RISC-V psABI could have defined R_RISCV_GOT_LOAD for the auipc+ld instruction pair just like it did for auipc+jalr with R_RISCV_CALL. But it's too late to make such change to the ABI.

Technically, we can allow GOT to PC-relative optimization even in the absence of relaxation. With R_RISCV_RELAX, we can allow removing one instruction. Also, I am not sure we need 2 R_RISCV_RELAX.

You made a good point. Since the second ld instruction is annotated with R_RISCV_RELAX, it might be optimized out as a result of linker relaxation, so any code that directly jumps to the ld instruction is probably wrong -- that ld might not exist at runtime. So, maybe we don't need to make a change to the assembler and can just do this optimization to the existing object file?

rui314 commented 1 year ago

I made a change to the proposal so that it requires only one R_RISCV_RELAX for each relaxable auipc+ld instruction pair.

jrtc27 commented 1 year ago

This is what I was suggesting a while back when specifying that undef weak symbols be forced to be via the GOT for medany, so thanks for doing this. This should allow us to remove the note that allows GCC to not do that.

rui314 commented 1 year ago

@jrtc27 This proposal doesn't really relax references to unresolved weak symbols for the medany code model, because we handle only PC-relative references in this proposal. Since unresolved weak symbols are handled as if they were absolute symbols at address 0, their addresses aren't PC-relative constants.

But we could relax references to absolute symbols, too, so let me add that to this proposal. I intentionally omitted it at first because I thought that the use case is rare, but weak symbols are indeed a fairly important use case.

rui314 commented 1 year ago

Updated the proposal to allow to relax an absolute symbol at a very low address. Please take a look.

rui314 commented 1 year ago

@MaskRay @jrtc27 Any comments? My change to GNU assembler was approved and can be merged once this proposal is ratified.

MaskRay commented 1 year ago

My reference to AArch64 is meant to highlight that GOT optimization is not necessarily linked with linker relaxation. GOT optimization can be performed even in the absence of R_RISCV_RELAX relocations. My article at https://maskray.me/blog/2021-08-29-all-about-global-offset-table#got-optimization describes several architectures that provide linker optimization but not linker relaxation.

I am concerned that using 2 R_RISCV_RELAX relocations may not bring enough return on investment.

The auipc+addi result is that is mostly relevant as the two other forms have stringent address requirement and are position-dependent only.

If a toolchain decides to allow auipc+addi optimization using this scheme, relocatable object files will expense 2 Elf64_Rela relocations, which are 48 bytes. Embedded applications that use -fno-pic mostly avoid GOT anyway.

. I think x86-64 GOTPCRELX provides no difference on slightly larger applications. It just optimizes some microbenchmarks.

I'll be out of town since Friday for about 10 days. Sorry for being unable to respond quickly during that time.

rui314 commented 1 year ago

@MaskRay I chose to require R_RISCV_RELAX there because the existing code could be using the value set by auipc t0, %got_pcrel_hi20(foo) in some obscure ways. If that's the case, we can't rewrite or remove the instruction.

But I might be overthinking. A value computed by %got_pcrel_hi20(foo) doesn't make sense at all if it is not combined with a corresponding %pcrel_lo(label). So I understand your point. I just chose to be on the safer side.

Is there anyone who thinks we should require 'R_RISCV_RELAX' for %got_pcrel_hi20? If no one wants it, I can remove it from the proposal.

rui314 commented 1 year ago

I think I resolved all comments. Please take another look.

kito-cheng commented 1 year ago

I tend to mark with R_RISC_RELAX, one consideration is #393, also for consistency with other relaxation, of cause we can call it optimization rather than relax, but I am not sure does it means should we also disable those optimization when -mno-relax is given? That's kind of introducing new concept into psABI I think.

rui314 commented 1 year ago

@kito-cheng Yeah, that makes sense. We have never removed an instruction or replaced it with a compact one if there's no R_RISCV_RELAX pointing to that location. Removing an auipc pointed to by R_RISCV_GOT_HI20 without RELAX looks indeed inconsistent.

kito-cheng commented 1 year ago

Another random idea come to my mind - we can call it optimization IF we don't perform any code shrinking, so we may also optimize this case with replacing auipc with nop like other architecture, also same idea can apply to other existing relaxation too.

But it just throw an idea not review comments, so don't putting into this PR :p

And that's may useful for those scenarios which disable linker relaxation.

rui314 commented 1 year ago

That we have too many RELAX relocations is indeed a problem, but that's not new and not specific to this proposal. I created an issue for the topic, so let's discuss that here: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/issues/401

rui314 commented 1 year ago

Please take another look

rui314 commented 11 months ago

Ping?

rui314 commented 11 months ago

Gentle ping

MaskRay commented 11 months ago

Another random idea come to my mind - we can call it optimization IF we don't perform any code shrinking, so we may also optimize this case with replacing auipc with nop like other architecture, also same idea can apply to other existing relaxation too.

But it just throw an idea not review comments, so don't putting into this PR :p

And that's may useful for those scenarios which disable linker relaxation.

Right. Converting to the following is linker optimization, which does not shrink the section

    auipc   tX, <hi>
    addi    tY, tX, <lo>

I think this conversion does not need R_RISCV_RELAX.

This patch focuses on what we can do with R_RISCV_RELAX. We have

int6 c.li tY, <symbol-value>
int12 addi tY, zero, <symbol-value>
other auipc+addi

Frankly, I consider the first two forms not that useful. They require absolute addresses, which imply -fno-pic code. However, -fno-pic code typically doesn't generate GOT-generating relocations in the first place. int12 is rather constrained even for embedded programming, so the utilization of the relaxation for the first two will be very low. As a toolchain developer, I will not add complexity to assembler/linker for something I can control.

The last, auipc/addi, as mentioned, does not need R_RISCV_RELAX, and we have heard success story from AArch64 adrp+add.

rui314 commented 11 months ago

The forms to relax absolute symbols are for unresolved weak symbols. With these forms, we don't need to store zeros in GOT for unresolved weak symbols. I think this use case is too important that we can't dismiss.

MaskRay commented 11 months ago

The forms to relax absolute symbols are for unresolved weak symbols. With these forms, we don't need to store zeros in GOT for unresolved weak symbols. I think this use case is too important that we can't dismiss.

I think it differently. Undefined weak symbols are actually relatively rarely used (and many can be converted weak definitions) and almost every use case is non-measurable for performance/code size. I even believe most use cases are checked just once in one program invocation.

rui314 commented 11 months ago

My original proposal didn't include the relaxation of small absolute symbols. I added it as a response to https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/397#issuecomment-1725802256. So I guess you guys can discuss? @jrtc27

My opinion is it's a good relaxation because:

If we scan relocations to relax GOT-loading instructions for relative symbols anyway, there seems no reason to avoid the relaxation opportunity for small absolute symbols, and
if you executable is position-dependent and has a GOT only for weak undefined symbols, this relaxation can eliminate the GOT section entirely.

jrtc27 commented 11 months ago

if you executable is position-dependent and has a GOT only for weak undefined symbols, this relaxation can eliminate the GOT section entirely.

This was the reason given for why GCC wouldn't adopt the "use the GOT for possibly-undefined weak symbols + -mcmodel=medany" specification and it had to be weakened for the 1.0 psABI. So either that was wrong and we can just do it anyway in GCC or we need this for GCC to use the GOT in this case (and thus not require instruction rewriting with -mno-relax). Doing neither isn't a solution though.

rui314 commented 11 months ago

Any more comments?

MaskRay commented 11 months ago

Sorry, but I am still very confused.

The description is rewritten in a tone that this is for optimization.

I commented that the int6 c.li tY, <symbol-value> and int12 addi tY, zero, <symbol-value> schemes do not seem useful due to the small ranges, and therefore, very few eligible symbols. Therefore, only the auipc+addi conversion seems useful in practice. I then asked about performing GOT optimization regardless of R_RISCV_RELAX (like AArch64), which seems to not have been extensively discussed.

Another topic is whether this scheme can help unresolved undefined weak symbols. riscv64-linux-gnu-gcc -O1 -mcmodel=medany -fno-pic generates an inappropriate PC-relative code sequence for auipc+lw __attribute__((weak)) extern int x; int foo() { return x; }. (R_RISCV_PCREL_HI20/R_RISCV_RELAX + R_RISCV_PCREL_LO12_I/R_RISCV_RELAX)

For -mcmodel=medany -fno-pic, clang 16 emits the same code sequence. clang 17 has changed to auipc+ld+lw (R_RISCV_GOT_HI20 + R_RISCV_PCREL_LO12_I/R_RISCV_RELAX), which fulfills the recommendation in #201. Does GCC benefit from this proposal to move away from PC-relative to GOT? How?

Do we know whether GNU ld relaxes R_RISCV_GOT_HI20 and if no, what the lone R_RISCV_RELAX is about?

This proposal suggests that R_RISCV_GOT_HI20 needs a pairing R_RISCV_RELAX, i.e. R_RISCV_GOT_HI20/R_RISCV_RELAX + R_RISCV_PCREL_LO12_I/R_RISCV_RELAX.

This can be argued as acceptable if we consider that GCC/Clang already emit R_RISCV_PCREL_HI20/R_RISCV_RELAX + R_RISCV_PCREL_LO12_I/R_RISCV_RELAX for PC-relative addressing, so doing the same for GOT addressing is ok. However, this can also be argued as an unnecessary size regression for GOT addressing, which is used heavily for -fPIC. In addition, my arguments that short ranges and whether we can perform GOT optimization regardless of R_RISCV_RELAX stand.

rui314 commented 11 months ago

I'm not sure if I've understood your questions correctly, but as I see it:

The optimization for small absolute addresses aligns well with our recommendation to always use GOT for weak symbols. I believe this alignment should be convincing enough to adopt this approach, especially since the optimization is straightforward and could perhaps be implemented with fewer than 20 lines of code in the linker. If we don't want to do that, we probably need a strong reason to not do so.
For greater safety, we should mandate two RELAX relocs for GOT_HI20 and PCREL_LO12_I. Existing object files might contain machine code that jumps into the middle of the auipc+ld instruction sequence. While that's unlikely, it is not illegal, so we cannot guarantee that relaxing existing object files will be completely safe without a hint that the object file was generated by a compiler/assembler that is aware of this GOT optimization.

kito-cheng commented 10 months ago

For c.li and addi part: it's more like completeness of this proposal IMO, and it may only benefit only small set of embedded program, but it's simple enough so I am happy to include that.

For linker optimization without R_RISCV_RELAX, I am also happy to see that, however I would like be come a separated PR to address that part, as I mention before it's would be new concept to RISC-V psABI, it's not necessary become blocker to this PR IMO.

For GCC implementation: Yeah, it's unfortunately we don't fix that on GCC land yet...we should find some people to fix that :(

rui314 commented 10 months ago

Are there any further comments or discussions? I believe the proposal is in good shape and ready to be merged.

kito-cheng commented 10 months ago

Last ping, will merge this this Friday (2023/12/8) if no further comments.

riscv-non-isa / riscv-elf-psabi-doc

Relax GOT indirection into PC-relative addressing #397