riscv / riscv-isa-manual

RISC-V Instruction Set Manual
https://riscv.org/
Creative Commons Attribution 4.0 International

Performance of misaligned loads #1611

Closed: newpavlov closed this issue 2 months ago

newpavlov commented 2 months ago

A bit of context: I was working on scalar crypto extension support for the RustCrypto project (see https://github.com/RustCrypto/hashes/pull/614) and was quite disappointed with the generated code.

Here is a simple piece of code which performs an unaligned load of a 64-bit integer: https://rust.godbolt.org/z/bM5rG6zds It compiles down to 22 interdependent instructions (i.e. there is not much opportunity for the CPU to execute them in parallel) and creates a fair bit of register pressure! It becomes even worse when we try to load big-endian integers (without the Zbkb extension): https://rust.godbolt.org/z/TndWTK3zh (an unfortunately common occurrence in cryptographic code)
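For concreteness, a minimal sketch of the kind of code in question (the exact snippet behind the godbolt links may differ):

// Load a u64 from a byte slice that is not known to be 8-byte aligned.
// Without hardware support for misaligned loads, the compiler lowers this
// into byte loads plus shifts and ORs; the big-endian variant
// (from_be_bytes) additionally needs a byte swap unless rev8 is available.
fn load_u64_le(bytes: &[u8]) -> u64 {
    u64::from_le_bytes(bytes[..8].try_into().unwrap())
}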

The LD instruction theoretically allows unaligned loads, but the reference is disappointingly vague about it. Behavior can range from full hardware support, through extremely slow trap-and-emulate handling (IIUC slower than executing the 22 instructions), to a fatal trap, so portable code simply cannot rely on it.

There is the Zicclsm extension, but the profiles spec is again quite vague:

Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.

This is probably why enabling Zicclsm has no influence on the snippet's codegen.

Finally, my questions: is it indeed true that the 22-instruction sequence is "the way" to perform potentially misaligned 64-bit loads? And why did RISC-V not introduce explicit instructions for misaligned loads/stores in one of its extensions, similar to the MOVUPS instruction on x86?

I know that it's far too late to change things, but, personally, I would have preferred a stricter spec in this regard. For example, it could have mandated an unconditional fatal trap for misaligned accesses by the base load/store instructions and introduced explicit unaligned load/store instructions in a separate extension.

UPD: In the Reddit discussion it was mentioned in the comments that MIPS had a patent on unaligned load and store instructions, but it expired in 2019.

aswaterman commented 2 months ago

You can do much better than that 22-instruction sequence, both in terms of instruction count and register pressure: https://godbolt.org/z/jdPssx8WM

        andi    a1, a0, -8      # a1 = address of the aligned doubleword holding the first byte
        ld      a1, 0(a1)
        addi    a2, a0, 7
        andi    a2, a2, -8      # a2 = address of the aligned doubleword holding the last byte
        ld      a2, 0(a2)
        slli    a0, a0, 3       # bit offset of the data within the doubleword (srl/sll use the low 6 bits)
        srl     a1, a1, a0      # low part
        not     a0, a0
        slli    a2, a2, 1
        sll     a0, a2, a0      # high part: zero when the address was already aligned
        or      a0, a0, a1

The spec's vagueness is deliberate, as it offers implementation flexibility.

As for big-endian loads, assume all general-purpose implementations will offer the Zbb extension. Only educational implementations, or highly specialized ones for which the rev8 instruction is known not to be useful, will omit the rev8 instruction. So, how best to generate code for those implementations is something of an academic exercise.

As you point out, there isn't really anything actionable here, so I'm closing the issue.

newpavlov commented 2 months ago

You can do much better than that 22-instruction sequence, both in terms of instruction count and register pressure

Strictly speaking, this snippet contains UB since you read bits outside of the allocated object. The only proper way to use this approach is through inline assembly. And it's still 11 instructions and one additional register.
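For illustration, a minimal sketch of what such an inline-assembly wrapper could look like in Rust (hypothetical code, not taken from the thread; the caller must guarantee that all 8 bytes at ptr are readable, which is what keeps the two aligned loads from faulting):

use core::arch::asm;

// Misaligned 64-bit load built from two aligned loads, as in the sequence above.
// Safety: the 8 bytes starting at `ptr` must be readable.
#[cfg(target_arch = "riscv64")]
unsafe fn load_u64_unaligned(ptr: *const u8) -> u64 {
    let value: u64;
    asm!(
        "andi  {lo}, {addr}, -8",   // aligned doubleword holding the first byte
        "ld    {lo}, 0({lo})",
        "addi  {hi}, {addr}, 7",
        "andi  {hi}, {hi}, -8",     // aligned doubleword holding the last byte
        "ld    {hi}, 0({hi})",
        "slli  {addr}, {addr}, 3",  // bit offset within the doubleword
        "srl   {lo}, {lo}, {addr}",
        "not   {addr}, {addr}",
        "slli  {hi}, {hi}, 1",
        "sll   {hi}, {hi}, {addr}", // zero when the address was already aligned
        "or    {lo}, {hi}, {lo}",
        addr = inout(reg) ptr => _,
        lo = out(reg) value,
        hi = out(reg) _,
        options(readonly, nostack),
    );
    value
}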

As for big-endian loads, assume all general-purpose implementations will offer the Zbb extension.

Can I assume the same for Zbkb (IIUC it's a subset of Zbb)? Or alternatively, is it reasonable to assume that Zbkb is always available if Zk/Zkn is present?

there isn't really anything actionable here, so I'm closing the issue.

My actionable request is to introduce an extension with explicit unaligned load/store operations with guaranteed "reasonable" performance. Zicclsm looks borderline useless with the current wording and, as we can see, it currently does not influence the compiler's codegen.

topperc commented 2 months ago

Can I assume the same for Zbkb (IIUC it's a subset of Zbb)? Or alternatively, is it reasonable to assume that Zbkb is always available if Zk/Zkn is present?

Zbkb has pack instructions that are not in Zbb, which is unfortunate, since they would improve some unaligned access sequences.

My actionable request is to introduce an extension with explicit unaligned load/store operations with guaranteed "reasonable" performance. Zicclsm looks borderline useless with the current wording and, as we can see, currently it does not influence compiler's codegen.

Compiler codegen can be controlled with -mno-scalar-strict-align on clang 18. Not sure if it made it to a gcc release yet. -mno-strict-align might work on older gcc.

aswaterman commented 2 months ago

This snippet contains UB since you read bits outside of the allocated object. The only proper way to use this approach is to use inline assembly.

Well, obviously... the C code was for illustrative purposes.

Can I assume the same for Zbkb (IIUC it's a subset of Zbb)?

Not necessarily, as it isn't a proper subset, but either way, it's probably best to assume rev8 will be there.

My actionable request is to introduce an extension with explicit unaligned load/store operations with guaranteed "reasonable" performance.

The base ISA spec already says, in so many words, that there is no appetite for that approach. https://github.com/riscv/riscv-isa-manual/blob/2eac83e3b23bcd9d1b30336b2c434b4b48cb0c0c/src/rv32.adoc#L740

Your best bet is to follow @topperc's recommendation, and/or follow in the footsteps of the glibc folks, who employ a runtime check to determine whether misaligned loads and stores are fast: https://sourceware.org/pipermail/libc-alpha/2023-February/145343.html
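For reference, a rough sketch of the timing idea (illustrative only, not the glibc code; the names and threshold are made up). Note that the probe only exercises real misaligned ld instructions if it is itself built with misaligned scalar accesses enabled (e.g. the clang flag mentioned above) or written in assembly; otherwise the compiler expands the loads into byte accesses:

use std::time::Instant;

// Time a batch of misaligned 64-bit loads against aligned ones, once, at startup.
fn misaligned_loads_are_fast() -> bool {
    let buf = vec![0u8; 64 * 1024];
    let time_loads = |offset: usize| {
        let start = Instant::now();
        let mut acc = 0u64;
        for i in 0..100_000usize {
            let p = unsafe { buf.as_ptr().add(offset + (i % 4096) * 8) };
            acc = acc.wrapping_add(unsafe { p.cast::<u64>().read_unaligned() });
        }
        std::hint::black_box(acc); // keep the loop from being optimized away
        start.elapsed()
    };
    // "Fast" here means not dramatically slower than the aligned baseline.
    time_loads(1) < time_loads(0) * 4
}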

newpavlov commented 2 months ago

Sigh... I guess I have no choice but to pile a bunch of hacks to work around this...

The base ISA spec already says, in so many words, that there is no appetite for that approach.

/start-of-rant

From the software developer's perspective, the sheer number of "may"s in those many words is incredibly annoying (e.g. see this comment). The vagueness and lack of exact guarantees may be really nice for hardware developers, but it makes it really hard to write portable, performant code. The glibc stuff looks like yet another hack to work around the ISA's vagueness, and it's not applicable to the hot loops common in cryptographic code.

In my opinion, RISC-V has gotten itself into a weird middle ground, which is the worst of both worlds. Software developers can neither rely on reasonable performance of misaligned loads/stores, as on x86, nor use explicit misaligned load/store instructions (again with reasonable performance), as on MIPS (or SSE/AVX on x86), in situations where the need for them naturally arises.

/end-of-rant

camel-cdr commented 2 months ago

You can get down to a 9-instruction sequence using just RV64I:

andi    a1, a0, -8 # round down
slli    a0, a0, 3  # offset=addr*8.
ld      a2, 0(a1)  # load left
ld      a1, 8(a1)  # load right
srl     a2, a2, a0 # left>>=(offset%64)
not     a0, a0
slli    a1, a1, 1  # right<<=1
sll     a0, a1, a0 # right<<=(~offset%64)
or      a0, a0, a2 # left|right

If you need to load adjacent misaligned data, as is commonly the case, then you can get down to just three additional instructions per load on average: sll+srl+or. The shift amount is always fixed, and you can just reuse the value from the previous load: https://godbolt.org/z/WrxKf1nG8 (clang inserts a redundant slli for some reason, instead of incrementing a1)
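A rough Rust rendering of that streaming pattern (illustrative only, not the code behind the godbolt link; like the earlier C snippet it reads outside the referenced object at the language level, and the trailing aligned load can run one doubleword past the data when the pointer is aligned, which is exactly the caveat raised in the next comment):

// Read out.len() consecutive misaligned u64s starting at `ptr`, using one
// aligned load plus shift/or work per element.
unsafe fn load_stream(ptr: *const u8, out: &mut [u64]) {
    let shift = ((ptr as usize % 8) * 8) as u32;   // bit offset within a doubleword
    let mut aligned = (ptr as usize & !7) as *const u64;
    let mut prev = aligned.read();                 // aligned doubleword holding the first byte
    for slot in out.iter_mut() {
        aligned = aligned.add(1);
        let next = aligned.read();                 // one aligned load per element
        // Low bits come from prev, high bits from next; the extra <<1 keeps the
        // shift amounts in range so the aligned (shift == 0) case stays correct.
        *slot = (prev >> shift) | ((next << 1) << (63 - shift));
        prev = next;
    }
}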

Additional instructions for unaligned loads/stores would, however, be something to discuss in the scalar efficiency SIG, as its list already includes suggestions for three-source-operand integer instructions: https://docs.google.com/spreadsheets/d/1dQYU7QQ-SnIoXp9vVvVjS6Jz9vGWhwmsdbEOF3JBwUg

I was thinking about something like srl2r rd, rs1, rs2 with rd=(rd|rs1<<64)>>rs2. This would get you down to one additional instruction per load for a sequence of unaligned loads, and for a single unaligned load it's a total of 5 instructions:

andi    a1, a0, -8 # round down
slli    a0, a0, 3  # offset = addr*8
ld      a2, 0(a1)  # load left
ld      a1, 8(a1)  # load right
srl2r   a2, a1, a0 # a2 = (a2 | a1<<64) >> offset

newpavlov commented 2 months ago

@camel-cdr The main problem with this hack is that it has a nasty side effect: imagine we have an aligned input pointer which points to the very last 64 bits in a page; now your second ld a1, 8(a1) tries to load data from the next page, which may cause a page fault!

Either way, all such hacks not only use more registers, but are also blatantly slower on properly aligned data (which is quite common in practice), because the code requires additional loads and ALU work.

At this point, I think I will simply use ld to do the loads, together with a requirement that Zicclsm be enabled, and I will explicitly blame the ISA spec for any extremely poor performance on misaligned data.
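A sketch of what that could look like (assuming the build also tells the compiler that misaligned scalar accesses are supported, e.g. via the clang flag mentioned by @topperc or an equivalent rustc target setting; otherwise the compiler still expands this into byte loads):

// Sketch only: emit a plain, potentially misaligned 64-bit load and rely on
// Zicclsm for correctness; performance is whatever the implementation provides.
unsafe fn load_u64_le(ptr: *const u8) -> u64 {
    u64::from_le_bytes(ptr.cast::<[u8; 8]>().read())
}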