Open dmcardle opened 2 years ago
@GregAC is there any way you can try to force the address alignment to word just to see if it makes a difference? I know Ibex has lower performance when it does word writes across non-word-aligned boundaries (I believe it breaks them into multiple transactions), but I didn't think that would happen with instruction reads...
Still, it would be good to check the address alignments to see if they somehow make a difference. We can also help check the waves for any difference in behavior if you can let me know how to produce the various cases.
Oh, interesting! Aligning functions to 32 bits reduces the cycle counts. Instead of the "memcpy" test regressing from 190290 to 210280 cycles, it now takes 190300.
diff --git a/sw/device/lib/base/BUILD b/sw/device/lib/base/BUILD
index 187e9babf..572390e7e 100644
--- a/sw/device/lib/base/BUILD
+++ b/sw/device/lib/base/BUILD
@@ -84,7 +84,7 @@ cc_library(
# This library defines memcpy(), so we can't have LLVM rewriting memcpy
# into a call to itself.
- copts = ["-fno-builtin"],
+ copts = ["-fno-builtin", "-falign-functions=32"],
deps = [
":macros",
],
Here's the new disassembly diff: https://gist.github.com/dmcardle/5ebbf8da62a3aae1ddc7115e908d1e47
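For anyone else wanting to check function alignment without eyeballing the disassembly, here's a rough sketch. It assumes `nm`-style output lines (`address type name`); the `memcpy2` symbol is a made-up example, not from the real binary:

```python
# Sketch: flag text-section symbols whose entry address is not word-aligned.
# Input lines are assumed to look like riscv32 `nm` output: "20001ef8 T memcpy".
def misaligned_functions(nm_lines, align=4):
    bad = []
    for line in nm_lines:
        parts = line.split()
        if len(parts) != 3 or parts[1] not in ("T", "t"):
            continue  # only consider text (code) symbols
        addr, name = int(parts[0], 16), parts[2]
        if addr % align != 0:
            bad.append((name, hex(addr)))
    return bad

lines = [
    "20001ef8 T memcpy",   # word-aligned (% 4 == 0)
    "20001e4a T memcpy2",  # % 4 == 2 (hypothetical symbol name)
]
print(misaligned_functions(lines))  # → [('memcpy2', '0x20001e4a')]
```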
And for good measure, here's the results of running the perftest again. This is just 3e99db1e33bbb4988011a9bdb21ab661f666c7e9 plus the BUILD diff above.
% bazel test --test_output=streamed //sw/device/lib/base:memory_perftest_fpga_cw310
...
I00002 test_rom.c:137] Test ROM complete, jumping to flash!
I00000 ottf_main.c:80] Running sw/device/lib/base/memory_perftest.c
W00001 memory_perftest.c:235] Expected memcpy to run in 190290 cycles. Actually took 190300 cycles.
W00002 memory_perftest.c:235] Expected memcpy_zeroes to run in 190290 cycles. Actually took 190300 cycles.
W00003 memory_perftest.c:235] Expected memset to run in 140310 cycles. Actually took 130320 cycles.
W00004 memory_perftest.c:235] Expected memset_zeroes to run in 140310 cycles. Actually took 130320 cycles.
W00005 memory_perftest.c:235] Expected memcmp_pathological to run in 190340 cycles. Actually took 210300 cycles.
W00006 memory_perftest.c:235] Expected memcmp_zeroes to run in 190320 cycles. Actually took 210270 cycles.
W00007 memory_perftest.c:235] Expected memrcmp_pathological to run in 220320 cycles. Actually took 220330 cycles.
W00008 memory_perftest.c:235] Expected memrcmp_zeroes to run in 220280 cycles. Actually took 220290 cycles.
W00009 memory_perftest.c:235] Expected memchr_pathological to run in 11940 cycles. Actually took 14580 cycles.
W00010 memory_perftest.c:235] Expected memrchr_pathological to run in 44940 cycles. Actually took 47900 cycles.
I00011 ottf_main.c:56] Finished sw/device/lib/base/memory_perftest.c
I00012 status.c:34] FAIL!
huh.... this is just aligning the starting address of the functions to 32-bit right? Not the actual data buffers?
Following up: if I also add `-falign-functions=32` to 4a22ede583c6b79d696f61c80c2dc2b2c84e295e, I get the same "actually took $n cycles" output, implying that 3e99db1e33bbb4988011a9bdb21ab661f666c7e9 would not have broken the expectations if we had been forcing 32-bit function alignment all along.
I00000 ottf_main.c:80] Running sw/device/lib/base/memory_perftest.c
W00001 memory_perftest.c:235] Expected memcpy to run in 190290 cycles. Actually took 190300 cycles.
W00002 memory_perftest.c:235] Expected memcpy_zeroes to run in 190290 cycles. Actually took 190300 cycles.
W00003 memory_perftest.c:235] Expected memset to run in 140310 cycles. Actually took 130320 cycles.
W00004 memory_perftest.c:235] Expected memset_zeroes to run in 140310 cycles. Actually took 130320 cycles.
W00005 memory_perftest.c:235] Expected memcmp_pathological to run in 190340 cycles. Actually took 210300 cycles.
W00006 memory_perftest.c:235] Expected memcmp_zeroes to run in 190320 cycles. Actually took 210270 cycles.
W00007 memory_perftest.c:235] Expected memrcmp_pathological to run in 220320 cycles. Actually took 220330 cycles.
W00008 memory_perftest.c:235] Expected memrcmp_zeroes to run in 220280 cycles. Actually took 220290 cycles.
W00009 memory_perftest.c:235] Expected memchr_pathological to run in 11940 cycles. Actually took 14580 cycles.
W00010 memory_perftest.c:235] Expected memrchr_pathological to run in 44940 cycles. Actually took 47900 cycles.
I00011 ottf_main.c:56] Finished sw/device/lib/base/memory_perftest.c
I00012 status.c:34] FAIL!
> huh.... this is just aligning the starting address of the functions to 32-bit right? Not the actual data buffers?
Yeah, that's the intent at least. Peeking at the disassembly diffs, I think `buf1` and `buf2` were always 32-bit aligned.
ahh... it appears to be a result of this
@GregAC can elaborate better than I can. So my memory of how transactions are split was right; I just didn't realize it also applied to instruction fetches.
I guess when `memcpy` is not word-aligned, there must be some internal loop that isn't either. That probably causes more instruction fetches than would otherwise be needed. I think these kinds of issues will also get mitigated by the icache.
I hastily sent out PR #14438, but then I realized that `-falign-functions=32` says to align to 32 bytes, not bits. When I use `-falign-functions=4`, I no longer see the same cycle counts in 4a22ede583c6b79d696f61c80c2dc2b2c84e295e and 3e99db1e33bbb4988011a9bdb21ab661f666c7e9.
Based on the doc you shared about misaligned accesses, I think aligning to 4 or 2 bytes should be fine, so maybe this was a red herring.
I wonder if 4B / 2B is kind of hit or miss though... because in the end it's probably not the alignment of the entry into `memcpy` that matters, but where the particular loop is aligned, right?
So even if the `memcpy` entry is 2B-aligned, the loop might still be 4B-aligned, or vice versa.
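To make the entry-vs-loop distinction concrete, here's a quick arithmetic sketch (the 2-byte-prologue case is hypothetical; the real `memcpy` below turns out to have a 4-byte prologue):

```python
# The loop's alignment is (entry + prologue_size) % 4, so the entry
# alignment alone doesn't determine it -- the prologue size matters too.
def loop_alignment(entry, prologue_size):
    return (entry + prologue_size) % 4

# 4-byte prologue (the actual memcpy case): the loop inherits the
# entry's alignment.
assert loop_alignment(0x20001ef8, 4) == 0  # aligned entry -> aligned loop
assert loop_alignment(0x20001e4a, 4) == 2  # misaligned entry -> misaligned loop

# 2-byte prologue (hypothetical): a half-word-aligned entry would
# actually yield a word-aligned loop.
assert loop_alignment(0x20001e4a, 2) == 0
```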
Ah, I see your point. The loop starting with `memcpy`'s `lb` instruction is not colocated with the `memcpy` symbol because there are four bytes of setup instructions. I think it's time to organize these results:
commit | arg | &memcpy | addr of lb instr | perftest cycle count
---|---|---|---|---
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | None | 20001ef8 | 20001efc (% 4 == 0) | 190290
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-falign-functions=32` | 20001f20 | 20001f24 | 190300
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-falign-functions=16` | 20001f10 | 20001f14 | 190300
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-falign-functions=4` | 20001ef8 | 20001efc | 190290
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-O2` (implicitly replacing `-Os`) | 20001ef8 | 20001efc | 190290
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | None | 20001e4a | 20001e4e (% 4 == 2) | 210280
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-falign-functions=32` | 20001e80 | 20001e84 | 190300
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-falign-functions=16` | 20001e60 | 20001e64 | 190300
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-falign-functions=4` | 20001e4c | 20001e50 | 180310
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-O2` (implicitly replacing `-Os`) | 20001e4a | 20001e4e (% 4 == 2) | 210280
It does look like 3e99db1e33bbb4988011a9bdb21ab661f666c7e9 caused `memcpy`'s loop to move from a word-aligned address to a non-word-aligned address. Forcing function alignment to 32 or 16 bytes happens to induce the loop to be word-aligned, but that only works because there are 4 bytes of setup instructions.
Also, the loop alignment is not the only factor affecting performance — with 4-byte function alignment, the loop is nicely aligned, but we have differing cycle counts of 190290 and 180310 cycles. Any thoughts on what this "dark matter" affecting performance might be?
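As a sanity check, the measurements can be grouped by the `lb` address's word alignment (data copied from the rows above). Every word-aligned loop lands in the fast cluster, every misaligned one is the slow 210280, and the 180310 run is the unexplained outlier within the aligned group:

```python
# Rows: (lb_addr, cycles) from the measurements above.
rows = [
    (0x20001efc, 190290), (0x20001f24, 190300), (0x20001f14, 190300),
    (0x20001efc, 190290), (0x20001efc, 190290),
    (0x20001e4e, 210280), (0x20001e84, 190300), (0x20001e64, 190300),
    (0x20001e50, 180310), (0x20001e4e, 210280),
]
aligned   = [c for a, c in rows if a % 4 == 0]
unaligned = [c for a, c in rows if a % 4 != 0]

# Word-aligned loops are always faster than misaligned ones here.
assert max(aligned) <= 190300 and min(unaligned) == 210280
print(sorted(set(aligned)), sorted(set(unaligned)))
# → [180310, 190290, 190300] [210280]
```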
wow, this all makes sense except for the part where it gets even... better.
The only thing I can think of is that in addition to where the `lb` sits, the other elements of the loop are probably also important. Is the 180310 case somehow... more favorably aligned? Maybe it'd be helpful to see the disassembly.
Yeah, that improved cycle count threw me for a loop. I managed to reproduce it by inserting `__asm__(".align 2")`, without passing any `-falign-functions` flag. This implies to me that the loop alignment is responsible for the improved performance, but I still don't understand why it's so much better.
20001e4a <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e4a: /----- ca11 beqz a2,20001e5e <memcpy+0x14>
20001e4c: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:24
dest8[i] = src8[i];
20001e4e: | /-> 00058703 lb a4,0(a1)
20001e52: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e56: | | 167d addi a2,a2,-1
20001e58: | | 0685 addi a3,a3,1
20001e5a: | | 0585 addi a1,a1,1
20001e5c: | \-- fa6d bnez a2,20001e4e <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:26
}
return dest;
20001e5e: \----> 8082 ret
With `-falign-functions=16`. Perftest cycle count is 190300.
20001e60 <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e60: /----- ca11 beqz a2,20001e74 <memcpy+0x14>
20001e62: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:24
dest8[i] = src8[i];
20001e64: | /-> 00058703 lb a4,0(a1)
20001e68: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e6c: | | 167d addi a2,a2,-1
20001e6e: | | 0685 addi a3,a3,1
20001e70: | | 0585 addi a1,a1,1
20001e72: | \-- fa6d bnez a2,20001e64 <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:26
}
return dest;
20001e74: \----> 8082 ret
20001e76: 0000 unimp
20001e78: 0000 unimp
20001e7a: 0000 unimp
20001e7c: 0000 unimp
20001e7e: 0000 unimp
With `-falign-functions=4`. Perftest cycle count is 180310.
20001e4c <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e4c: /----- ca11 beqz a2,20001e60 <memcpy+0x14>
20001e4e: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:24
dest8[i] = src8[i];
20001e50: | /-> 00058703 lb a4,0(a1)
20001e54: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e58: | | 167d addi a2,a2,-1
20001e5a: | | 0685 addi a3,a3,1
20001e5c: | | 0585 addi a1,a1,1
20001e5e: | \-- fa6d bnez a2,20001e50 <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:26
}
return dest;
20001e60: \----> 8082 ret
20001e62: 0000 unimp
With `__asm__(".align 2")` before the loop (aligning to 4 adds a bunch more NOP instructions). Perftest cycle count is 180310.
20001e4c <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e4c: /----- ca11 beqz a2,20001e60 <memcpy+0x14>
20001e4e: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:25
__asm__(".align 2");
dest8[i] = src8[i];
20001e50: | /-> 00058703 lb a4,0(a1)
20001e54: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e58: | | 167d addi a2,a2,-1
20001e5a: | | 0685 addi a3,a3,1
20001e5c: | | 0585 addi a1,a1,1
20001e5e: | \-- fa6d bnez a2,20001e50 <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:27
}
return dest;
20001e60: \----> 8082 ret
I am honestly at a loss. I can't see any difference in terms of word alignment between the faster and middle cases. My best guess at this point is that these different addresses somehow impact how the instruction prefetch buffer fills up. I don't know enough details about how that works though, so maybe the best we can do at this point is to simulate these and look at the waves.
Just so I understand, the cycle counts here are measured literally just before and after the `memcpy` call, I assume? So it shouldn't be including too much other stuff.
> I am honestly at a loss. I can't see any difference in terms of word alignment between the faster and middle cases. My best guess at this point is that these different addresses somehow impact how the instruction prefetch buffer fills up. I don't know enough details about how that works though, so maybe the best we can do at this point is to simulate these and look at the waves.
I haven't done this before. Will looking at the waves give us some insight into instruction fetching?
> Just so I understand, the cycle counts here are literally just before and after the `memcpy`, I assume? So it shouldn't be including too much other stuff.
The code being timed is literally just the `test_memcpy` function and the call it makes to the real `memcpy()`.
I haven't gone over all the details here, but note that you can see a performance penalty when executing 32-bit instructions that aren't 32-bit-aligned, in particular when jumping/branching to them. The issue is that an external memory read for an instruction fetch is 32-bit aligned, so if your target crosses a 32-bit boundary you need two memory reads before you can execute the instruction.
The ICache will complicate things here. If you jump to something that's not cached then you'll definitely need two fetches before you have your instruction. If it is cached, the ICache could grab the bytes on both sides of the boundary provided they're in the same cache line (though I'm not familiar enough with the ICache internals to know if it will). We have 64-bit lines, so 50% of unaligned cases will cross a cache line.
We can only read one cache line at once, so even if both sides of the boundary are cached we still need two cycles within the cache to read the instruction. When we're executing straight-line code (no branches/jumps), I believe the ICache's prefetch buffer will take care of things, so when we cross a cache line we can grab half out of the prefetch buffer and half out of a cache line read. Though, as above, I'm not familiar enough with the ICache internals to be certain of this.
Happy to take a look at some waves if you can point me to something specific to run and take a look at.
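A toy model of the fetch behavior described above may help (this is my reading of the explanation, not a statement about the actual RTL): a 32-bit instruction at a non-word-aligned address needs two 32-bit-aligned bus reads, and with 64-bit cache lines exactly half of those unaligned fetches also straddle a line boundary:

```python
# Toy model: count the 32-bit-aligned bus reads needed to fetch a 4-byte
# instruction at `addr`, and whether the fetch straddles an 8-byte cache line.
def bus_reads(addr, size=4, word=4):
    first, last = addr // word, (addr + size - 1) // word
    return last - first + 1

def crosses_line(addr, size=4, line=8):
    return addr // line != (addr + size - 1) // line

assert bus_reads(0x20001e4c) == 1  # word-aligned: one read suffices
assert bus_reads(0x20001e4e) == 2  # addr % 4 == 2: two reads needed

# Of the unaligned word offsets (addr % 4 == 2), exactly half cross an
# 8-byte line boundary -- matching the "50%" figure above.
unaligned = [a for a in range(0, 64, 2) if a % 4 == 2]
assert sum(crosses_line(a) for a in unaligned) * 2 == len(unaligned)
```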
There's also some performance concerns seen on coremark runs, so this is potentially still a live issue requiring further investigation for M2.5.
After some investigation no hardware performance issues found in coremark: https://github.com/lowRISC/opentitan/issues/17370
Moving to software since there were no issues found on the hardware side
Assigning to @alphan and @luismarques to decide who to forward assign to.
This appears to be a SW task and so should come out of M5. @moidx / @luismarques Marking to discuss in triage meeting / deprioritize.
The only way to make this test not brittle is to rewrite it in assembly. While we can play tricks with the compiler to minimize the likelihood of the cycle count being changed by other codebase changes, there's no way to prevent it completely. I can take on the rewrite task, though it doesn't seem very high priority to me.
I looked into this more carefully. I had initially assumed this test was meant to check the performance of memory operations (after all, it's called memory_perftest), but maybe it's actually meant to check the performance of our C functions that deal with memory operations. If that's the case, of course those functions can't be rewritten in assembly only for this test. I see 3 options:
Otherwise: rewrite the `memcpy`, etc. functions in assembly. That would provide some advantages outside the test itself, but the risk of doing that now seems quite high.

This isn't related to rv_core_ibex except that the Ibex executes the functions whose performance we try to measure. I don't think it should be in M5 but rather in a software-focused milestone.

How to "fix" the issue, I don't know. Rewriting the whole test in assembly doesn't sound like a good use of time. Maybe it would be possible to define an expected cycle count and add some interval to absorb variations due to binary changes, instead of expecting exact cycle counts?
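The interval idea could look something like this. This is only a sketch, not the actual memory_perftest harness (which is C); the `check_cycles` name and the 5% tolerance are made up for illustration:

```python
# Sketch of a tolerance-based expectation check instead of exact cycle counts.
def check_cycles(name, expected, actual, tolerance=0.05):
    """Pass if `actual` is within `tolerance` (fractional) of `expected`."""
    lo, hi = expected * (1 - tolerance), expected * (1 + tolerance)
    ok = lo <= actual <= hi
    print(f"{'OK' if ok else 'FAIL'}: {name} expected ~{expected}, got {actual}")
    return ok

assert check_cycles("memcpy", 190290, 190300)      # tiny drift: passes
assert not check_cycles("memcpy", 190290, 210280)  # ~10% regression: fails
```

The trade-off is picking a tolerance wide enough to absorb alignment-induced drift (the 190290 vs 190300 noise) but tight enough to still catch real regressions like the 210280 jump.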
Assigning to SiVal for lack of a more appropriate SW milestone. Feel free to reassign (though we don't think it's relevant for Earlgrey-PROD TO).
cc @cfrantz @alphan
Something in PR https://github.com/lowRISC/opentitan/pull/14044 invalidated `memory_perftest`'s hardcoded cycle count expectations. I updated the expectations in #14399, but left the test "flaky" and "manual" because we don't yet understand why it happened.

Disassembly diff: https://gist.github.com/dmcardle/4420de24287f77edf4642b5428d04878. This compares `//sw/device/lib/base:memory_perftest_prog_fpga_cw310_dis` from 4a22ede583c6b79d696f61c80c2dc2b2c84e295e to 3e99db1e33bbb4988011a9bdb21ab661f666c7e9.

I'm arbitrarily picking the "memcpy" test, which changed from 190290 to 210280 cycles. In theory, we only need to pay attention to code that is measured by the perftest, i.e. code that runs between reads of the cycle counter.

That means we'll start with `test_memcpy()`. Its code has not changed, except that it jumps to a slightly different address for `memcpy()`.

Digging into `memcpy()`, the instructions and relative jumps are identical, but the entire function has been shifted to lower addresses. (This appears to have happened because `math` code like `udiv64_slow` used to come immediately before `memcpy`, but now it comes sometime after.)

That's about as far as I can go with the disassembly diff, since `memcpy` is a leaf node. I can't really explain how moving `test_memcpy` and `memcpy` closer to the test's call site could make this test run slower. If anything, I'd expect more hits on the instruction cache.

@cfrantz Did you get further in your investigation?