Open dmcardle opened 2 years ago
@GregAC is there any way you can try to force the address alignment to word just to see if it makes a difference? I know Ibex has lower performance when it does word writes across non-word-aligned boundaries (I believe it breaks them into multiple transactions), but I didn't think that would happen with instruction reads...
Still, it would be good to check the address alignments to see if they somehow make a difference. We can also help check the waves for any difference in behavior if you can let me know how to produce the various cases.
Oh, interesting! Aligning functions to 32 bits reduces the cycle counts. Instead of the "memcpy" test regressing from 190290 to 210280 cycles, it now takes 190300.
diff --git a/sw/device/lib/base/BUILD b/sw/device/lib/base/BUILD
index 187e9babf..572390e7e 100644
--- a/sw/device/lib/base/BUILD
+++ b/sw/device/lib/base/BUILD
@@ -84,7 +84,7 @@ cc_library(
# This library defines memcpy(), so we can't have LLVM rewriting memcpy
# into a call to itself.
- copts = ["-fno-builtin"],
+ copts = ["-fno-builtin", "-falign-functions=32"],
deps = [
":macros",
],
Here's the new disassembly diff: https://gist.github.com/dmcardle/5ebbf8da62a3aae1ddc7115e908d1e47
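For anyone else wanting to check function alignment without eyeballing the disassembly, here's a rough sketch. It assumes `nm`-style output lines (`address type name`); the `memcpy2` symbol is a made-up example, not from the real binary:

```python
# Sketch: flag text-section symbols whose entry address is not word-aligned.
# Input lines are assumed to look like riscv32 `nm` output: "20001ef8 T memcpy".
def misaligned_functions(nm_lines, align=4):
    bad = []
    for line in nm_lines:
        parts = line.split()
        if len(parts) != 3 or parts[1] not in ("T", "t"):
            continue  # only consider text (code) symbols
        addr, name = int(parts[0], 16), parts[2]
        if addr % align != 0:
            bad.append((name, hex(addr)))
    return bad

lines = [
    "20001ef8 T memcpy",   # word-aligned (% 4 == 0)
    "20001e4a T memcpy2",  # % 4 == 2 (hypothetical symbol name)
]
print(misaligned_functions(lines))  # → [('memcpy2', '0x20001e4a')]
```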
And for good measure, here's the results of running the perftest again. This is just 3e99db1e33bbb4988011a9bdb21ab661f666c7e9 plus the BUILD diff above.
% bazel test --test_output=streamed //sw/device/lib/base:memory_perftest_fpga_cw310
...
I00002 test_rom.c:137] Test ROM complete, jumping to flash!
I00000 ottf_main.c:80] Running sw/device/lib/base/memory_perftest.c
W00001 memory_perftest.c:235] Expected memcpy to run in 190290 cycles. Actually took 190300 cycles.
W00002 memory_perftest.c:235] Expected memcpy_zeroes to run in 190290 cycles. Actually took 190300 cycles.
W00003 memory_perftest.c:235] Expected memset to run in 140310 cycles. Actually took 130320 cycles.
W00004 memory_perftest.c:235] Expected memset_zeroes to run in 140310 cycles. Actually took 130320 cycles.
W00005 memory_perftest.c:235] Expected memcmp_pathological to run in 190340 cycles. Actually took 210300 cycles.
W00006 memory_perftest.c:235] Expected memcmp_zeroes to run in 190320 cycles. Actually took 210270 cycles.
W00007 memory_perftest.c:235] Expected memrcmp_pathological to run in 220320 cycles. Actually took 220330 cycles.
W00008 memory_perftest.c:235] Expected memrcmp_zeroes to run in 220280 cycles. Actually took 220290 cycles.
W00009 memory_perftest.c:235] Expected memchr_pathological to run in 11940 cycles. Actually took 14580 cycles.
W00010 memory_perftest.c:235] Expected memrchr_pathological to run in 44940 cycles. Actually took 47900 cycles.
I00011 ottf_main.c:56] Finished sw/device/lib/base/memory_perftest.c
I00012 status.c:34] FAIL!
huh.... this is just aligning the starting address of the functions to 32-bit right? Not the actual data buffers?
Following up: if I also add `-falign-functions=32` to 4a22ede583c6b79d696f61c80c2dc2b2c84e295e, I get the same "actually took $n cycles" output, implying that 3e99db1e33bbb4988011a9bdb21ab661f666c7e9 would not have broken the expectations if we had been forcing 32-bit function alignment all along.
I00000 ottf_main.c:80] Running sw/device/lib/base/memory_perftest.c
W00001 memory_perftest.c:235] Expected memcpy to run in 190290 cycles. Actually took 190300 cycles.
W00002 memory_perftest.c:235] Expected memcpy_zeroes to run in 190290 cycles. Actually took 190300 cycles.
W00003 memory_perftest.c:235] Expected memset to run in 140310 cycles. Actually took 130320 cycles.
W00004 memory_perftest.c:235] Expected memset_zeroes to run in 140310 cycles. Actually took 130320 cycles.
W00005 memory_perftest.c:235] Expected memcmp_pathological to run in 190340 cycles. Actually took 210300 cycles.
W00006 memory_perftest.c:235] Expected memcmp_zeroes to run in 190320 cycles. Actually took 210270 cycles.
W00007 memory_perftest.c:235] Expected memrcmp_pathological to run in 220320 cycles. Actually took 220330 cycles.
W00008 memory_perftest.c:235] Expected memrcmp_zeroes to run in 220280 cycles. Actually took 220290 cycles.
W00009 memory_perftest.c:235] Expected memchr_pathological to run in 11940 cycles. Actually took 14580 cycles.
W00010 memory_perftest.c:235] Expected memrchr_pathological to run in 44940 cycles. Actually took 47900 cycles.
I00011 ottf_main.c:56] Finished sw/device/lib/base/memory_perftest.c
I00012 status.c:34] FAIL!
> huh.... this is just aligning the starting address of the functions to 32-bit right? Not the actual data buffers?
Yeah, that's the intent at least. Peeking at the disassembly diffs, I think `buf1` and `buf2` were always 32-bit aligned.
ahh... it appears to be a result of this
@GregAC can elaborate better than I can. So my memory of how transactions are split was right; I just didn't realize it also applied to instruction fetches.
I guess when `memcpy` is not word-aligned, there must be some internal loop that isn't either. That probably causes more instruction fetches than would otherwise be needed. I think these kinds of issues will also get mitigated by the icache.
I hastily sent out PR #14438, but then I realized that `-falign-functions=32` says to align to 32 bytes, not bits. When I use `-falign-functions=4`, I no longer see the same cycle counts in 4a22ede583c6b79d696f61c80c2dc2b2c84e295e and 3e99db1e33bbb4988011a9bdb21ab661f666c7e9.
Based on the doc you shared about misaligned accesses, I think aligning to 4 or 2 bytes should be fine, so maybe this was a red herring.
I wonder if 4B / 2B is kind of hit or miss though... because in the end it's probably not the alignment of the entry into `memcpy` that matters, but where the particular loop is aligned, right?
So even if the `memcpy` entry is 2B-aligned, the loop might still be 4B-aligned, or vice versa.
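To make the entry-vs-loop distinction concrete, here's a quick arithmetic sketch (the 2-byte-prologue case is hypothetical; the real `memcpy` below turns out to have a 4-byte prologue):

```python
# The loop's alignment is (entry + prologue_size) % 4, so the entry
# alignment alone doesn't determine it -- the prologue size matters too.
def loop_alignment(entry, prologue_size):
    return (entry + prologue_size) % 4

# 4-byte prologue (the actual memcpy case): the loop inherits the
# entry's alignment.
assert loop_alignment(0x20001ef8, 4) == 0  # aligned entry -> aligned loop
assert loop_alignment(0x20001e4a, 4) == 2  # misaligned entry -> misaligned loop

# 2-byte prologue (hypothetical): a half-word-aligned entry would
# actually yield a word-aligned loop.
assert loop_alignment(0x20001e4a, 2) == 0
```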
Ah, I see your point. The loop starting with `memcpy`'s `lb` instruction is not colocated with the `memcpy` symbol because there are four bytes of setup instructions. I think it's time to organize these results:
commit | arg | &memcpy | addr of lb instr | perftest cycle count
---|---|---|---|---
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | None | 20001ef8 | 20001efc (% 4 == 0) | 190290
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-falign-functions=32` | 20001f20 | 20001f24 | 190300
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-falign-functions=16` | 20001f10 | 20001f14 | 190300
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-falign-functions=4` | 20001ef8 | 20001efc | 190290
4a22ede583c6b79d696f61c80c2dc2b2c84e295e | `-O2` (implicitly replacing `-Os`) | 20001ef8 | 20001efc | 190290
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | None | 20001e4a | 20001e4e (% 4 == 2) | 210280
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-falign-functions=32` | 20001e80 | 20001e84 | 190300
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-falign-functions=16` | 20001e60 | 20001e64 | 190300
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-falign-functions=4` | 20001e4c | 20001e50 | 180310
3e99db1e33bbb4988011a9bdb21ab661f666c7e9 | `-O2` (implicitly replacing `-Os`) | 20001e4a | 20001e4e (% 4 == 2) | 210280
It does look like 3e99db1e33bbb4988011a9bdb21ab661f666c7e9 caused `memcpy`'s loop to move from a word-aligned address to a non-word-aligned address. Forcing function alignment to 32 or 16 bytes happens to induce the loop to be word-aligned, but that only works because there are 4 bytes of setup instructions.
Also, the loop alignment is not the only factor affecting performance — with 4-byte function alignment, the loop is nicely aligned, but we have differing cycle counts of 190290 and 180310 cycles. Any thoughts on what this "dark matter" affecting performance might be?
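As a sanity check, the measurements can be grouped by the `lb` address's word alignment (data copied from the rows above). Every word-aligned loop lands in the fast cluster, every misaligned one is the slow 210280, and the 180310 run is the unexplained outlier within the aligned group:

```python
# Rows: (lb_addr, cycles) from the measurements above.
rows = [
    (0x20001efc, 190290), (0x20001f24, 190300), (0x20001f14, 190300),
    (0x20001efc, 190290), (0x20001efc, 190290),
    (0x20001e4e, 210280), (0x20001e84, 190300), (0x20001e64, 190300),
    (0x20001e50, 180310), (0x20001e4e, 210280),
]
aligned   = [c for a, c in rows if a % 4 == 0]
unaligned = [c for a, c in rows if a % 4 != 0]

# Word-aligned loops are always faster than misaligned ones here.
assert max(aligned) <= 190300 and min(unaligned) == 210280
print(sorted(set(aligned)), sorted(set(unaligned)))
# → [180310, 190290, 190300] [210280]
```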
wow, this all makes sense except for the part where it gets even... better.
The only thing I can think of is that in addition to where the `lb` sits, the other elements of the loop are probably also important. Is the 180310 case somehow... more favorably aligned? Maybe it'd be helpful to see the disassembly.
Yeah, that improved cycle count threw me for a loop. I managed to reproduce it by inserting `__asm__(".align 2")`, without passing any `-falign-functions` flag. This implies to me that the loop alignment is responsible for the improved performance, but I still don't understand why it's so much better.
20001e4a <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e4a: /----- ca11 beqz a2,20001e5e <memcpy+0x14>
20001e4c: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:24
dest8[i] = src8[i];
20001e4e: | /-> 00058703 lb a4,0(a1)
20001e52: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e56: | | 167d addi a2,a2,-1
20001e58: | | 0685 addi a3,a3,1
20001e5a: | | 0585 addi a1,a1,1
20001e5c: | \-- fa6d bnez a2,20001e4e <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:26
}
return dest;
20001e5e: \----> 8082 ret
With `-falign-functions=16`. Perftest cycle count is 190300.
20001e60 <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e60: /----- ca11 beqz a2,20001e74 <memcpy+0x14>
20001e62: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:24
dest8[i] = src8[i];
20001e64: | /-> 00058703 lb a4,0(a1)
20001e68: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e6c: | | 167d addi a2,a2,-1
20001e6e: | | 0685 addi a3,a3,1
20001e70: | | 0585 addi a1,a1,1
20001e72: | \-- fa6d bnez a2,20001e64 <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:26
}
return dest;
20001e74: \----> 8082 ret
20001e76: 0000 unimp
20001e78: 0000 unimp
20001e7a: 0000 unimp
20001e7c: 0000 unimp
20001e7e: 0000 unimp
With `-falign-functions=4`. Perftest cycle count is 180310.
20001e4c <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e4c: /----- ca11 beqz a2,20001e60 <memcpy+0x14>
20001e4e: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:24
dest8[i] = src8[i];
20001e50: | /-> 00058703 lb a4,0(a1)
20001e54: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e58: | | 167d addi a2,a2,-1
20001e5a: | | 0685 addi a3,a3,1
20001e5c: | | 0585 addi a1,a1,1
20001e5e: | \-- fa6d bnez a2,20001e50 <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:26
}
return dest;
20001e60: \----> 8082 ret
20001e62: 0000 unimp
With `__asm__(".align 2")` before the loop (aligning to 4 adds a bunch more NOP instructions). Perftest cycle count is 180310.
20001e4c <memcpy>:
memcpy():
/proc/self/cwd/sw/device/lib/base/memory.c:23
OT_WEAK
void *OT_PREFIX_IF_NOT_RV32(memcpy)(void *restrict dest,
const void *restrict src, size_t len) {
uint8_t *dest8 = (uint8_t *)dest;
uint8_t *src8 = (uint8_t *)src;
for (size_t i = 0; i < len; ++i) {
20001e4c: /----- ca11 beqz a2,20001e60 <memcpy+0x14>
20001e4e: | 86aa mv a3,a0
/proc/self/cwd/sw/device/lib/base/memory.c:25
__asm__(".align 2");
dest8[i] = src8[i];
20001e50: | /-> 00058703 lb a4,0(a1)
20001e54: | | 00e68023 sb a4,0(a3)
/proc/self/cwd/sw/device/lib/base/memory.c:23
for (size_t i = 0; i < len; ++i) {
20001e58: | | 167d addi a2,a2,-1
20001e5a: | | 0685 addi a3,a3,1
20001e5c: | | 0585 addi a1,a1,1
20001e5e: | \-- fa6d bnez a2,20001e50 <memcpy+0x4>
/proc/self/cwd/sw/device/lib/base/memory.c:27
}
return dest;
20001e60: \----> 8082 ret
I am honestly at a loss. I can't see any difference in terms of word alignment between the faster and middle cases. My best guess at this point is that these different addresses somehow impact how the instruction prefetch buffer fills up. I don't know enough details about how that works though, so maybe the best we can do at this point is to simulate these and look at the waves.
Just so I understand, the cycle counts here are measured literally just before and after the `memcpy` call, I assume? So it shouldn't be including too much other stuff.
> I am honestly at a loss. I can't see any difference in terms of word alignment between the faster and middle cases. My best guess at this point is that these different addresses somehow impact how the instruction prefetch buffer fills up. I don't know enough details about how that works though, so maybe the best we can do at this point is to simulate these and look at the waves.
I haven't done this before. Will looking at the waves give us some insight into instruction fetching?
> Just so I understand, the cycle counts here are literally just before and after the `memcpy`, I assume? So it shouldn't be including too much other stuff.
The code being timed is literally just the `test_memcpy` function and the call it makes to the real `memcpy()`.
I haven't gone over all the details here, but note that you can see a performance penalty when executing 32-bit instructions that aren't 32-bit-aligned, in particular when jumping/branching to them. The issue is that an external memory read for an instruction fetch is 32-bit aligned, so if your target crosses a 32-bit boundary you need two memory reads before you can execute the instruction.
The ICache will complicate things here. If you jump to something that's not cached then you'll definitely need two fetches before you have your instruction. If it is cached, the ICache could grab the bytes on both sides of the boundary provided they're in the same cache line (though I'm not familiar enough with the ICache internals to know if it will). We have 64-bit lines, so 50% of unaligned cases will cross a cache line.
We can only read one cache line at once, so even if both sides of the boundary are cached we still need two cycles within the cache to read the instruction. When we're executing straight-line code (no branches/jumps), I believe the ICache's prefetch buffer will take care of things, so when we cross a cache line we can grab half out of the prefetch buffer and half out of a cache line read. Though, as above, I'm not familiar enough with the ICache internals to be certain of this.
Happy to take a look at some waves if you can point me to something specific to run and take a look at.
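A toy model of the fetch behavior described above may help (this is my reading of the explanation, not a statement about the actual RTL): a 32-bit instruction at a non-word-aligned address needs two 32-bit-aligned bus reads, and with 64-bit cache lines exactly half of those unaligned fetches also straddle a line boundary:

```python
# Toy model: count the 32-bit-aligned bus reads needed to fetch a 4-byte
# instruction at `addr`, and whether the fetch straddles an 8-byte cache line.
def bus_reads(addr, size=4, word=4):
    first, last = addr // word, (addr + size - 1) // word
    return last - first + 1

def crosses_line(addr, size=4, line=8):
    return addr // line != (addr + size - 1) // line

assert bus_reads(0x20001e4c) == 1  # word-aligned: one read suffices
assert bus_reads(0x20001e4e) == 2  # addr % 4 == 2: two reads needed

# Of the unaligned word offsets (addr % 4 == 2), exactly half cross an
# 8-byte line boundary -- matching the "50%" figure above.
unaligned = [a for a in range(0, 64, 2) if a % 4 == 2]
assert sum(crosses_line(a) for a in unaligned) * 2 == len(unaligned)
```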
There's also some performance concerns seen on coremark runs, so this is potentially still a live issue requiring further investigation for M2.5.
After some investigation no hardware performance issues found in coremark: https://github.com/lowRISC/opentitan/issues/17370
Moving to software since there were no issues found on the hardware side
Assigning to @alphan and @luismarques to decide who to forward assign to.
This appears to be a SW task and so should come out of M5. @moidx / @luismarques Marking to discuss in triage meeting / deprioritize.
The only way to make this test not brittle is to rewrite it in assembly. While we can play tricks with the compiler to minimize the likelihood of the cycle count being changed by other codebase changes, there's no way to prevent it completely. I can take on the rewrite task, though it doesn't seem very high priority to me.
I looked into this more carefully. I had initially assumed this test was meant to check the performance of memory operations (after all, it's called memory_perftest), but maybe it's actually meant to check the performance of our C functions that deal with memory operations. If that's the case, of course those functions can't be rewritten in assembly only for this test. I see 3 options:
Otherwise: rewrite the `memcpy`, etc. functions in assembly. That would provide some advantages outside the test itself, but the risk of doing that now seems quite high.

This isn't related to rv_core_ibex except that the Ibex executes the functions whose performance we try to measure. I don't think it should be in M5 but rather in a software-focused milestone.

How to "fix" the issue, I don't know. Rewriting the whole test in assembly doesn't sound like a good use of time. Maybe it would be possible to define an expected cycle count and add some interval to absorb variations due to binary changes, instead of expecting exact cycle counts?
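The interval idea could look something like this. This is only a sketch, not the actual memory_perftest harness (which is C); the `check_cycles` name and the 5% tolerance are made up for illustration:

```python
# Sketch of a tolerance-based expectation check instead of exact cycle counts.
def check_cycles(name, expected, actual, tolerance=0.05):
    """Pass if `actual` is within `tolerance` (fractional) of `expected`."""
    lo, hi = expected * (1 - tolerance), expected * (1 + tolerance)
    ok = lo <= actual <= hi
    print(f"{'OK' if ok else 'FAIL'}: {name} expected ~{expected}, got {actual}")
    return ok

assert check_cycles("memcpy", 190290, 190300)      # tiny drift: passes
assert not check_cycles("memcpy", 190290, 210280)  # ~10% regression: fails
```

The trade-off is picking a tolerance wide enough to absorb alignment-induced drift (the 190290 vs 190300 noise) but tight enough to still catch real regressions like the 210280 jump.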
Assigning to SiVal for lack of a more appropriate SW milestone. Feel free to reassign (though we don't think it's relevant for Earlgrey-PROD TO).
cc @cfrantz @alphan
Something in PR https://github.com/lowRISC/opentitan/pull/14044 invalidated `memory_perftest`'s hardcoded cycle count expectations. I updated the expectations in #14399, but left the test "flaky" and "manual" because we don't yet understand why it happened.

Disassembly diff: https://gist.github.com/dmcardle/4420de24287f77edf4642b5428d04878. This compares `//sw/device/lib/base:memory_perftest_prog_fpga_cw310_dis` from 4a22ede583c6b79d696f61c80c2dc2b2c84e295e to 3e99db1e33bbb4988011a9bdb21ab661f666c7e9.

I'm arbitrarily picking the "memcpy" test, which changed from 190290 to 210280 cycles. In theory, we only need to pay attention to code that is measured by the perftest, i.e. code that runs between reads of the cycle counter.

That means we'll start with `test_memcpy()`. Its code has not changed, except that it jumps to a slightly different address for `memcpy()`.

Digging into `memcpy()`, the instructions and relative jumps are identical, but the entire function has been shifted to lower addresses. (This appears to have happened because `math` code like `udiv64_slow` used to come immediately before `memcpy`, but now it comes sometime after.)

That's about as far as I can go with the disassembly diff, since `memcpy` is a leaf node. I can't really explain how moving `test_memcpy` and `memcpy` closer to the test's call site could make this test run slower. If anything, I'd expect more hits on the instruction cache.

@cfrantz Did you get further in your investigation?