Closed: camel-cdr closed this issue 2 months ago.
Thanks for the detailed report, and that results website is very cool. Glad to see the RVV implementation face-offs begin.
I've fixed all the bugs these benchmarks exposed in this PR to chipyard: https://github.com/ucb-bar/chipyard/pull/2003. This will be merged shortly.
BTW, the Rocket-based Saturn config seriously trades off performance for area efficiency; many floating-point operations are serialized in these designs. The most sensible "general-purpose performance" config is the GENV256D128ShuttleConfig. Detailed documentation on the differences between these design points is forthcoming.
I've also added Zvbb support, as well as B (Zba/Zbb/Zbs) support; the byte-reverse loop should be much better now without the gatherei.
That was quick! Almost everything works now; only mandelbrot_rvv_f64_m2 failed:
{
title: "mandelbrot 10",
labels: ["0","scalar_f32","scalar_f64","rvv_f16_m1","rvv_f16_m2","rvv_f32_m1","rvv_f32_m2","rvv_f64_m2",],
data: [
[1,4,9,16,],
[0.0032786,0.0032414,0.0033358,0.0030870,],
[0.0056179,0.0029962,0.0030622,0.0028368,],
[0.0036496,0.0052424,0.0067516,0.0092592,],
[0.0033898,0.0050632,0.0066864,0.0091848,],
[0.0037735,0.0058997,0.0074380,0.0101781,],
[0.0038610,0.0057971,0.0074074,0.0101458,],
[460525000] %Error: TestHarness.sv:99: Assertion failed in TOP.TestDriver.testHarness: Assertion failed: *** FAILED *** (exit code = 4)
at SimTSI.scala:21 assert(!error, "*** FAILED *** (exit code = %%d)\n", exit >> 1.U)
%Error: /chipyard/sims/verilator/generated-src/chipyard.harness.TestHarness.GENV256D128ShuttleConfig/gen-collateral/TestHarness.sv:99: Verilog $stop
The chacha20 performance was surprisingly good at 0.13 bytes/cycle without Zvbb, which is faster than the C908, C910, and X60. With Zvbb it's even better, at 1.8 bytes/cycle. The website has been updated: https://camel-cdr.github.io/rvv-bench-results/saturn/index.html
Oops, sorry, I forgot to mention: I wasn't able to reproduce mandelbrot working correctly with Spike or Saturn. I get a misaligned load exception on a fld:
1211637 (spike) core 0: 0x00000000800012c8 (0x0000229c) c.fld fa5, 0(a3)
1211638 core 0: exception trap_load_address_misaligned, epc 0x00000000800012c8
1211639 core 0: tval 0x0000000080001294
1211640 (spike) core 0: >>>> trap_entry
Command:
spike --isa=rv64gcbv_zvl256b_zbb_zvbb_zfh_zvfh_zicntr tests/rvv-bench/bench/mandelbrot
My bad, you are right: I forgot to align the data. Since everything seems to work now, I've added it to the main page navigation.
I still have to figure out how to run the single-instruction throughput measurement, but the current problem is definitely on my side.
@jerryz123
I ran my UTF-8 to UTF-16 conversion code on 20000 characters of each language on Saturn:
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0225054 b/c rvv: 0.0900373 b/c speedup: 4.0006932x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0296884 b/c rvv: 0.0792564 b/c speedup: 2.6696089x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0356656 b/c rvv: 0.0656792 b/c speedup: 1.8415244x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0224919 b/c rvv: 0.0900457 b/c speedup: 4.0034761x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0278030 b/c rvv: 0.0792438 b/c speedup: 2.8501820x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0292274 b/c rvv: 0.0792684 b/c speedup: 2.7121197x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0261706 b/c rvv: 0.0791559 b/c speedup: 3.0246053x
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1089496 b/c rvv: 1.0249051 b/c speedup: 9.4071435x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0227491 b/c rvv: 0.0901449 b/c speedup: 3.9625538x
I didn't actually expect a speedup, because the code always uses six vrgather.vv instructions, plus zero to six vcompress.vm instructions depending on the code path, and Saturn implements those at one element per cycle. There still was a big speedup though, which is great.
Compared to other processors it's still slower in both scalar and RVV, though:
C908:
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0331383 b/c rvv: 0.1696342 b/c speedup: 5.1189761x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0457665 b/c rvv: 0.1292095 b/c speedup: 2.8232333x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0529478 b/c rvv: 0.0873716 b/c speedup: 1.6501434x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0330992 b/c rvv: 0.1703227 b/c speedup: 5.1458171x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0424541 b/c rvv: 0.1291317 b/c speedup: 3.0416777x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0449738 b/c rvv: 0.1291728 b/c speedup: 2.8721733x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0402183 b/c rvv: 0.1290117 b/c speedup: 3.2077824x
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1304180 b/c rvv: 1.0384059 b/c speedup: 7.9621320x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0333600 b/c rvv: 0.1700943 b/c speedup: 5.0987380x
X60:
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0358049 b/c rvv: 0.3308416 b/c speedup: 9.2401013x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0504850 b/c rvv: 0.2533612 b/c speedup: 5.0185424x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0528976 b/c rvv: 0.1696223 b/c speedup: 3.2066141x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0355790 b/c rvv: 0.3304208 b/c speedup: 9.2869466x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0464926 b/c rvv: 0.2534793 b/c speedup: 5.4520358x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0489283 b/c rvv: 0.2532353 b/c speedup: 5.1756344x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0436021 b/c rvv: 0.2531742 b/c speedup: 5.8064559x
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1869340 b/c rvv: 1.4262712 b/c speedup: 7.6298090x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0359793 b/c rvv: 0.3318491 b/c speedup: 9.2233155x
Is it possible that at least the scalar part is slower because of the DRAMSim configuration? How can I change it to a faster one?
I wasn't able to run the bench_8to16 binary; I get a store access fault in spike rather quickly, somewhere in nolibc_main:
core 0: 0x00000000800001ee (0x000052fd) c.li t0, -1
core 0: 0x00000000800001f0 (0x0110000f) fence w,w
core 0: 0x00000000800001f4 (0xe851a023) sw t0, -384(gp)
core 0: 0x00000000800001f8 (0x00004505) c.li a0, 1
core 0: 0x00000000800001fa (0x00002597) auipc a1, 0x2
core 0: 0x00000000800001fe (0x60658593) addi a1, a1, 1542
core 0: 0x0000000080000202 (0x8181b603) ld a2, -2024(gp)
core 0: 0x0000000080000206 (0x02a000ef) jal pc + 0x2a
core 0: >>>> main
core 0: 0x0000000080000230 (0x00001101) c.addi sp, -32
core 0: 0x0000000080000232 (0x0000ec06) c.sdsp ra, 24(sp)
core 0: 0x0000000080000234 (0x0000e822) c.sdsp s0, 16(sp)
core 0: 0x0000000080000236 (0x0000e426) c.sdsp s1, 8(sp)
core 0: 0x0000000080000238 (0x0000e04a) c.sdsp s2, 0(sp)
core 0: 0x000000008000023a (0x016010ef) jal pc + 0x1016
core 0: >>>> nolibc_main
core 0: 0x0000000080001250 (0x0000715d) c.addi16sp sp, -80
core 0: 0x0000000080001252 (0x0000f84a) c.sdsp s2, 48(sp)
core 0: 0x0000000080001254 (0x00001917) auipc s2, 0x1
core 0: 0x0000000080001258 (0x6ec90913) addi s2, s2, 1772
core 0: 0x000000008000125c (0x00093783) ld a5, 0(s2)
core 0: 0x0000000080001260 (0xf60002b7) lui t0, 0xf6000
core 0: 0x0000000080001264 (0x0000e0a2) c.sdsp s0, 64(sp)
core 0: 0x0000000080001266 (0x0000fc26) c.sdsp s1, 56(sp)
core 0: 0x0000000080001268 (0x0000e486) c.sdsp ra, 72(sp)
core 0: 0x000000008000126a (0x0000f44e) c.sdsp s3, 40(sp)
core 0: 0x000000008000126c (0x0000ac22) c.fsdsp fs0, 24(sp)
core 0: 0x000000008000126e (0x0000a826) c.fsdsp fs1, 16(sp)
core 0: 0x0000000080001270 (0xf6000737) lui a4, 0xf6000
core 0: 0x0000000080001274 (0x00009116) c.add sp, t0
core 0: 0x0000000080001276 (0x0a0006b7) lui a3, 0xa000
core 0: 0x000000008000127a (0x000096ba) c.add a3, a4
core 0: 0x000000008000127c (0x00000818) c.addi4spn a4, sp, 16
core 0: 0x000000008000127e (0x00009736) c.add a4, a3
core 0: 0x0000000080001280 (0x00006794) c.ld a3, 8(a5)
core 0: 0x0000000080001282 (0x02000637) lui a2, 0x2000
core 0: 0x0000000080001286 (0x00004585) c.li a1, 1
core 0: 0x0000000080001288 (0x0000853a) c.mv a0, a4
core 0: 0x000000008000128a (0x0000e03a) c.sdsp a4, 0(sp)
core 0: exception trap_store_access_fault, epc 0x000000008000128a
core 0: tval 0x000000007602bf90
As for the performance, I would guess the dual-issue Shuttle scalar core just can't achieve as high an IPC as the other implementations, but I'd like to dig into this myself.
@jerryz123 The default bench.c is built to read input from stdin; to run on Saturn I modified it as follows:
#define NOLIBC_MAIN
#include "../nolibc.h"
#include "scalar.h"

size_t utf8_to_utf16_rvv(char const *src, size_t n, uint16_t *dest);

static char in[] = {
    // copy-paste the output of "head -c 1000 Language-Lipsum.utf8.txt | xxd -i | tr -d ' \n'"
    // the utf8.txt files come from the lipsum/ directory of https://github.com/lemire/unicode_lipsum/
};
static uint64_t out[sizeof in];

int
main(void)
{
    size_t inSize = sizeof in;
    print("start\n")(flush,);
    for (size_t i = 0; i < 3; ++i) {
        uint64_t beg, end;

        beg = rv_cycles();
        utf8_to_utf16_scalar((void*)in, inSize, (void*)out);
        end = rv_cycles();
        double scalar_bc = inSize * 1.0 / (end - beg);

        beg = rv_cycles();
        utf8_to_utf16_rvv((void*)in, inSize, (void*)out);
        end = rv_cycles();
        double rvv_bc = inSize * 1.0 / (end - beg);

        print("scalar: ")(f,scalar_bc)(" b/c rvv: ")(f,rvv_bc)(" b/c speedup: ")(f,rvv_bc/scalar_bc)("x\n")(flush,);
    }
    return 0;
}
I'll try to run it on spike as well, but I'm not used to working with it; qemu seems a lot easier, but apparently less correct.
The C908 and X60 are also dual issue in-order cores.
I get an undefined reference error when replacing the bench.c:
riscv64-unknown-elf-gcc -static -fno-common -fno-builtin-printf -march=rv64gcv_zfh -specs=htif_nano.specs -O3 -DNAME=utf8_to_utf32 8toN_gather.c bench.c -o bench_8to32
/nscratch/jerryz/chipyard-proj/tools-13/lib/gcc/riscv64-unknown-elf/13.2.0/../../../../riscv64-unknown-elf/bin/ld: warning: bench_8to32 has a LOAD segment with RWX permissions
/nscratch/jerryz/chipyard-proj/tools-13/lib/gcc/riscv64-unknown-elf/13.2.0/../../../../riscv64-unknown-elf/bin/ld: /tmp/ccywNTBP.o: in function `nolibc_main':
bench.c:(.text+0xc62): undefined reference to `utf8_to_utf16_rvv'
collect2: error: ld returned 1 exit status
make: *** [Makefile:13: bench_8to32] Error 1
If you send me an ELF that you ran on Saturn, that would be easiest.
@jerryz123 ah, try make bench_8to16; the modification assumes that target.
I can send you an ELF as well; here is one with 1000 characters from each of the languages: bench_8to16.zip
I noticed that this benchmark tends to include many back-to-back loads and stores. Unfortunately, the Shuttle dual-issue core does not support a dual-issue L1 data cache, so those blocks are serialized here. If the other implementations do have a dual-issue cache port, then this might account for at least part of the degradation.
I did notice one opportunity to make the vrgathers faster by avoiding an unnecessary stall, bringing RVV performance up to >0.1 b/c. I'll push this update soon.
Hi @jerryz123, we were corresponding about this project over email earlier.
I've run my RVV benchmark on it again, and there are still a few benchmarks that run into bugs.
The results for the working benchmarks can be found here: https://camel-cdr.github.io/rvv-bench-results/saturn/index.html It's not linked from the main page yet; I'll add that once I get all benchmarks running.
The performance looks quite good across the board. I was surprised how well vrgather performed in the benchmarks, considering it has a one-element-per-cycle implementation (iirc); it looks like the chaining is working well.
Crashes
Wrong result
Reproduction Environment
This is quite convoluted, but it's the best way I've figured out for building chipyard projects incrementally: