Closed: camel-cdr closed this issue 2 months ago.
Thanks for the detailed report, and that results website is very cool. Glad to see the RVV implementation face-offs begin.
I've fixed all the bugs these benchmarks exposed in this PR to chipyard: https://github.com/ucb-bar/chipyard/pull/2003. This will be merged shortly.
BTW, the Rocket-based Saturn config seriously trades off performance for area efficiency; many floating-point operations are serialized in these designs. The most sensible "general-purpose performance" config is the GENV256D128ShuttleConfig. Detailed documentation on the differences between these design points is forthcoming.
I've also added Zvbb support, as well as B (Zba/Zbb/Zbs) support; the byte-reverse loop should be much better now without the gatherei.
That was quick! Almost everything works now; only mandelbrot_rvv_f64_m2 failed:
{
title: "mandelbrot 10",
labels: ["0","scalar_f32","scalar_f64","rvv_f16_m1","rvv_f16_m2","rvv_f32_m1","rvv_f32_m2","rvv_f64_m2",],
data: [
[1,4,9,16,],
[0.0032786,0.0032414,0.0033358,0.0030870,],
[0.0056179,0.0029962,0.0030622,0.0028368,],
[0.0036496,0.0052424,0.0067516,0.0092592,],
[0.0033898,0.0050632,0.0066864,0.0091848,],
[0.0037735,0.0058997,0.0074380,0.0101781,],
[0.0038610,0.0057971,0.0074074,0.0101458,],
[460525000] %Error: TestHarness.sv:99: Assertion failed in TOP.TestDriver.testHarness: Assertion failed: *** FAILED *** (exit code = 4)
at SimTSI.scala:21 assert(!error, "*** FAILED *** (exit code = %%d)\n", exit >> 1.U)
%Error: /chipyard/sims/verilator/generated-src/chipyard.harness.TestHarness.GENV256D128ShuttleConfig/gen-collateral/TestHarness.sv:99: Verilog $stop
The chacha20 performance was surprisingly good at 0.13 bytes/cycle without Zvbb, which is faster than the C908, C910, and X60. With Zvbb it's even better, at 1.8 bytes/cycle. The website has been updated: https://camel-cdr.github.io/rvv-bench-results/saturn/index.html
Oops, sorry, I forgot to mention: I wasn't able to reproduce mandelbrot working correctly with Spike or Saturn. I get a misaligned load exception on a fld:
1211637 (spike) core 0: 0x00000000800012c8 (0x0000229c) c.fld fa5, 0(a3)
1211638 core 0: exception trap_load_address_misaligned, epc 0x00000000800012c8
1211639 core 0: tval 0x0000000080001294
1211640 (spike) core 0: >>>> trap_entry
Command:
spike --isa=rv64gcbv_zvl256b_zbb_zvbb_zfh_zvfh_zicntr tests/rvv-bench/bench/mandelbrot
My bad, you are right: I forgot to align the data. Since everything seems to work now, I've added it to the main page navigation.
I still have to figure out how to run the single-instruction throughput measurement, but the current problem is definitely on my side.
@jerryz123
I ran my UTF-8 to UTF-16 conversion code on 20000 characters of each language on Saturn:
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0225054 b/c rvv: 0.0900373 b/c speedup: 4.0006932x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0296884 b/c rvv: 0.0792564 b/c speedup: 2.6696089x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0356656 b/c rvv: 0.0656792 b/c speedup: 1.8415244x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0224919 b/c rvv: 0.0900457 b/c speedup: 4.0034761x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0278030 b/c rvv: 0.0792438 b/c speedup: 2.8501820x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0292274 b/c rvv: 0.0792684 b/c speedup: 2.7121197x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0261706 b/c rvv: 0.0791559 b/c speedup: 3.0246053x
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1089496 b/c rvv: 1.0249051 b/c speedup: 9.4071435x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0227491 b/c rvv: 0.0901449 b/c speedup: 3.9625538x
I didn't actually expect a speedup, because the code always uses six vrgather.vv instructions, plus zero to six vcompress.vm instructions depending on the code path, and Saturn implements those at one element per cycle. There still was a big speedup though, which is great.
Compared to other processors it's still slower in both scalar and RVV, though:
C908:
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0331383 b/c rvv: 0.1696342 b/c speedup: 5.1189761x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0457665 b/c rvv: 0.1292095 b/c speedup: 2.8232333x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0529478 b/c rvv: 0.0873716 b/c speedup: 1.6501434x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0330992 b/c rvv: 0.1703227 b/c speedup: 5.1458171x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0424541 b/c rvv: 0.1291317 b/c speedup: 3.0416777x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0449738 b/c rvv: 0.1291728 b/c speedup: 2.8721733x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0402183 b/c rvv: 0.1290117 b/c speedup: 3.2077824x
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1304180 b/c rvv: 1.0384059 b/c speedup: 7.9621320x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0333600 b/c rvv: 0.1700943 b/c speedup: 5.0987380x
X60:
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0358049 b/c rvv: 0.3308416 b/c speedup: 9.2401013x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0504850 b/c rvv: 0.2533612 b/c speedup: 5.0185424x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0528976 b/c rvv: 0.1696223 b/c speedup: 3.2066141x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0355790 b/c rvv: 0.3304208 b/c speedup: 9.2869466x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0464926 b/c rvv: 0.2534793 b/c speedup: 5.4520358x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0489283 b/c rvv: 0.2532353 b/c speedup: 5.1756344x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0436021 b/c rvv: 0.2531742 b/c speedup: 5.8064559x
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1869340 b/c rvv: 1.4262712 b/c speedup: 7.6298090x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0359793 b/c rvv: 0.3318491 b/c speedup: 9.2233155x
Is it possible that at least the scalar part is slower because of the DRAMSim configuration? How can I change it to a faster one?
I wasn't able to run the bench_8to16 binary; I get a store access fault in spike rather quickly, somewhere in nolibc_main:
core 0: 0x00000000800001ee (0x000052fd) c.li t0, -1
core 0: 0x00000000800001f0 (0x0110000f) fence w,w
core 0: 0x00000000800001f4 (0xe851a023) sw t0, -384(gp)
core 0: 0x00000000800001f8 (0x00004505) c.li a0, 1
core 0: 0x00000000800001fa (0x00002597) auipc a1, 0x2
core 0: 0x00000000800001fe (0x60658593) addi a1, a1, 1542
core 0: 0x0000000080000202 (0x8181b603) ld a2, -2024(gp)
core 0: 0x0000000080000206 (0x02a000ef) jal pc + 0x2a
core 0: >>>> main
core 0: 0x0000000080000230 (0x00001101) c.addi sp, -32
core 0: 0x0000000080000232 (0x0000ec06) c.sdsp ra, 24(sp)
core 0: 0x0000000080000234 (0x0000e822) c.sdsp s0, 16(sp)
core 0: 0x0000000080000236 (0x0000e426) c.sdsp s1, 8(sp)
core 0: 0x0000000080000238 (0x0000e04a) c.sdsp s2, 0(sp)
core 0: 0x000000008000023a (0x016010ef) jal pc + 0x1016
core 0: >>>> nolibc_main
core 0: 0x0000000080001250 (0x0000715d) c.addi16sp sp, -80
core 0: 0x0000000080001252 (0x0000f84a) c.sdsp s2, 48(sp)
core 0: 0x0000000080001254 (0x00001917) auipc s2, 0x1
core 0: 0x0000000080001258 (0x6ec90913) addi s2, s2, 1772
core 0: 0x000000008000125c (0x00093783) ld a5, 0(s2)
core 0: 0x0000000080001260 (0xf60002b7) lui t0, 0xf6000
core 0: 0x0000000080001264 (0x0000e0a2) c.sdsp s0, 64(sp)
core 0: 0x0000000080001266 (0x0000fc26) c.sdsp s1, 56(sp)
core 0: 0x0000000080001268 (0x0000e486) c.sdsp ra, 72(sp)
core 0: 0x000000008000126a (0x0000f44e) c.sdsp s3, 40(sp)
core 0: 0x000000008000126c (0x0000ac22) c.fsdsp fs0, 24(sp)
core 0: 0x000000008000126e (0x0000a826) c.fsdsp fs1, 16(sp)
core 0: 0x0000000080001270 (0xf6000737) lui a4, 0xf6000
core 0: 0x0000000080001274 (0x00009116) c.add sp, t0
core 0: 0x0000000080001276 (0x0a0006b7) lui a3, 0xa000
core 0: 0x000000008000127a (0x000096ba) c.add a3, a4
core 0: 0x000000008000127c (0x00000818) c.addi4spn a4, sp, 16
core 0: 0x000000008000127e (0x00009736) c.add a4, a3
core 0: 0x0000000080001280 (0x00006794) c.ld a3, 8(a5)
core 0: 0x0000000080001282 (0x02000637) lui a2, 0x2000
core 0: 0x0000000080001286 (0x00004585) c.li a1, 1
core 0: 0x0000000080001288 (0x0000853a) c.mv a0, a4
core 0: 0x000000008000128a (0x0000e03a) c.sdsp a4, 0(sp)
core 0: exception trap_store_access_fault, epc 0x000000008000128a
core 0: tval 0x000000007602bf90
As for the performance, I would guess the dual-issue Shuttle scalar core just can't achieve as high an IPC as the other implementations, but I'd like to dig into this myself.
@jerryz123 The default bench.c is built to read input from stdin; to run on Saturn I modified it as follows:
#define NOLIBC_MAIN
#include "../nolibc.h"
#include "scalar.h"

size_t utf8_to_utf16_rvv(char const *src, size_t n, uint16_t *dest);

static char in[] = {
    // copy-paste the output of "head -c 1000 Language-Lipsum.utf8.txt | xxd -i | tr -d ' \n'"
    // the utf8.txt files come from the lipsum/ directory of https://github.com/lemire/unicode_lipsum/
};
static uint64_t out[sizeof in];

int
main(void)
{
    size_t inSize = sizeof in;
    print("start\n")(flush,);
    for (size_t i = 0; i < 3; ++i) {
        uint64_t beg, end;

        beg = rv_cycles();
        utf8_to_utf16_scalar((void*)in, inSize, (void*)out);
        end = rv_cycles();
        double scalar_bc = inSize * 1.0 / (end - beg);

        beg = rv_cycles();
        utf8_to_utf16_rvv((void*)in, inSize, (void*)out);
        end = rv_cycles();
        double rvv_bc = inSize * 1.0 / (end - beg);

        print("scalar: ")(f,scalar_bc)(" b/c rvv: ")(f,rvv_bc)(" b/c speedup: ")(f,rvv_bc/scalar_bc)("x\n")(flush,);
    }
    return 0;
}
I'll try to run it on spike as well, but I'm not used to working with it; qemu seems a lot easier, but apparently less correct.
The C908 and X60 are also dual issue in-order cores.
I get an undefined reference error when replacing the bench.c:
riscv64-unknown-elf-gcc -static -fno-common -fno-builtin-printf -march=rv64gcv_zfh -specs=htif_nano.specs -O3 -DNAME=utf8_to_utf32 8toN_gather.c bench.c -o bench_8to32
/nscratch/jerryz/chipyard-proj/tools-13/lib/gcc/riscv64-unknown-elf/13.2.0/../../../../riscv64-unknown-elf/bin/ld: warning: bench_8to32 has a LOAD segment with RWX permissions
/nscratch/jerryz/chipyard-proj/tools-13/lib/gcc/riscv64-unknown-elf/13.2.0/../../../../riscv64-unknown-elf/bin/ld: /tmp/ccywNTBP.o: in function `nolibc_main':
bench.c:(.text+0xc62): undefined reference to `utf8_to_utf16_rvv'
collect2: error: ld returned 1 exit status
make: *** [Makefile:13: bench_8to32] Error 1
If you send me an ELF that you ran on Saturn, that would be easiest.
@jerryz123 ah, try make bench_8to16; the modification assumes that target.
I can send you an ELF as well; here is one with 1000 characters from each of the languages: bench_8to16.zip
I noticed that this benchmark tends to include many back-to-back loads and stores. Unfortunately, the Shuttle dual-issue core does not support a dual-issue L1 data cache, so those blocks are serialized here. If the other implementations do have a dual-issue cache port, then this might account for at least part of the degradation.
I did notice one opportunity to make the vrgathers faster by avoiding an unnecessary stall, bringing RVV performance up to >0.1 b/c. I'll push this update soon.
Hi @jerryz123, we were corresponding about this project over email earlier.
I've run my RVV benchmark on it again, and there are still a few benchmarks that run into bugs.
The results for the working benchmarks can be found here: https://camel-cdr.github.io/rvv-bench-results/saturn/index.html It's not linked from the main page yet; I'll add that once I get all benchmarks running.
The performance looks quite good across the board. I was surprised how well vrgather performed in the benchmarks, considering it has a one-element-per-cycle implementation (iirc); it looks like the chaining is working well.
Crashes
Wrong result
Reproduction Environment
This is quite convoluted, but it's the best way I've figured out for building chipyard projects incrementally: