teknoraver opened this issue 3 years ago
@MichaelZhuxx @davidlt @tekkamanninja does this match what you see currently?
Those are the same numbers that I measured while writing the C memcpy. I made this kernel code:
preempt_disable();

/* copy PG_SIZE bytes for every source/destination offset combination */
for (i = 0; i < sizeof(void *); i++) {
        for (j = 0; j < sizeof(void *); j++) {
                t0 = ktime_get();
                memcpy(dst + j, src + i, PG_SIZE - max(i, j));
                t1 = ktime_get();
                printk("Strings selftest: memcpy(src+%d, dst+%d): %llu Mb/s\n",
                       i, j, PG_SIZE * (1000000000l / 1048576l) / (t1 - t0));
        }
}

/* same measurement for memset at every destination offset */
for (i = 0; i < sizeof(void *); i++) {
        t0 = ktime_get();
        memset(dst + i, 0, PG_SIZE - i);
        t1 = ktime_get();
        printk("Strings selftest: memset(dst+%d): %llu Mb/s\n",
               i, PG_SIZE * (1000000000l / 1048576l) / (t1 - t0));
}

preempt_enable();
Which gives this output:
root@beaglev:~# rmmod test_string ; modprobe test_string
rmmod: ERROR: Module test_string is not currently loaded
[ 69.698527] Strings selftest: testing with size: 4194304
[ 69.723021] Strings selftest: memcpy(src+0, dst+0): 213 Mb/s
[ 69.746692] Strings selftest: memcpy(src+0, dst+1): 222 Mb/s
[ 69.770174] Strings selftest: memcpy(src+0, dst+2): 224 Mb/s
[ 69.793547] Strings selftest: memcpy(src+0, dst+3): 225 Mb/s
[ 69.816835] Strings selftest: memcpy(src+0, dst+4): 227 Mb/s
[ 69.840224] Strings selftest: memcpy(src+0, dst+5): 225 Mb/s
[ 69.863564] Strings selftest: memcpy(src+0, dst+6): 226 Mb/s
[ 69.886847] Strings selftest: memcpy(src+0, dst+7): 227 Mb/s
[ 69.910232] Strings selftest: memcpy(src+1, dst+0): 226 Mb/s
[ 69.932790] Strings selftest: memcpy(src+1, dst+1): 236 Mb/s
[ 69.956132] Strings selftest: memcpy(src+1, dst+2): 226 Mb/s
[ 69.979467] Strings selftest: memcpy(src+1, dst+3): 226 Mb/s
[ 70.002808] Strings selftest: memcpy(src+1, dst+4): 226 Mb/s
[ 70.026143] Strings selftest: memcpy(src+1, dst+5): 226 Mb/s
[ 70.049453] Strings selftest: memcpy(src+1, dst+6): 227 Mb/s
[ 70.072797] Strings selftest: memcpy(src+1, dst+7): 226 Mb/s
[ 70.096133] Strings selftest: memcpy(src+2, dst+0): 226 Mb/s
[ 70.119441] Strings selftest: memcpy(src+2, dst+1): 227 Mb/s
[ 70.142006] Strings selftest: memcpy(src+2, dst+2): 236 Mb/s
[ 70.165370] Strings selftest: memcpy(src+2, dst+3): 225 Mb/s
[ 70.188711] Strings selftest: memcpy(src+2, dst+4): 226 Mb/s
[ 70.212048] Strings selftest: memcpy(src+2, dst+5): 226 Mb/s
[ 70.235376] Strings selftest: memcpy(src+2, dst+6): 226 Mb/s
[ 70.258700] Strings selftest: memcpy(src+2, dst+7): 226 Mb/s
[ 70.282063] Strings selftest: memcpy(src+3, dst+0): 225 Mb/s
[ 70.305410] Strings selftest: memcpy(src+3, dst+1): 225 Mb/s
[ 70.328712] Strings selftest: memcpy(src+3, dst+2): 227 Mb/s
[ 70.351305] Strings selftest: memcpy(src+3, dst+3): 235 Mb/s
[ 70.374670] Strings selftest: memcpy(src+3, dst+4): 225 Mb/s
[ 70.397941] Strings selftest: memcpy(src+3, dst+5): 227 Mb/s
[ 70.421308] Strings selftest: memcpy(src+3, dst+6): 226 Mb/s
[ 70.444628] Strings selftest: memcpy(src+3, dst+7): 226 Mb/s
[ 70.467890] Strings selftest: memcpy(src+4, dst+0): 227 Mb/s
[ 70.491305] Strings selftest: memcpy(src+4, dst+1): 225 Mb/s
[ 70.514641] Strings selftest: memcpy(src+4, dst+2): 226 Mb/s
[ 70.537944] Strings selftest: memcpy(src+4, dst+3): 227 Mb/s
[ 70.560565] Strings selftest: memcpy(src+4, dst+4): 236 Mb/s
[ 70.583914] Strings selftest: memcpy(src+4, dst+5): 225 Mb/s
[ 70.607209] Strings selftest: memcpy(src+4, dst+6): 227 Mb/s
[ 70.630602] Strings selftest: memcpy(src+4, dst+7): 225 Mb/s
[ 70.653941] Strings selftest: memcpy(src+5, dst+0): 226 Mb/s
[ 70.677238] Strings selftest: memcpy(src+5, dst+1): 227 Mb/s
[ 70.700576] Strings selftest: memcpy(src+5, dst+2): 226 Mb/s
[ 70.723944] Strings selftest: memcpy(src+5, dst+3): 225 Mb/s
[ 70.747327] Strings selftest: memcpy(src+5, dst+4): 226 Mb/s
[ 70.769949] Strings selftest: memcpy(src+5, dst+5): 236 Mb/s
[ 70.793287] Strings selftest: memcpy(src+5, dst+6): 226 Mb/s
[ 70.816585] Strings selftest: memcpy(src+5, dst+7): 227 Mb/s
[ 70.839973] Strings selftest: memcpy(src+6, dst+0): 225 Mb/s
[ 70.863334] Strings selftest: memcpy(src+6, dst+1): 225 Mb/s
[ 70.886604] Strings selftest: memcpy(src+6, dst+2): 227 Mb/s
[ 70.909953] Strings selftest: memcpy(src+6, dst+3): 226 Mb/s
[ 70.933301] Strings selftest: memcpy(src+6, dst+4): 225 Mb/s
[ 70.956558] Strings selftest: memcpy(src+6, dst+5): 227 Mb/s
[ 70.979147] Strings selftest: memcpy(src+6, dst+6): 236 Mb/s
[ 71.002507] Strings selftest: memcpy(src+6, dst+7): 225 Mb/s
[ 71.025827] Strings selftest: memcpy(src+7, dst+0): 226 Mb/s
[ 71.049101] Strings selftest: memcpy(src+7, dst+1): 227 Mb/s
[ 71.072444] Strings selftest: memcpy(src+7, dst+2): 226 Mb/s
[ 71.095817] Strings selftest: memcpy(src+7, dst+3): 225 Mb/s
[ 71.119118] Strings selftest: memcpy(src+7, dst+4): 227 Mb/s
[ 71.142458] Strings selftest: memcpy(src+7, dst+5): 226 Mb/s
[ 71.165795] Strings selftest: memcpy(src+7, dst+6): 226 Mb/s
[ 71.188308] Strings selftest: memcpy(src+7, dst+7): 237 Mb/s
[ 71.194013] Strings selftest: testing with size: 4194304
[ 71.217246] Strings selftest: memset(dst+0): 226 Mb/s
[ 71.238157] Strings selftest: memset(dst+1): 252 Mb/s
[ 71.258795] Strings selftest: memset(dst+2): 257 Mb/s
[ 71.279346] Strings selftest: memset(dst+3): 257 Mb/s
[ 71.299862] Strings selftest: memset(dst+4): 258 Mb/s
[ 71.320378] Strings selftest: memset(dst+5): 258 Mb/s
[ 71.340886] Strings selftest: memset(dst+6): 258 Mb/s
[ 71.361419] Strings selftest: memset(dst+7): 258 Mb/s
[ 71.366475] String selftests succeeded
Most RISC-V implementations don't have hardware misaligned load/store support, so on these implementations OpenSBI traps and emulates misaligned loads/stores using unprivileged accesses (MSTATUS.MPRV).
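To make the cost concrete, here is a minimal user-space sketch (illustrative only, not code from the thread) of the kind of access that takes this path; with hardware misaligned support it is a single load, otherwise each access costs a full trap plus software emulation:

/* Illustrative only: a 64-bit load from a misaligned address. On cores
 * without hardware misaligned load/store support this traps and is
 * emulated in software, which is why it is so much slower. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint8_t buf[16] = { 0 };
        uint64_t v;

        buf[1] = 0xab;

        /* Direct misaligned dereference: this is what a naive
         * word-at-a-time memcpy does when src/dst are not 8-byte aligned. */
        volatile uint64_t *p = (uint64_t *)(buf + 1);
        v = *p;

        printf("misaligned load: 0x%016llx\n", (unsigned long long)v);
        return 0;
}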
Regards, Anup
@avpatel I'm doing aligned accesses. Unaligned accesses are 20x slower:
size: 1024 Mb
read size: 64 bit
unalignment: 1 byte
elapsed time: 93.46 sec
throughput: 10.96 Mb/s
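For context, this is roughly how such a measurement can be structured (a sketch only; the actual tool is linked later in the thread, and the 100 MiB buffer size here is an assumption):

/* Sketch of an unaligned-read throughput test: 64-bit reads starting one
 * byte past natural alignment. Buffer size and output format are assumed. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (100u << 20)        /* 100 MiB working buffer (assumed) */

int main(void)
{
        uint8_t *buf = malloc(BUF_SIZE + 8);
        volatile uint64_t sum = 0;
        struct timespec t0, t1;
        size_t off;

        if (!buf)
                return 1;
        memset(buf, 0x5a, BUF_SIZE + 8);        /* fault pages in before timing */

        uint8_t *p = buf + 1;        /* 1-byte misalignment */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (off = 0; off + 8 <= BUF_SIZE; off += 8) {
                uint64_t v;

                /* 8-byte read; whether the compiler emits one misaligned ld
                 * or several byte loads depends on the target tuning */
                memcpy(&v, p + off, 8);
                sum += v;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        printf("elapsed time: %.2f sec\n", sec);
        printf("throughput: %.2f MB/s\n", BUF_SIZE / sec / 1048576);

        free(buf);
        return 0;
}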
@teknoraver I had a similar observation in the past for Raspi 1/2 boards. This may also be related to slower DRAM (i.e. LPDDR) or some other system configuration of low-cost boards.
BTW, with the SBI v0.3 spec we now have SBI PMU firmware events, which can be used to see how much misaligned load/store handling, remote TLB flushing, etc. is done by OpenSBI.
Regards, Anup
@teknoraver
Those are the same numbers that I measured while writing the C memcpy. I made this kernel code:
The tool I've been using for the past 20+ years is at https://gist.github.com/geertu/f86486959f68b2337bd395d41a5d832c
read: 255 MiB/s write: 294 MiB/s copy: 18 MiB/s
I'm a bit surprised by the copy performance, which is much less than what you got. Usually it lies in the range of 50-100% of the minimum of read and write performance.
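For reference, here is a minimal sketch of the three measurements being compared (assumed buffer size and loop structure; the actual tool is at the gist above): sequential read, sequential write, and a plain memcpy over the same amount of data.

/* Sketch of a read/write/copy bandwidth comparison. The 64 MiB buffer
 * size and the loop structure are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64u << 20)        /* 64 MiB per buffer (assumed) */

static double seconds(struct timespec a, struct timespec b)
{
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
        uint64_t *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
        volatile uint64_t sink = 0;
        struct timespec t0, t1;
        size_t i;

        if (!src || !dst)
                return 1;
        memset(src, 0x5a, BUF_SIZE);        /* fault all pages in before timing */
        memset(dst, 0, BUF_SIZE);

        /* read: stream through the source buffer */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < BUF_SIZE / 8; i++)
                sink += src[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("read:  %.0f MiB/s\n", BUF_SIZE / seconds(t0, t1) / (1 << 20));

        /* write: fill the destination buffer */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset(dst, 0xa5, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("write: %.0f MiB/s\n", BUF_SIZE / seconds(t0, t1) / (1 << 20));

        /* copy: touches 2x the bytes (read + write), reported over BUF_SIZE */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy:  %.0f MiB/s\n", BUF_SIZE / seconds(t0, t1) / (1 << 20));

        free(src);
        free(dst);
        return 0;
}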
@davidlt any suggestions on how we could improve performance?
Don't know what's causing this or if this is a HW bottleneck.
Hi,
Things are now much better for the VisionFive 2's JH7110:
user@starfive:~$ ./unalign_check -r1
size: 100 Mb
read size: 8 bit
unalignment: 0 byte
elapsed time: 0.29 sec
throughput: 340.97 Mb/s
user@starfive:~$ ./unalign_check -r8
size: 100 Mb
read size: 64 bit
unalignment: 0 byte
elapsed time: 0.05 sec
throughput: 2076.48 Mb/s
user@starfive:~$ ./unalign_check -w1
size: 100 Mb
write size: 8 bit
unalignment: 0 byte
elapsed time: 0.29 sec
throughput: 340.09 Mb/s
user@starfive:~$ ./unalign_check -w8
size: 100 Mb
write size: 64 bit
unalignment: 0 byte
elapsed time: 0.13 sec
throughput: 754.71 Mb/s
But what about the JH7100?
The version of the U74 cores on the old JH7100 has a very slow memory controller, just like the U74 cores in the FU740 chip on the Unleashed board.
On top of that, the first BeagleV Starlight boards shipped with a version of the "SecondBoot" initialization code that didn't enable some bits. On the BeagleV board, memory access can be sped up to the (still slow) level of the VisionFive V1 by updating SecondBoot.
I found memory access pretty slow on the BeagleV. I can't go past 280 MB/s (which means 36 M reads per second):
For comparison, this is the same test on a different system with a 2 GHz CPU and 2400 MHz DDR4 memory:
The tool I'm using is at: https://gist.github.com/teknoraver/36f471ef97d4c6a6cb11148e72f9e975