starfive-tech / linux

RAM access is slow #27

Open teknoraver opened 3 years ago

teknoraver commented 3 years ago

I found memory access to be pretty slow on the BeagleV. I can't get past about 280 MB/s (which means about 36 M 64-bit reads per second):

# unalign_check -r1
size:           100 Mb
read size:      8 bit
unalignment:    0 byte
elapsed time:   0.64 sec
throughput:     155.19 Mb/s
# unalign_check -r8
size:           100 Mb
read size:      64 bit
unalignment:    0 byte
elapsed time:   0.36 sec
throughput:     276.69 Mb/s
# unalign_check -w1
size:           100 Mb
write size:     8 bit
unalignment:    0 byte
elapsed time:   0.72 sec
throughput:     138.50 Mb/s
# unalign_check -w8
size:           100 Mb
write size:     64 bit
unalignment:    0 byte
elapsed time:   0.42 sec
throughput:     239.11 Mb/s
#

For comparison, this is the same test on a different system with a 2 GHz CPU and 2400 MHz DDR4 memory:

# unalign_check -r1
size:           100 Mb
read size:      8 bit
unalignment:    0 byte
elapsed time:   0.10 sec
throughput:     952.71 Mb/s
# unalign_check -r8
size:           100 Mb
read size:      64 bit
unalignment:    0 byte
elapsed time:   0.01 sec
throughput:     7240.48 Mb/s
# unalign_check -w1
size:           100 Mb
write size:     8 bit
unalignment:    0 byte
elapsed time:   0.10 sec
throughput:     953.16 Mb/s
# unalign_check -w8
size:           100 Mb
write size:     64 bit
unalignment:    0 byte
elapsed time:   0.01 sec
throughput:     7625.81 Mb/s
#

The tool I'm using is at: https://gist.github.com/teknoraver/36f471ef97d4c6a6cb11148e72f9e975
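For readers who don't want to open the gist, here is a minimal sketch of this kind of measurement. It is not the code from the linked tool: the function name `read64_mib_per_sec` and its parameters are made up for illustration. It also assumes the accesses go through `memcpy`, which keeps the C well-defined at any offset but means a strict-alignment compiler may emit byte loads instead of a single (possibly trapping) 64-bit load, so the numbers for a non-zero offset are not directly comparable to a tool that dereferences a misaligned pointer.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Read `len` bytes starting at `base + offset` in 64-bit chunks and
 * return the throughput in MiB/s. The running sum is handed back via
 * *sum_out so the compiler cannot discard the loads.
 * Hypothetical helper, not taken from the linked gist.
 */
static double read64_mib_per_sec(const unsigned char *base, size_t offset,
                                 size_t len, uint64_t *sum_out)
{
    struct timespec t0, t1;
    uint64_t sum = 0;

    assert(len % sizeof(uint64_t) == 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += sizeof(uint64_t)) {
        uint64_t v;

        /* Well-defined at any alignment, unlike a pointer cast. */
        memcpy(&v, base + offset + i, sizeof(v));
        sum += v;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sum_out = sum;

    double sec = (double)(t1.tv_sec - t0.tv_sec) +
                 (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return ((double)len / (1024.0 * 1024.0)) / sec;
}
```

A caller would allocate `len + 8` bytes, fill the buffer once so the pages are actually mapped, and then call the function with `offset` set to 0 for the aligned case or 1 to 7 for the unaligned cases.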

pdp7 commented 3 years ago

@MichaelZhuxx @davidlt @tekkamanninja does this match what you see currently?

teknoraver commented 3 years ago

Those are the same numbers that I measured while writing the C memcpy. I made this kernel code:

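    /* Disable preemption so a context switch can't land in a timed region. */
    preempt_disable();

    /* Time memcpy() for every source (i) and destination (j) misalignment. */
    for (i = 0; i < sizeof(void*); i++) {
        for (j = 0; j < sizeof(void*); j++) {
            t0 = ktime_get();
            memcpy(dst + j, src + i, PG_SIZE - max(i, j));
            t1 = ktime_get();
            /* t1 - t0 is in nanoseconds; scale bytes/ns to Mb/s. */
            printk("Strings selftest: memcpy(src+%d, dst+%d): %llu Mb/s\n", i, j, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
        }
    }

    /* Same measurement for memset() at every destination misalignment. */
    for (i = 0; i < sizeof(void*); i++) {
        t0 = ktime_get();
        memset(dst + i, 0, PG_SIZE - i);
        t1 = ktime_get();
        printk("Strings selftest: memset(dst+%d): %llu Mb/s\n", i, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
    }

    preempt_enable();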

Which gives this output:

root@beaglev:~# rmmod test_string ; modprobe test_string
rmmod: ERROR: Module test_string is not currently loaded
[   69.698527] Strings selftest: testing with size: 4194304
[   69.723021] Strings selftest: memcpy(src+0, dst+0): 213 Mb/s
[   69.746692] Strings selftest: memcpy(src+0, dst+1): 222 Mb/s
[   69.770174] Strings selftest: memcpy(src+0, dst+2): 224 Mb/s
[   69.793547] Strings selftest: memcpy(src+0, dst+3): 225 Mb/s
[   69.816835] Strings selftest: memcpy(src+0, dst+4): 227 Mb/s
[   69.840224] Strings selftest: memcpy(src+0, dst+5): 225 Mb/s
[   69.863564] Strings selftest: memcpy(src+0, dst+6): 226 Mb/s
[   69.886847] Strings selftest: memcpy(src+0, dst+7): 227 Mb/s
[   69.910232] Strings selftest: memcpy(src+1, dst+0): 226 Mb/s
[   69.932790] Strings selftest: memcpy(src+1, dst+1): 236 Mb/s
[   69.956132] Strings selftest: memcpy(src+1, dst+2): 226 Mb/s
[   69.979467] Strings selftest: memcpy(src+1, dst+3): 226 Mb/s
[   70.002808] Strings selftest: memcpy(src+1, dst+4): 226 Mb/s
[   70.026143] Strings selftest: memcpy(src+1, dst+5): 226 Mb/s
[   70.049453] Strings selftest: memcpy(src+1, dst+6): 227 Mb/s
[   70.072797] Strings selftest: memcpy(src+1, dst+7): 226 Mb/s
[   70.096133] Strings selftest: memcpy(src+2, dst+0): 226 Mb/s
[   70.119441] Strings selftest: memcpy(src+2, dst+1): 227 Mb/s
[   70.142006] Strings selftest: memcpy(src+2, dst+2): 236 Mb/s
[   70.165370] Strings selftest: memcpy(src+2, dst+3): 225 Mb/s
[   70.188711] Strings selftest: memcpy(src+2, dst+4): 226 Mb/s
[   70.212048] Strings selftest: memcpy(src+2, dst+5): 226 Mb/s
[   70.235376] Strings selftest: memcpy(src+2, dst+6): 226 Mb/s
[   70.258700] Strings selftest: memcpy(src+2, dst+7): 226 Mb/s
[   70.282063] Strings selftest: memcpy(src+3, dst+0): 225 Mb/s
[   70.305410] Strings selftest: memcpy(src+3, dst+1): 225 Mb/s
[   70.328712] Strings selftest: memcpy(src+3, dst+2): 227 Mb/s
[   70.351305] Strings selftest: memcpy(src+3, dst+3): 235 Mb/s
[   70.374670] Strings selftest: memcpy(src+3, dst+4): 225 Mb/s
[   70.397941] Strings selftest: memcpy(src+3, dst+5): 227 Mb/s
[   70.421308] Strings selftest: memcpy(src+3, dst+6): 226 Mb/s
[   70.444628] Strings selftest: memcpy(src+3, dst+7): 226 Mb/s
[   70.467890] Strings selftest: memcpy(src+4, dst+0): 227 Mb/s
[   70.491305] Strings selftest: memcpy(src+4, dst+1): 225 Mb/s
[   70.514641] Strings selftest: memcpy(src+4, dst+2): 226 Mb/s
[   70.537944] Strings selftest: memcpy(src+4, dst+3): 227 Mb/s
[   70.560565] Strings selftest: memcpy(src+4, dst+4): 236 Mb/s
[   70.583914] Strings selftest: memcpy(src+4, dst+5): 225 Mb/s
[   70.607209] Strings selftest: memcpy(src+4, dst+6): 227 Mb/s
[   70.630602] Strings selftest: memcpy(src+4, dst+7): 225 Mb/s
[   70.653941] Strings selftest: memcpy(src+5, dst+0): 226 Mb/s
[   70.677238] Strings selftest: memcpy(src+5, dst+1): 227 Mb/s
[   70.700576] Strings selftest: memcpy(src+5, dst+2): 226 Mb/s
[   70.723944] Strings selftest: memcpy(src+5, dst+3): 225 Mb/s
[   70.747327] Strings selftest: memcpy(src+5, dst+4): 226 Mb/s
[   70.769949] Strings selftest: memcpy(src+5, dst+5): 236 Mb/s
[   70.793287] Strings selftest: memcpy(src+5, dst+6): 226 Mb/s
[   70.816585] Strings selftest: memcpy(src+5, dst+7): 227 Mb/s
[   70.839973] Strings selftest: memcpy(src+6, dst+0): 225 Mb/s
[   70.863334] Strings selftest: memcpy(src+6, dst+1): 225 Mb/s
[   70.886604] Strings selftest: memcpy(src+6, dst+2): 227 Mb/s
[   70.909953] Strings selftest: memcpy(src+6, dst+3): 226 Mb/s
[   70.933301] Strings selftest: memcpy(src+6, dst+4): 225 Mb/s
[   70.956558] Strings selftest: memcpy(src+6, dst+5): 227 Mb/s
[   70.979147] Strings selftest: memcpy(src+6, dst+6): 236 Mb/s
[   71.002507] Strings selftest: memcpy(src+6, dst+7): 225 Mb/s
[   71.025827] Strings selftest: memcpy(src+7, dst+0): 226 Mb/s
[   71.049101] Strings selftest: memcpy(src+7, dst+1): 227 Mb/s
[   71.072444] Strings selftest: memcpy(src+7, dst+2): 226 Mb/s
[   71.095817] Strings selftest: memcpy(src+7, dst+3): 225 Mb/s
[   71.119118] Strings selftest: memcpy(src+7, dst+4): 227 Mb/s
[   71.142458] Strings selftest: memcpy(src+7, dst+5): 226 Mb/s
[   71.165795] Strings selftest: memcpy(src+7, dst+6): 226 Mb/s
[   71.188308] Strings selftest: memcpy(src+7, dst+7): 237 Mb/s
[   71.194013] Strings selftest: testing with size: 4194304
[   71.217246] Strings selftest: memset(dst+0): 226 Mb/s
[   71.238157] Strings selftest: memset(dst+1): 252 Mb/s
[   71.258795] Strings selftest: memset(dst+2): 257 Mb/s
[   71.279346] Strings selftest: memset(dst+3): 257 Mb/s
[   71.299862] Strings selftest: memset(dst+4): 258 Mb/s
[   71.320378] Strings selftest: memset(dst+5): 258 Mb/s
[   71.340886] Strings selftest: memset(dst+6): 258 Mb/s
[   71.361419] Strings selftest: memset(dst+7): 258 Mb/s
[   71.366475] String selftests succeeded
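A side note on how the Mb/s figures are derived: the `printk()` calls in the kernel code above compute `PG_SIZE * (1000000000l / 1048576l) / (t1-t0)`, and the inner integer division truncates the scale factor from about 953.67 to 953. The sketch below reproduces that arithmetic next to a variant that multiplies first. The function names are made up for illustration, and the second variant assumes `bytes * 1000000000` fits in 64 bits (true for buffers up to roughly 18 GB).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same integer arithmetic as the printk() calls: the scale factor
 * 1000000000 / 1048576 is truncated to 953 before multiplying.
 */
static uint64_t mib_per_sec_truncated(uint64_t bytes, uint64_t ns)
{
    assert(ns > 0);
    return bytes * (1000000000ULL / 1048576ULL) / ns;
}

/*
 * Multiply before dividing so the fractional part of the scale
 * factor is not lost. Assumes bytes * 1000000000 fits in 64 bits.
 */
static uint64_t mib_per_sec(uint64_t bytes, uint64_t ns)
{
    assert(ns > 0);
    return bytes * 1000000000ULL / 1048576ULL / ns;
}
```

The truncation underestimates the result by about 0.07%, so it can shift a reported value by one unit at most here and does not change the conclusion that throughput sits in the 210-260 Mb/s range.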

avpatel commented 3 years ago

Most RISC-V implementations don't support misaligned load/store in hardware, so on these implementations OpenSBI traps and emulates misaligned loads/stores using unprivileged access (MSTATUS.MPRV).

Regards, Anup
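To illustrate why trap-and-emulate is expensive, here is a rough sketch of the core of emulating one misaligned 64-bit load. This is not OpenSBI's code, and `emulate_load64_le` is a made-up name. A real handler additionally has to take the trap, decode the faulting instruction, read memory with the caller's address translation (the MPRV mechanism mentioned above), write the destination register, and return, so one load turns into many instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assemble a little-endian 64-bit value from eight single-byte loads.
 * Byte loads are always aligned, so this works at any address.
 * Illustrative only: hypothetical helper, not taken from OpenSBI.
 */
static uint64_t emulate_load64_le(const uint8_t *addr)
{
    uint64_t value = 0;

    assert(addr != NULL);

    for (size_t i = 0; i < sizeof(value); i++)
        value |= (uint64_t)addr[i] << (8 * i);

    return value;
}
```

For a buffer holding the bytes 0x00 through 0x08, calling this with the address of the second byte yields 0x0807060504030201.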

teknoraver commented 3 years ago

@avpatel I'm doing aligned accesses. Unaligned accesses are more than 20x slower:

size:           1024 Mb
read size:      64 bit
unalignment:    1 byte
elapsed time:   93.46 sec
throughput:     10.96 Mb/s

avpatel commented 3 years ago

@teknoraver I made a similar observation in the past on Raspi[1|2] boards. This may also be related to slower DRAM (i.e. LPDDR) or some other system configuration of low-cost boards.

BTW, with the SBI v0.3 spec we now have SBI PMU firmware events, which can be used to count how many misaligned loads/stores, remote TLB flushes, etc. were handled by OpenSBI.

Regards, Anup

geertu commented 3 years ago

@teknoraver

> Those are the same numbers that I measured while writing the C memcpy. I made this kernel code:

The tool I've been using for the past 20+ years is at https://gist.github.com/geertu/f86486959f68b2337bd395d41a5d832c

read: 255 MiB/s write: 294 MiB/s copy: 18 MiB/s

I'm a bit surprised by the copy performance, which is much less than what you got. Usually it lies in the range of 50-100% of the minimum of read and write performance.
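The rule of thumb above can be written down as a small check. This is only a sketch of the heuristic as stated in this comment, with a made-up function name. With read at 255 MiB/s and write at 294 MiB/s, the expected copy range is 127.5 to 255 MiB/s, and the measured 18 MiB/s is about 7% of the lower of the two.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Heuristic from the comment above: copy throughput usually lands
 * between 50% and 100% of min(read, write). All values in MiB/s.
 * Hypothetical helper for illustration.
 */
static bool copy_in_expected_range(double read_mibs, double write_mibs,
                                   double copy_mibs)
{
    assert(read_mibs > 0.0 && write_mibs > 0.0);

    double limit = read_mibs < write_mibs ? read_mibs : write_mibs;

    return copy_mibs >= 0.5 * limit && copy_mibs <= limit;
}
```

Calling `copy_in_expected_range(255.0, 294.0, 18.0)` returns false, which matches the surprise expressed here.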

pdp7 commented 3 years ago

@davidlt any suggestions on how we could improve performance?

davidlt commented 3 years ago

I don't know what's causing this, or whether it is a HW bottleneck.

strlcat commented 1 year ago

Hi,

Things are now much better on the VisionFive 2's JH7110:

user@starfive:~$ ./unalign_check -r1
size:       100 Mb
read size:  8 bit
unalignment:    0 byte
elapsed time:   0.29 sec
throughput: 340.97 Mb/s
user@starfive:~$ ./unalign_check -r8
size:       100 Mb
read size:  64 bit
unalignment:    0 byte
elapsed time:   0.05 sec
throughput: 2076.48 Mb/s
user@starfive:~$ ./unalign_check -w1
size:       100 Mb
write size: 8 bit
unalignment:    0 byte
elapsed time:   0.29 sec
throughput: 340.09 Mb/s
user@starfive:~$ ./unalign_check -w8
size:       100 Mb
write size: 64 bit
unalignment:    0 byte
elapsed time:   0.13 sec
throughput: 754.71 Mb/s

But what about the JH7100?

esmil commented 1 year ago

The version of the U74 cores on the old JH7100 has a very slow memory controller, just like the U74 cores in the FU740 chip on the Unmatched board.

On top of that, the first BeagleV Starlight boards shipped with a version of the "SecondBoot" initialization code that didn't enable some bits. On the BeagleV board, memory access can be sped up to the (still slow) level of the VisionFive V1 by updating SecondBoot.