raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

memcpy performance issue #3480

Open zoff99 opened 4 years ago

zoff99 commented 4 years ago

Describe the bug: memcpy takes a long time (see the example program). Can I do something to speed this up? Would alignment help?

To reproduce: make a && ./a

Expected behaviour: hopefully faster

Actual behaviour: takes up to 30 ms on an RPi 4

Additional context

#include <time.h>
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>

static void __utimer_start(struct timeval *tm1)
{
    gettimeofday(tm1, NULL);
}

static unsigned long long __utimer_stop(struct timeval *tm1)
{
    struct timeval tm2;
    gettimeofday(&tm2, NULL);
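    /* Work in microseconds first so that a smaller tv_usec in tm2 cannot skew the millisecond result. */
    long long us = 1000000LL * (tm2.tv_sec - tm1->tv_sec) + (tm2.tv_usec - tm1->tv_usec);
    unsigned long long t = (unsigned long long)(us / 1000);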

    printf("%llu ms\n", t);
    return t;
}

int main()
{
    int size1 = 30000000;
    int size2 = 8*1024*1024;

    char *buf = calloc(1, size1);
    char *buf2 = calloc(1, size2);

    struct timeval tm_01_007;
    __utimer_start(&tm_01_007);

    memcpy(buf, buf2, size2);
    memcpy(buf, buf2, size2);
    memcpy(buf, buf2, size2);

    long long timspan_in_ms;
    timspan_in_ms = __utimer_stop(&tm_01_007);

    free(buf);
    free(buf2);

    return 0;
}

Save as a.c.

JamesH65 commented 4 years ago

I believe that these functions are already highly optimized. Why do you think they can be improved?

pelwell commented 4 years ago

The default malloc alignment ought to be sensible, but feel free to align it further (256 bytes should be more than enough).

  1. Have you set arm_freq in config.txt?
  2. Have you enabled the Performance governor? If not, performance will be reduced. Alternatively, add force_turbo=1 to config.txt.
  3. What displays are attached (and with what resolution and frame rate)?
  4. Is there any other activity occurring at the time?
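
The alignment suggestion above could be tried with something like the following. This is only a sketch: `alloc_aligned_zeroed` is a hypothetical helper, not something from the original program, and it assumes a C11 library that provides `aligned_alloc`. The size is rounded up to a multiple of the alignment because C11 requires that.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the original a.c): allocate at least
 * `size` zeroed bytes aligned to `alignment`, which must be a power of two.
 * Returns NULL on failure; release the result with free(). */
static void *alloc_aligned_zeroed(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* C11 aligned_alloc wants the size to be a multiple of the alignment. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;

    void *p = aligned_alloc(alignment, rounded);
    if (p != NULL)
        memset(p, 0, rounded); /* zero the block, as calloc did in a.c */
    return p;
}
```

In a.c this would replace the two calloc calls, for example `char *buf = alloc_aligned_zeroed(256, size1);`.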

zoff99 commented 4 years ago

@JamesH65 because it seems too slow. An rpi4 should be able to memcpy a few GB/s.

JamesH65 commented 4 years ago

Are you sure that your timer functions are accurate? I'm not familiar with __utimer_start, and Google seems to think it is an Android function. It might be worth trying the approaches on this page: https://stackoverflow.com/questions/6749621/how-to-create-a-high-resolution-timer-in-linux-to-measure-program-performance

pelwell commented 4 years ago

__utimer_start is simply a wrapper around gettimeofday, which is a POSIX function.

I think it's important to establish the clock speeds, and any memory bandwidth taken up by the display, etc.
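
For anyone who wants a higher-resolution timer than the gettimeofday wrapper, one option is POSIX `clock_gettime` with `CLOCK_MONOTONIC`, which is also unaffected by wall-clock adjustments. The helper names `monotonic_ns` and `elapsed_ms` below are made up for illustration; this is a sketch, not a drop-in replacement for the wrapper in a.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current CLOCK_MONOTONIC reading in nanoseconds. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* keep the variable "used" when asserts are compiled out */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Elapsed time between two monotonic_ns() readings, in milliseconds. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}
```

Take one reading before the memcpy calls and one after, then pass both to `elapsed_ms`.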

zoff99 commented 4 years ago

The function is included in the source attached here; it's just a wrapper. With a timespan of 30 ms, gettimeofday is accurate enough.

Can somebody try the attached source on their RPi and just post the result?

JamesH65 commented 4 years ago

Tried it, increased the number of copies to 10, and it completed in 34ms, so 3.4ms per copy. With 20 it was 60ms, so 3ms per copy. That's at 600MHz on a Pi4.

pelwell commented 4 years ago

With performance governor enabled (i.e. with the ARMs at 1.5GHz) I get a fairly consistent 3.3GB/s bandwidth. The powersave governor (600MHz) drops performance to 1.46GB/s. On-demand will yield a result somewhere between the two.
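
A figure like this can be reproduced with a small variation on the original program. The sketch below is an assumption about how one might measure it, not the code actually used for the numbers above: `memcpy_gb_per_s` is a hypothetical helper, both buffers are written once before timing so that page faults on freshly allocated memory are not counted, and the empty `__asm__` statement is a GCC/Clang extension used as a compiler barrier. The result counts each copied byte once, i.e. the payload size, not reads plus writes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy `size` bytes `reps` times and return the rate
 * in GB/s (1 GB = 1e9 bytes of payload), or -1.0 on failure. */
static double memcpy_gb_per_s(size_t size, int reps)
{
    assert(size > 0 && reps > 0);

    char *src = malloc(size);
    char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page and do one untimed copy as a warm-up. */
    memset(src, 1, size);
    memset(dst, 2, size);
    memcpy(dst, src, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, size);
        /* Compiler barrier (GCC/Clang): keeps each copy from being elided. */
        __asm__ volatile("" ::: "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    int copied_ok = (dst[size - 1] == 1);

    free(src);
    free(dst);

    if (!copied_ok || secs <= 0.0)
        return -1.0;
    return (double)size * (double)reps / secs / 1e9;
}
```

Calling `memcpy_gb_per_s(8 * 1024 * 1024, 20)` matches the buffer size and repeat count discussed in this thread.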

JamesH65 commented 4 years ago

Which is about what I was seeing (approx 2.5GBits/s)

pelwell commented 4 years ago

GB/s is Giga Bytes per second.

N.B. My results were with the screen blanked. An active display will eat into that bandwidth (1080p 60-70 drops the bandwidth to about 3.1GB/s).

JamesH65 commented 4 years ago

Apologies, I've used the wrong units - that should read 2.5GBytes/s.
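
The conversion from per-copy time to bandwidth is simple arithmetic. Assuming each copy moves the 8 MiB (`8*1024*1024` bytes) used in a.c, 3.4 ms per copy works out to roughly 2.47 GB/s and 3 ms per copy to roughly 2.80 GB/s. The function name `gb_per_s` below is just for illustration.

```c
#include <assert.h>

/* Illustrative helper: bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes`
 * copied in `ms` milliseconds. */
static double gb_per_s(double bytes, double ms)
{
    assert(ms > 0.0);
    return bytes / (ms / 1000.0) / 1e9;
}
```

For example, `gb_per_s(8.0 * 1024 * 1024, 3.4)` gives the first figure above.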

pelwell commented 4 years ago

N.B.2. My figures are for reading and writing, so you could naively double the results. In practice, read and write speeds are different, so two separate figures are more useful.
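
One rough way to get two separate figures is to time a write-only pass (memset) and a read-only pass (summing the buffer). This is a sketch under several assumptions: the helper names are hypothetical, the plain summing loop may not be fast enough to saturate the memory bus, and the empty `__asm__` barrier is a GCC/Clang extension. Treat the numbers it produces as indicative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic time in seconds. */
static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Write-only pass: fill the buffer `reps` times, return GB/s written. */
static double write_gb_per_s(uint64_t *buf, size_t words, int reps)
{
    assert(buf != NULL && words > 0 && reps > 0);
    double t0 = now_s();
    for (int i = 0; i < reps; i++) {
        memset(buf, i + 1, words * sizeof *buf);
        __asm__ volatile("" ::: "memory"); /* compiler barrier (GCC/Clang) */
    }
    double secs = now_s() - t0;
    return (double)(words * sizeof *buf) * reps / secs / 1e9;
}

/* Read-only pass: sum the buffer `reps` times, return GB/s read.
 * The sum is stored in *checksum so the loads cannot be discarded. */
static double read_gb_per_s(const uint64_t *buf, size_t words, int reps,
                            uint64_t *checksum)
{
    assert(buf != NULL && checksum != NULL && words > 0 && reps > 0);
    uint64_t sum = 0;
    double t0 = now_s();
    for (int i = 0; i < reps; i++) {
        for (size_t j = 0; j < words; j++)
            sum += buf[j];
        __asm__ volatile("" ::: "memory"); /* force a re-read on the next pass */
    }
    double secs = now_s() - t0;
    *checksum = sum;
    return (double)(words * sizeof *buf) * reps / secs / 1e9;
}
```

Use a buffer much larger than the caches (the 8 MiB from a.c is a reasonable start) so that the passes actually reach RAM.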

zoff99 commented 4 years ago

Thanks, guys, for your results. I will try changing to the performance governor.

pelwell commented 4 years ago

$ sudo sh -c "echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor"

should do it.
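
To confirm from inside a benchmark which governor is active, the same sysfs file can be read back. The helper below is a hypothetical sketch (`read_sysfs_line` is not an existing API); it simply reads the first line of a file and strips the trailing newline.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of `path` into `out`.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_sysfs_line(const char *path, char *out, size_t out_size)
{
    assert(path != NULL && out != NULL && out_size > 0);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *ok = fgets(out, (int)out_size, f);
    fclose(f);
    if (ok == NULL)
        return -1;

    out[strcspn(out, "\n")] = '\0'; /* drop the trailing newline, if any */
    return 0;
}
```

Pass it the `/sys/devices/system/cpu/cpufreq/policy0/scaling_governor` path from the command above and print the result before starting the timed copies.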

AllNamesRTaken commented 3 years ago

A year later, since this has not been closed: I found this while researching why memory bandwidth is resulting in bad WebGL/OpenGL performance for large windows, i.e. performance scales down dramatically as the window grows, even though a 1080p swap should only need about 0.5 GB/s and the memory bandwidth should be 4-5 GB/s.

Maybe you could boot your Pi in console-only mode and rerun the test?
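
The 0.5 GB/s estimate for a 1080p swap can be checked with a quick calculation. The sketch assumes 4 bytes per pixel and 60 frames per second, neither of which is stated above, and counts one full-frame transfer per frame; `frame_traffic_gb_per_s` is an illustrative name.

```c
#include <assert.h>

/* Illustrative helper: memory traffic in GB/s (1 GB = 1e9 bytes) for one
 * full-frame transfer per frame at the given size and frame rate. */
static double frame_traffic_gb_per_s(int width, int height,
                                     int bytes_per_pixel, int fps)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0 && fps > 0);
    return (double)width * height * bytes_per_pixel * fps / 1e9;
}
```

With 1920 x 1080, 4 bytes per pixel and 60 fps this gives just under 0.5 GB/s.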