jackaudio / jack2

jack2 codebase
GNU General Public License v2.0

Removal of cycle counters and performance on embedded systems #189

Open fl4p opened 8 years ago

fl4p commented 8 years ago

Why are you moving away from the cycle counters? clock_gettime is damn slow, as this CPU profile shows (ARM embedded system): [image: profile01]

I noticed a significant decrease in JACK's CPU load when using the hardware-provided ARM high-precision counters.

Also, I think CalcCPULoad() adds some extra overhead (it shows up in the profile too) when no client is querying the CPU load value. It should be computed on demand, or at least be controllable with a configure switch.
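A minimal sketch of the on-demand idea, assuming the per-cycle path only accumulates raw timings and the percentage is derived when a client asks for it. The names `load_stats`, `load_stats_record` and `load_stats_query` are hypothetical and are not taken from jack2; the real CalcCPULoad() may work differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator: updated every process cycle, read only on request. */
typedef struct {
    uint64_t period_usecs;  /* duration of one process cycle */
    uint64_t busy_usecs;    /* time spent processing since the last query */
    uint32_t cycles;        /* cycles recorded since the last query */
} load_stats;

/* Hot path: two additions, no division and no floating point. */
static void load_stats_record(load_stats *s, uint64_t busy_usecs)
{
    assert(s);
    s->busy_usecs += busy_usecs;
    s->cycles += 1;
}

/* Cold path: compute the average load in percent, then reset the window. */
static float load_stats_query(load_stats *s)
{
    float load = 0.0f;

    assert(s);
    if (s->cycles > 0 && s->period_usecs > 0) {
        load = 100.0f * (float)s->busy_usecs
             / ((float)s->cycles * (float)s->period_usecs);
    }
    s->busy_usecs = 0;
    s->cycles = 0;
    return load;
}
```

The audio thread would call `load_stats_record()` once per cycle, and `load_stats_query()` would run only when a client actually requests the load.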

What do you think?

sletz commented 8 years ago

On 12 Feb 2016 at 21:26, Fabian notifications@github.com wrote:

> Why are you moving away from the cycle counters? clock_gettime is damn slow, as this CPU profile shows (ARM embedded system):
>
> I noticed a significant decrease in JACK's CPU load when using hardware-provided ARM high precision counters.

Any patch to show?

> Also I think the CalcCPULoad() adds some extra overhead, if no client is querying the CPU load value. This should be on-demand, or at least with a configure-switch.
>
> What do you think?

I don't think this really makes sense.

falkTX commented 8 years ago

I think this is a better question for the JACK mailing list. Note that both jack1 and jack2 removed cycle counters as a timing option, not just jack2.

How big is this "significant decrease"?

fl4p commented 8 years ago

This is the High Precision counter code:

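#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Mapped system timer counter; stays NULL if the mapping fails. */
static volatile uint64_t *arm_hpet_ptr;

/* Map the timer registers through /dev/mem. Returns 0 on success, -1 on error. */
static int arm_hpet_init (void)
{
    int fd;
    void *st_base;

    if (-1 == (fd = open("/dev/mem", O_RDONLY))) {
        printf ("Cannot access /dev/mem (%s)\n", strerror (errno));
        return -1;
    }

    st_base = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, ARM_HPET_ST_BASE);
    if (MAP_FAILED == st_base) {
        printf ("mmap() failed (%s)\n", strerror (errno));
        close (fd);
        return -1;
    }
    close (fd); /* the mapping remains valid after the descriptor is closed */

    arm_hpet_ptr = (volatile uint64_t *)((char *)st_base + ARM_HPET_TIMER_OFFSET);

    return 0;
}

static uint64_t cycles_arm_hpet (void)
{
    static int init = 1;
    if (init) { arm_hpet_init (); init = 0; }
    if (!arm_hpet_ptr) return 0; /* timer unavailable: init failed */
    return *arm_hpet_ptr; // 1 MHz counter => 1 µs per tick
}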

Addresses for the Raspberry Pi 1 and 2:

#define BCM2708_PERI_BASE      0x20000000  // rpi1
#define BCM2709_PERI_BASE      0x3F000000  // rpi2
#define ARM_PERI_BASE          BCM2709_PERI_BASE  // choose rpi2
#define ARM_HPET_ST_BASE       (ARM_PERI_BASE + 0x3000)
#define ARM_HPET_TIMER_OFFSET  (4)
#define ARM_HPET_TIMER_RATE    1000000

cycles_arm_hpet can simply be plugged in as _jack_get_microseconds.
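As a rough illustration of what "plugging in" could look like, here is a sketch that assumes the time source is selected through a function pointer. The type `get_microseconds_fn` and the functions `microseconds_from_clock_gettime` and `select_microseconds_source` are hypothetical names for this example, not jack2 identifiers, and CLOCK_MONOTONIC is an assumed choice of clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical type for a "current time in microseconds" source. */
typedef uint64_t (*get_microseconds_fn)(void);

/* Portable fallback built on clock_gettime(CLOCK_MONOTONIC). */
static uint64_t microseconds_from_clock_gettime(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc; /* avoids an unused-variable warning when NDEBUG is set */
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Use the platform counter when one is supplied, otherwise the fallback. */
static get_microseconds_fn
select_microseconds_source(get_microseconds_fn platform_counter)
{
    return platform_counter ? platform_counter
                            : microseconds_from_clock_gettime;
}
```

With the counter code above, the call would be `select_microseconds_source(cycles_arm_hpet)`; passing a null pointer keeps the clock_gettime path.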

I benchmarked the timers a bit and noticed that gettimeofday is about twice as fast as clock_gettime:

        gettimeofday 10000000x took 2002.2 ms,  200.2 ns/call
       clock_gettime 10000000x took 4274.5 ms,  427.5 ns/call
     cycles_arm_hpet 10000000x took 1943.5 ms,  194.4 ns/call
         cycles_arm7 10000000x took  327.3 ms,   32.7 ns/call

gettimeofday is as accurate as clock_gettime on the Raspberry Pi 2. I will get back with some hopefully meaningful benchmarks on CPU load in jack2.
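For anyone who wants to repeat this kind of measurement, a minimal sketch of a per-call timing loop follows. It is not the program used for the numbers above; the helper names (`call_gettimeofday`, `call_clock_gettime`, `bench_ns_per_call`) and the use of CLOCK_MONOTONIC are assumptions made for the example, and each call goes through a function pointer, which adds the same small overhead to every timer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/* Timers under test, wrapped so they share one signature. */
static void call_gettimeofday(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
}

static void call_clock_gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
}

/* Average cost of one fn() call in nanoseconds over `iterations` calls. */
static double bench_ns_per_call(void (*fn)(void), long iterations)
{
    struct timespec start, end;
    double elapsed_ns;

    assert(fn && iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
               + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling `bench_ns_per_call(call_gettimeofday, 10000000)` and the same for `call_clock_gettime` gives figures comparable in form to the table; the absolute values depend on the kernel, the C library and whether the calls go through the vDSO.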

falkTX commented 8 years ago

forgot about this...

I did some testing of my own. The cycle counter is only useful if you run JACK on a single CPU. The counters of the individual CPUs are not in sync, and because jack2 uses SMP, the thread that calls get_cycles (and so the CPU whose counter is read) might change at any time.

I agree the cycle counter uses fewer resources, but it cannot be used on SMP systems.
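The problem can be illustrated with a small simulation. This is not a measurement of real hardware: it assumes two per-CPU counters that tick at the same rate but started at different moments, and the names `sim_offset`, `sim_read_counter` and `sim_elapsed` as well as the 500-tick offset are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_CPUS 2

/* Hypothetical start offsets: CPU 1's counter lags CPU 0's by 500 ticks. */
static const int64_t sim_offset[SIM_CPUS] = { 0, -500 };

/* Value that `cpu` would report at the given true time. */
static int64_t sim_read_counter(int cpu, int64_t true_time)
{
    assert(cpu >= 0 && cpu < SIM_CPUS);
    return true_time + sim_offset[cpu];
}

/* Elapsed ticks computed from a start read on one CPU and an end read on another. */
static int64_t sim_elapsed(int cpu_start, int64_t t_start,
                           int cpu_end, int64_t t_end)
{
    return sim_read_counter(cpu_end, t_end)
         - sim_read_counter(cpu_start, t_start);
}
```

When both reads happen on the same CPU, `sim_elapsed(0, 1000, 0, 1100)` yields the true 100 ticks. When the second read lands on the other CPU, `sim_elapsed(0, 1000, 1, 1100)` yields -400, so time appears to run backwards.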