t-crest / patmos

Patmos is a time-predictable VLIW processor, and the processor for the T-CREST project
http://patmos.compute.dtu.dk
BSD 2-Clause "Simplified" License

Strange timing behavior in `patemu` when executing same code on multiple cores #105

Closed michael-platzer closed 2 years ago

michael-platzer commented 2 years ago

I am observing a behavior that I cannot explain when executing the same code on multiple cores in `patemu`. I expected that more cores executing the same code in parallel would slow down the execution on each core due to the increased number of memory accesses. However, the opposite seems to happen: code executed on just one core takes longer than the same code run in parallel on multiple cores.

Steps to reproduce (starting from a clean build of the most recent version, i.e., commit 4e8f8d9):

  1. Edit `patmos/hardware/config/altde2-115.xml` and uncomment lines 7 and 8 as follows to build a multicore system (note that Argo is not used since it produces an error, see #104):
    <!-- Default is single core -->
    <pipeline dual="false" />
    <cores count="8"/>
    <!--<CmpDevs>
    <CmpDev name="Argo" />
    </CmpDevs> -->
  2. After rebuilding with `misc/build.sh`, save the following code to a file called `test.c` (sorry for the huge file, I tried to keep it as short as possible while still being able to reproduce the behavior):
    
    #include <stdio.h>
    #include <stdint.h>
    #include <machine/patmos.h>
    #include <machine/rtc.h>

    #include "libcorethread/corethread.h"

    void test(uint8_t *data)
    {
        int round;
        for (round = 1; 1; round++) {
            // rotate the 16-byte block left by one byte
            {
                uint8_t tmp = data[0];
                int i;
                for (i = 0; i < 15; i++) {
                    data[i] = data[i+1];
                }
                data[15] = tmp;
            }

            if (round == 10)
                break;

            // XOR-mix each 4-byte column of the block
            {
                int i, j;
                for (i = 0; i < 4; i++) {
                    uint8_t *col = data + (i * 4);
                    uint8_t copy[4];
                    for (j = 0; j < 4; j++) {
                        copy[j] = col[j];
                    }
                    col[0] = copy[3] ^ copy[2] ^ copy[1];
                    col[1] = copy[0] ^ copy[3] ^ copy[2];
                    col[2] = copy[1] ^ copy[0] ^ copy[3];
                    col[3] = copy[2] ^ copy[1] ^ copy[0];
                }
            }
        }
    }

    static uint8_t test_data[MAX_CPUS * 16];

    volatile _UNCACHED static unsigned long long t_start[MAX_CPUS];
    volatile _UNCACHED static unsigned long long t_end  [MAX_CPUS];

    void work(void *arg)
    {
        int cpuid = get_cpuid();
        unsigned long long start, end;

        // have all CPUs start roughly at the same time
        while ((start = get_cpu_cycles()) < 1000000)
            ;

        test(((uint8_t *)test_data) + (cpuid * 16));

        end = get_cpu_cycles();

        // wait for other CPUs to finish
        while (get_cpu_cycles() < end + 1000000)
            ;

        t_start[cpuid] = start;
        t_end  [cpuid] = end;
    }

    int main()
    {
        int i, core_cnt = get_cpucnt();
        if (core_cnt > MAX_CPUS) {
            core_cnt = MAX_CPUS;
        }

        printf("Starting threads on %d CPUs\n", core_cnt);

        for (i = 1; i < core_cnt; i++) {
            corethread_create(i, &work, NULL);
        }

        work(NULL);

        int ret;
        for (i = 1; i < core_cnt; i++) {
            corethread_join(i, (void *)&ret);
        }

        printf("Threads joined\n");
        for (i = 0; i < core_cnt; i++) {
            printf(" start time: %llu, duration: %llu\n", t_start[i], t_end[i] - t_start[i]);
        }
        return 0;
    }

3. Run the following commands to compile and execute using `patemu`:

    $ patmos-clang -O3 test.c libcorethread/corethread.c -DMAX_CPUS=1
    $ patemu a.out
    Starting threads on 1 CPUs
    Threads joined
     start time: 1000005, duration: 108301
    $ patmos-clang -O3 test.c libcorethread/corethread.c -DMAX_CPUS=2
    $ patemu a.out
    Starting threads on 2 CPUs
    Threads joined
     start time: 1000000, duration: 92199
     start time: 1000000, duration: 92220
    $ patmos-clang -O3 test.c libcorethread/corethread.c -DMAX_CPUS=4
    $ patemu a.out
    Starting threads on 4 CPUs
    Threads joined
     start time: 1000000, duration: 92199
     start time: 1000000, duration: 92220
     start time: 1000005, duration: 92236
     start time: 1000002, duration: 92260



For the last command, four CPUs execute the same code. I would expect the execution time for that code to be higher than, or at least equal to, the time with just one CPU executing it. However, the execution time on a single CPU is significantly higher than on multiple CPUs. Note that the execution time does not vary as more CPUs are added; there is only a large difference between one and multiple CPUs.

Disclaimer: I assume that the value returned by `get_cpucnt()` is the same when called simultaneously by two cores; please correct me if this is wrong.
schoeberl commented 2 years ago

We use TDM-based memory arbitration so that there is no timing interference between cores. You are using the 4-core version for all tests, right? Therefore, it should not matter whether you run it on one or on four cores. If you want to see a difference, you could use a round-robin memory arbiter.
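
For intuition, here is a rough, purely illustrative model of the difference between the two arbitration schemes: under TDM a core waits for its own fixed slot, which comes around after the same number of cycles no matter what the other cores do, whereas under round-robin the wait grows with the number of cores actually competing for memory. The slot length and core count below are made-up values, not the actual parameters of the Patmos memory arbiter.

    #include <stdio.h>

    /* Illustrative model only: the TDM worst-case wait is fixed by the total
     * number of cores in the system, independent of how many are active. */
    static unsigned tdm_worst_case_wait(unsigned total_cores, unsigned slot_len)
    {
        return total_cores * slot_len;
    }

    /* The round-robin worst-case wait depends on how many cores are actually
     * competing for memory at the same time. */
    static unsigned rr_worst_case_wait(unsigned active_cores, unsigned slot_len)
    {
        return active_cores * slot_len;
    }

    int main(void)
    {
        unsigned slot_len = 4; /* hypothetical slot length in cycles */
        unsigned active;
        for (active = 1; active <= 4; active++) {
            printf("active cores: %u  TDM wait: %u  RR wait: %u\n",
                   active, tdm_worst_case_wait(4, slot_len),
                   rr_worst_case_wait(active, slot_len));
        }
        return 0;
    }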

Why the first run is a bit slower, I cannot explain.

michael-platzer commented 2 years ago

Ok, so I would only see timing differences if I reconfigure the system to have a different number of cores and then rebuild everything? That explains why the timing is the same for cores 0 and 1 between runs 2 and 3.

However, that means that core 0 should always execute the code in the same amount of cycles, even if no other cores are active, which apparently is not the case.

schoeberl commented 2 years ago

Yes, to the first question. To use a single core at max memory bandwidth you need to build a single core system.

As stated above, I have no idea why the single-threaded version of your benchmark is slightly slower. I thought about your code, but could not figure out why this is so.

michael-platzer commented 2 years ago

I have been able to narrow this behavior down a little further. It is caused by compiler optimizations: with `-O3`, and with `MAX_CPUS` being a compile-time constant, the compiler can remove any calls to `corethread_create()`. The effect disappears when using a volatile integer instead of a macro for the CPU count, e.g.:

static volatile int max_cpus = 1;

With this change the cycle count for CPU 0 is the same irrespective of the number of cores running the code, even for just one CPU. I still do not fully understand what is going on, but I am closing this issue since the source of the behavior apparently lies in the software and not in the hardware.
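
For reference, a minimal sketch of how `main()` could look with this change (the rest of the program stays as above; the static arrays keep their compile-time size, so only the bound check reads the volatile variable and the compiler can no longer prove the thread-creation loop dead):

    static volatile int max_cpus = 1; /* runtime-read CPU limit instead of -DMAX_CPUS */

    int main()
    {
        int i, core_cnt = get_cpucnt();
        if (core_cnt > max_cpus) { /* volatile read: not foldable at compile time */
            core_cnt = max_cpus;
        }

        printf("Starting threads on %d CPUs\n", core_cnt);

        /* with core_cnt unknown at compile time, these calls cannot be removed */
        for (i = 1; i < core_cnt; i++) {
            corethread_create(i, &work, NULL);
        }

        work(NULL);

        int ret;
        for (i = 1; i < core_cnt; i++) {
            corethread_join(i, (void *)&ret);
        }

        printf("Threads joined\n");
        for (i = 0; i < core_cnt; i++) {
            printf(" start time: %llu, duration: %llu\n", t_start[i], t_end[i] - t_start[i]);
        }
        return 0;
    }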

schoeberl commented 2 years ago

Thanks for looking further into it. However, I still don't understand this, as `corethread_create()` should have no influence on the time measurement.