A question about the implementation of intel-cmt-cat

Hi, I have recently been trying intel-cmt-cat to pin some data in the LLC, and I'd like to ask a question about how CAT is implemented.

The LLC on my CPU socket0 has 11 ways (i.e. the default mask is 0x7ff).

I first set COS0 and COS1 on socket0: sudo pqos -e "llc@0:0=0x7fc;llc@0:1=0x003"

Then I associate core 0 with COS1 and leave the other cores on COS0 (the default): sudo pqos -a "llc:1=0"

The resulting configuration is:

$ sudo pqos -s
NOTE:  Mixed use of MSR and kernel interfaces to manage
       CAT or CMT & MBM may lead to unexpected behavior.
L3CA/MBA COS definitions for Socket 0:
    L3CA COS0 => MASK 0x7fc
    L3CA COS1 => MASK 0x3
    L3CA COS2 => MASK 0x7ff
    L3CA COS3 => MASK 0x7ff
    L3CA COS4 => MASK 0x7ff
    L3CA COS5 => MASK 0x7ff
    L3CA COS6 => MASK 0x7ff
    L3CA COS7 => MASK 0x7ff
    L3CA COS8 => MASK 0x7ff
    L3CA COS9 => MASK 0x7ff
    L3CA COS10 => MASK 0x7ff
    L3CA COS11 => MASK 0x7ff
    L3CA COS12 => MASK 0x7ff
    L3CA COS13 => MASK 0x7ff
    L3CA COS14 => MASK 0x7ff
    L3CA COS15 => MASK 0x7ff
    MBA COS0 => 100% available
    MBA COS1 => 100% available
    MBA COS2 => 100% available
    MBA COS3 => 100% available
    MBA COS4 => 100% available
    MBA COS5 => 100% available
    MBA COS6 => 100% available
    MBA COS7 => 100% available
L3CA/MBA COS definitions for Socket 1:
    L3CA COS0 => MASK 0x7ff
    L3CA COS1 => MASK 0x7ff
    L3CA COS2 => MASK 0x7ff
    L3CA COS3 => MASK 0x7ff
    L3CA COS4 => MASK 0x7ff
    L3CA COS5 => MASK 0x7ff
    L3CA COS6 => MASK 0x7ff
    L3CA COS7 => MASK 0x7ff
    L3CA COS8 => MASK 0x7ff
    L3CA COS9 => MASK 0x7ff
    L3CA COS10 => MASK 0x7ff
    L3CA COS11 => MASK 0x7ff
    L3CA COS12 => MASK 0x7ff
    L3CA COS13 => MASK 0x7ff
    L3CA COS14 => MASK 0x7ff
    L3CA COS15 => MASK 0x7ff
    MBA COS0 => 100% available
    MBA COS1 => 100% available
    MBA COS2 => 100% available
    MBA COS3 => 100% available
    MBA COS4 => 100% available
    MBA COS5 => 100% available
    MBA COS6 => 100% available
    MBA COS7 => 100% available
Core information for socket 0:
    Core 0, L2ID 0, L3ID 0 => COS1, RMID0
    Core 2, L2ID 4, L3ID 0 => COS0, RMID0
    Core 4, L2ID 1, L3ID 0 => COS0, RMID0
    Core 6, L2ID 3, L3ID 0 => COS0, RMID0
    Core 8, L2ID 2, L3ID 0 => COS0, RMID0
    Core 10, L2ID 11, L3ID 0 => COS0, RMID0
    Core 12, L2ID 8, L3ID 0 => COS0, RMID0
    Core 14, L2ID 10, L3ID 0 => COS0, RMID0
    Core 16, L2ID 9, L3ID 0 => COS0, RMID0
    Core 18, L2ID 20, L3ID 0 => COS0, RMID0
    Core 20, L2ID 16, L3ID 0 => COS0, RMID0
    Core 22, L2ID 19, L3ID 0 => COS0, RMID0
    Core 24, L2ID 17, L3ID 0 => COS0, RMID0
    Core 26, L2ID 18, L3ID 0 => COS0, RMID0
    Core 28, L2ID 24, L3ID 0 => COS0, RMID0
    Core 30, L2ID 27, L3ID 0 => COS0, RMID0
    Core 32, L2ID 25, L3ID 0 => COS0, RMID0
    Core 34, L2ID 26, L3ID 0 => COS0, RMID0
    Core 36, L2ID 0, L3ID 0 => COS0, RMID0
    Core 38, L2ID 4, L3ID 0 => COS0, RMID0
    Core 40, L2ID 1, L3ID 0 => COS0, RMID0
    Core 42, L2ID 3, L3ID 0 => COS0, RMID0
    Core 44, L2ID 2, L3ID 0 => COS0, RMID0
    Core 46, L2ID 11, L3ID 0 => COS0, RMID0
    Core 48, L2ID 8, L3ID 0 => COS0, RMID0
    Core 50, L2ID 10, L3ID 0 => COS0, RMID0
    Core 52, L2ID 9, L3ID 0 => COS0, RMID0
    Core 54, L2ID 20, L3ID 0 => COS0, RMID0
    Core 56, L2ID 16, L3ID 0 => COS0, RMID0
    Core 58, L2ID 19, L3ID 0 => COS0, RMID0
    Core 60, L2ID 17, L3ID 0 => COS0, RMID0
    Core 62, L2ID 18, L3ID 0 => COS0, RMID0
    Core 64, L2ID 24, L3ID 0 => COS0, RMID0
    Core 66, L2ID 27, L3ID 0 => COS0, RMID0
    Core 68, L2ID 25, L3ID 0 => COS0, RMID0
    Core 70, L2ID 26, L3ID 0 => COS0, RMID0
Core information for socket 1:
    Core 1, L2ID 32, L3ID 1 => COS0, RMID0
    Core 3, L2ID 36, L3ID 1 => COS0, RMID0
    Core 5, L2ID 33, L3ID 1 => COS0, RMID0
    Core 7, L2ID 35, L3ID 1 => COS0, RMID0
    Core 9, L2ID 34, L3ID 1 => COS0, RMID0
    Core 11, L2ID 43, L3ID 1 => COS0, RMID0
    Core 13, L2ID 40, L3ID 1 => COS0, RMID0
    Core 15, L2ID 42, L3ID 1 => COS0, RMID0
    Core 17, L2ID 41, L3ID 1 => COS0, RMID0
    Core 19, L2ID 52, L3ID 1 => COS0, RMID0
    Core 21, L2ID 48, L3ID 1 => COS0, RMID0
    Core 23, L2ID 51, L3ID 1 => COS0, RMID0
    Core 25, L2ID 49, L3ID 1 => COS0, RMID0
    Core 27, L2ID 50, L3ID 1 => COS0, RMID0
    Core 29, L2ID 56, L3ID 1 => COS0, RMID0
    Core 31, L2ID 59, L3ID 1 => COS0, RMID0
    Core 33, L2ID 57, L3ID 1 => COS0, RMID0
    Core 35, L2ID 58, L3ID 1 => COS0, RMID0
    Core 37, L2ID 32, L3ID 1 => COS0, RMID0
    Core 39, L2ID 36, L3ID 1 => COS0, RMID0
    Core 41, L2ID 33, L3ID 1 => COS0, RMID0
    Core 43, L2ID 35, L3ID 1 => COS0, RMID0
    Core 45, L2ID 34, L3ID 1 => COS0, RMID0
    Core 47, L2ID 43, L3ID 1 => COS0, RMID0
    Core 49, L2ID 40, L3ID 1 => COS0, RMID0
    Core 51, L2ID 42, L3ID 1 => COS0, RMID0
    Core 53, L2ID 41, L3ID 1 => COS0, RMID0
    Core 55, L2ID 52, L3ID 1 => COS0, RMID0
    Core 57, L2ID 48, L3ID 1 => COS0, RMID0
    Core 59, L2ID 51, L3ID 1 => COS0, RMID0
    Core 61, L2ID 49, L3ID 1 => COS0, RMID0
    Core 63, L2ID 50, L3ID 1 => COS0, RMID0
    Core 65, L2ID 56, L3ID 1 => COS0, RMID0
    Core 67, L2ID 59, L3ID 1 => COS0, RMID0
    Core 69, L2ID 57, L3ID 1 => COS0, RMID0
    Core 71, L2ID 58, L3ID 1 => COS0, RMID0

I would like to know whether a task running on core 0 can access the whole LLC on socket 0, or only its 2 associated LLC ways, when it reads data that is already in the LLC (i.e. a cache hit). In other words, does the CAT policy only restrict where data is placed and evicted on a cache miss, without affecting which ways can be read?

Or does the CAT policy isolate the LLC for both reads and writes?

Looking forward to your reply. Thanks a lot!

CAT does not affect reads (cache hits). Data already allocated in cache can be read by any core, regardless of CAT policy. So the task running on core 0 can access all LLC ways. Your CAT policy will only take effect on a cache miss. When the newly requested data is allocated in cache, it will be allocated in the assigned LLC ways (2 ways for core 0 in this case).

This is something to consider when experimenting with CAT. If the task is already running when you configure your CAT policy, it may have data already allocated in cache and CAT may not have the expected effect. To ensure CAT has full effect, you should configure CAT before starting the task on core 0 so all data gets allocated in the assigned LLC ways.

Thanks for your reply.

I have a new question: does CAT have an effect on the mapping between cache addresses and memory addresses?

Here is an example for my question.

Without my CAT policy, the task running on core 0 may access memory data that maps to the other 9 ways (i.e. 0x7fc).

With my CAT policy, I allocate 2 ways (i.e. 0x003) to core 0. What happens when the task running on core 0 accesses memory data that maps to the other 9 ways?

Can these 2 ways be allocated for that data? Or will that data always cause a cache miss (i.e. it can't be allocated in the 2 LLC ways)?

Does CAT have an effect on the mapping between cache addresses and memory addresses?

With my CAT policy, I allocate 2 ways (i.e. 0x003) to core 0. What happens when the task running on core 0 accesses memory data that maps to the other 9 ways?

If the data is already in the cache, CAT will have no impact. Cores can always access data in any cache way. If the data is not in cache, it will be allocated in the 2 ways (0x3) assigned to core 0.

Can these 2 ways be allocated for that data? Or will that data always cause a cache miss (i.e. it can't be allocated in the 2 LLC ways)?

As mentioned above, all data in the cache can be accessed by any core (no cache miss). In order to re-allocate this data in ways 0x3, it should be flushed / evicted from the cache and re-loaded by core 0 into the correct ways.

Thanks for your reply. I have understood what you explained above.

Consider a real-world scenario: I'm trying to pin some frequently accessed data in the LLC (i.e. keep it in the LLC for as long as possible) to improve performance. I have written a demo to see the effect of the CAT policy.

First, I check my LLC metadata.

$ cat /sys/devices/system/cpu/cpu0/cache/index3/number_of_sets
36864
$ cat /sys/devices/system/cpu/cpu0/cache/index3/ways_of_associativity
11
$ cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
64

Then, I set my CAT policy.

sudo pqos -e "llc@0:0=0x7fc;llc@0:1=0x003"
sudo pqos -a "llc:1=0"

Finally, I run my test code:

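#include <iostream>
#include <vector>
#include <cstdlib>      // exit, EXIT_FAILURE
#include <ctime>        // clock
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <unistd.h>     // fork, getpid, sleep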

std::vector<int> vec;           // data need to be pinned
const int vecSize = 100000;     // data size 100000*4B
const int eps = 50;             // test read times

#pragma GCC push_options        // set -O0 to ensure no compiler optimization on loop
#pragma GCC optimize ("O0")
void load() {
    int a;
    for(auto v:vec) {
        a = v;
    }
}
void test() {
    int a;
    for(int i = 0; i < eps; i++) {  // test
        double start= clock();
        for(auto v:vec) {
            a = v;
        }
        double end = clock();
        std::cout << "No." << i << " total clock number: " << end - start << "\n";  // print the clock number
    }
}
#pragma GCC pop_options

int main() {
    // data initialization
    vec.clear();
    vec.reserve(vecSize);
    for(int i = 0; i < vecSize; i++) {
        vec.emplace_back(i * 100 + 123456);
    }

    int pid = fork();
    if(pid < 0) {
        std::cout << "create process fail!\n";
        exit(EXIT_FAILURE);
    }
    if(pid == 0) {  // child process load data to LLC
        std::cout << "child process pid = " << getpid() << "\n";
        // bind task to core 0 on socket0
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
        std::cout << "child process begin to load\n";
        for(int i = 0; i < 50; i++) {   // load 50 times to make fully loading
            load();
        }

        while(1) { }    // occupy core0's allocated LLC ways
    }
    else {  // parent process perform test
        // bind task to core 2 on socket0
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        sleep(1);       // wait for child process to load data

        std::cout << "parent process begin to test\n";
        test();
    }

    return 0;
}

However, the result does not match my expectation. It seems that the data is not pinned to the 2 allocated ways (i.e. 0x003) of the LLC.

$ ./test
child process pid = 1885537
child process begin to load
parent process begin to test
No.0 total clock number: 2916
No.1 total clock number: 2872
No.2 total clock number: 2879
No.3 total clock number: 2869
No.4 total clock number: 2174
No.5 total clock number: 1197
No.6 total clock number: 1194
No.7 total clock number: 1195
No.8 total clock number: 1193
No.9 total clock number: 1198
No.10 total clock number: 1194
No.11 total clock number: 1196
No.12 total clock number: 1123
No.13 total clock number: 917
No.14 total clock number: 858
No.15 total clock number: 896
No.16 total clock number: 863
No.17 total clock number: 1030
No.18 total clock number: 928
No.19 total clock number: 927
No.20 total clock number: 902
No.21 total clock number: 1009
No.22 total clock number: 931
No.23 total clock number: 988
No.24 total clock number: 926
No.25 total clock number: 903
No.26 total clock number: 862
No.27 total clock number: 798
No.28 total clock number: 790
No.29 total clock number: 823
No.30 total clock number: 926
No.31 total clock number: 830
No.32 total clock number: 799
No.33 total clock number: 796
No.34 total clock number: 910
No.35 total clock number: 998
No.36 total clock number: 958
No.37 total clock number: 1013
No.38 total clock number: 980
No.39 total clock number: 1018
No.40 total clock number: 981
No.41 total clock number: 1012
No.42 total clock number: 1043
No.43 total clock number: 1068
No.44 total clock number: 976
No.45 total clock number: 1016
No.46 total clock number: 1044
No.47 total clock number: 1038
No.48 total clock number: 871
No.49 total clock number: 904

For a fair comparison, I also measured the clock count for traversing the same data 50 times without the CAT policy. The result is quite similar to the one above. At the beginning, the clock count is about 3000 because every access misses the cache. As the number of traversals increases, it drops to about 1000 because some accesses hit the cache.

What I expected is that the clock count would be about 1000 rather than 3000 from the very first iterations of the loop, because the data should already have been loaded into the LLC by the task running on core 0 and should therefore hit in the cache for the task running on core 2.

I have tried changing the data size and the number of ways allocated in the CAT policy, but the result is still not what I expect.

I have considered some possible reasons for the unexpected result. For example, instructions cached in the LLC may compete with the data cached in the LLC. Also, when the array is loaded, it might not fit entirely in the LLC because of the mapping between memory addresses and cache addresses.

But I'm not sure whether there is something wrong in my test code or in my use of CAT.

Could you please help me with this problem?

Besides, is there any other feasible way to pin data in the cache using tools like intel-cmt-cat?

Thanks a lot!

Your data set in the above example is not large enough to start filling into LLC, so it will operate from L2 cache and CAT will have no impact. Try increasing the dataset to fill L2 and the 2 LLC ways e.g. 6 or 7 MB.

A question about the implementation of intel-cmt-cat #238