Closed tong1heng closed 1 week ago
CAT does not affect reads (cache hits). Data already allocated in the cache can be read by any core, regardless of the CAT policy, so the task running on core 0 can access all LLC ways. Your CAT policy only takes effect on a cache miss: when newly requested data is allocated in the cache, it is placed in the assigned LLC ways (2 ways for core 0 in this case).
This is something to consider when experimenting with CAT. If the task is already running when you configure your CAT policy, it may already have data allocated in the cache, and CAT may not have the expected effect. To ensure CAT takes full effect, configure CAT before starting the task on core 0 so that all of its data is allocated in the assigned LLC ways.
Thanks for your reply.
And there is a new question: does CAT have an effect on the mapping between cache addresses and memory addresses?
Here is an example for my question.
When I don't use my CAT policy, the task running on core 0 may access some memory data that maps to the other 9 ways (i.e. 0x7fc).
If I use my CAT policy, I allocate 2 ways (i.e. 0x003) for core 0. When the task running on core 0 accesses memory data that previously mapped to the other 9 ways, what will happen?
Can these 2 ways be allocated for that particular data? Or will that data always cause a cache miss (i.e. can it not be allocated in the 2 ways of the LLC)?
Does CAT have an effect on the mapping between cache addresses and memory addresses?
No
If I use my CAT policy, I allocate 2 ways (i.e. 0x003) for core 0. When the task running on core 0 accesses memory data that previously mapped to the other 9 ways, what will happen?
If the data is already in the cache, CAT will have no impact. Cores can always access data in any cache way. If the data is not in cache, it will be allocated in the 2 ways (0x3) assigned to core 0.
Can these 2 ways be allocated for that particular data? Or will that data always cause a cache miss (i.e. can it not be allocated in the 2 ways of the LLC)?
As mentioned above, all data in the cache can be accessed by any core (no cache miss). In order to re-allocate this data in ways 0x3, it should be flushed / evicted from the cache and re-loaded by core 0 into the correct ways.
Thanks for your reply. I have understood what you explained above.
Consider one real-world scenario: I'm trying to pin some frequently accessed data in the LLC (i.e. keep it resident in the LLC for as long as possible) in order to improve performance. I have designed a demo to observe the effect of a CAT policy.
First, I check my LLC metadata.
$ cat /sys/devices/system/cpu/cpu0/cache/index3/number_of_sets
36864
$ cat /sys/devices/system/cpu/cpu0/cache/index3/ways_of_associativity
11
$ cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
64
Then, I set my CAT policy.
sudo pqos -e "llc@0:0=0x7fc;llc@0:1=0x003"
sudo pqos -a "llc:1=0"
Finally, I run my test code. And here is my test code:
#include <iostream>
#include <vector>
#include <ctime>
#include <cstdlib>   // exit, EXIT_FAILURE
#include <unistd.h>  // fork, getpid, sleep
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity

std::vector<int> vec;        // data to be pinned
const int vecSize = 100000;  // data size: 100000 * 4 B
const int eps = 50;          // number of timed read passes

#pragma GCC push_options     // set -O0 to ensure the loops are not optimized away
#pragma GCC optimize ("O0")
void load() {
    int a;
    for (auto v : vec) {
        a = v;
    }
    (void)a;
}

void test() {
    int a;
    for (int i = 0; i < eps; i++) {
        clock_t start = clock();
        for (auto v : vec) {
            a = v;
        }
        clock_t end = clock();
        std::cout << "No." << i << " total clock number: " << end - start << "\n";
    }
    (void)a;
}
#pragma GCC pop_options

int main() {
    // data initialization
    vec.clear();
    vec.reserve(vecSize);
    for (int i = 0; i < vecSize; i++) {
        vec.emplace_back(i * 100 + 123456);
    }

    pid_t pid = fork();
    if (pid < 0) {
        std::cout << "create process fail!\n";
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {  // child process loads the data into the LLC
        std::cout << "child process pid = " << getpid() << "\n";
        // bind task to core 0 on socket 0
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
        std::cout << "child process begin to load\n";
        for (int i = 0; i < 50; i++) {  // load 50 times to warm the cache fully
            load();
        }
        while (1) { }  // keep running so core 0's allocated LLC ways stay occupied
    } else {  // parent process performs the timed test
        // bind task to core 2 on socket 0
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
        sleep(1);  // wait for the child process to load the data
        std::cout << "parent process begin to test\n";
        test();
    }
    return 0;
}
However, the result doesn't meet my expectation. It seems that the data is not pinned in the 2 allocated LLC ways (i.e. 0x003).
$ ./test
child process pid = 1885537
child process begin to load
parent process begin to test
No.0 total clock number: 2916
No.1 total clock number: 2872
No.2 total clock number: 2879
No.3 total clock number: 2869
No.4 total clock number: 2174
No.5 total clock number: 1197
No.6 total clock number: 1194
No.7 total clock number: 1195
No.8 total clock number: 1193
No.9 total clock number: 1198
No.10 total clock number: 1194
No.11 total clock number: 1196
No.12 total clock number: 1123
No.13 total clock number: 917
No.14 total clock number: 858
No.15 total clock number: 896
No.16 total clock number: 863
No.17 total clock number: 1030
No.18 total clock number: 928
No.19 total clock number: 927
No.20 total clock number: 902
No.21 total clock number: 1009
No.22 total clock number: 931
No.23 total clock number: 988
No.24 total clock number: 926
No.25 total clock number: 903
No.26 total clock number: 862
No.27 total clock number: 798
No.28 total clock number: 790
No.29 total clock number: 823
No.30 total clock number: 926
No.31 total clock number: 830
No.32 total clock number: 799
No.33 total clock number: 796
No.34 total clock number: 910
No.35 total clock number: 998
No.36 total clock number: 958
No.37 total clock number: 1013
No.38 total clock number: 980
No.39 total clock number: 1018
No.40 total clock number: 981
No.41 total clock number: 1012
No.42 total clock number: 1043
No.43 total clock number: 1068
No.44 total clock number: 976
No.45 total clock number: 1016
No.46 total clock number: 1044
No.47 total clock number: 1038
No.48 total clock number: 871
No.49 total clock number: 904
For a fair comparison, I also measured the clock count of traversing the same data 50 times without the CAT policy. The result is quite similar to the one above: at the beginning the clock count is about 3000 because every access misses the cache, and as the number of accesses grows it drops to about 1000 because a portion of the accesses hit.
What I expected instead is that, at the beginning of the 50-iteration loop, the clock count would already be about 1000 rather than 3000, because the data should have been loaded into the LLC by the task on core 0 and should therefore produce cache hits for the task on core 2.
I have tried varying the data size and the number of ways allocated in the CAT policy, but the result is still not good.
I have considered some possible reasons for the unexpected result. For example, instruction fetches that allocate into the LLC may interfere with the data, or the array may not be fully resident in the LLC because of how memory addresses map to cache sets.
But I'm not sure whether the problem is in my test code or in my use of CAT.
Could you please help me with this problem?
Besides, is there another feasible way to pin data in the cache using tools like intel-cmt-cat?
Thanks a lot!
Your data set in the above example is not large enough to start spilling into the LLC, so it operates out of the L2 cache and CAT has no impact. Try increasing the dataset so that it fills L2 plus the 2 LLC ways, e.g. 6 or 7 MB.
Hi, I've recently been trying intel-cmt-cat to pin some data in the LLC, and I'd like to ask a question about the implementation of CAT.
The LLC on my CPU's socket 0 has 11 ways (i.e. the default mask is 0x7ff).
I first set COS0 and COS1 on socket0:
sudo pqos -e "llc@0:0=0x7fc;llc@0:1=0x003"
Then I associate core 0 with COS 1 and other cores with COS0 (default):
sudo pqos -a "llc:1=0"
The configuration appears to have been applied correctly.
I just want to know whether the task running on core 0 can access the whole LLC on socket 0, or only its 2 associated ways, when it reads data that is already in the LLC (i.e. a cache hit). In other words, does the CAT policy only decide which ways are eligible for allocation when old data must be evicted on a cache miss, without restricting the access pattern on the LLC?
Or does the CAT policy isolate the LLC for both read and write (allocation) operations?
Looking forward to your reply. Thanks a lot!