intel / qatlib

The NUMA allocation mechanism in version 24.09 is still questionable #95

Open Microcarft opened 1 month ago

Microcarft commented 1 month ago

I was very happy to see the new version, and to see that the NUMA problem that had been giving me a headache was finally addressed. I immediately uninstalled the 24.02 version of qatlib and qatmgr, pulled the 24.09 version, and rebuilt and installed it. However, it was disappointing: after I recompiled the application against the new qatlib static library and used cgroup to restrict it to the cores of CPU1 and the NUMA node (node1) corresponding to CPU1, the application still used the QAT#0 device, which is the QAT hardware accelerator of CPU0. My sysconfig/qat is configured as follows:

POLICY=1
ServicesEnabled=dc

Could you tell me what I am missing? Or is this a bug?

fionatrahe commented 1 month ago

Ah! Thanks for the feedback. That wasn’t the use-case we were solving for, but we will take it on board and investigate. In the meantime, would the following be a workaround for you?

- Pick up one VF from each PF, so you have VFs on each socket, by setting POLICY=0.
- Use these to find the devices on the numaNode you’re interested in:

CpaStatus cpaGetNumDevices(Cpa16U *numDevices);
CpaStatus cpaGetDeviceInfo(Cpa16U device, CpaDeviceInfo *deviceInfo);

Although this will result in some VF resources being unused, that doesn’t necessarily limit performance; it depends on your use-case. If you run processes on all numaNodes then you can still use the full compute power of all devices.
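The device lookup in this workaround could be sketched roughly as below. This is only an illustration, not code from qatlib. `filter_devices_by_node` and `collect_device_nodes` are hypothetical helper names, and `HAVE_QATLIB` is a made-up build flag used so the selection logic compiles without the QAT headers. The QAT part assumes that `CpaDeviceInfo` exposes a `numaNode` field and that the process has already started the QAT user-space layer before enumerating devices.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef HAVE_QATLIB /* hypothetical build flag, not defined by qatlib itself */
#include "cpa.h"
#include "cpa_dev.h"
#endif

/*
 * Copy into `out` the indices of all devices whose NUMA node equals `target`.
 * nodes[i] is the NUMA node reported for device i; `out` must have room for
 * `count` entries. Returns the number of matching devices.
 */
size_t filter_devices_by_node(const uint32_t *nodes, size_t count,
                              uint32_t target, uint16_t *out)
{
    size_t matches = 0;
    for (size_t i = 0; i < count; i++) {
        if (nodes[i] == target) {
            out[matches++] = (uint16_t)i;
        }
    }
    return matches;
}

#ifdef HAVE_QATLIB
/*
 * Fill `nodes` with the numaNode of each device visible to this process,
 * so that nodes[i] belongs to device i.
 * Assumption: the QAT user-space layer was started before this call.
 * Returns the number of entries written, or 0 on any error.
 */
size_t collect_device_nodes(uint32_t *nodes, size_t capacity)
{
    Cpa16U num_devices = 0;
    if (cpaGetNumDevices(&num_devices) != CPA_STATUS_SUCCESS) {
        return 0;
    }

    size_t written = 0;
    for (Cpa16U dev = 0; dev < num_devices && written < capacity; dev++) {
        CpaDeviceInfo info = {0};
        if (cpaGetDeviceInfo(dev, &info) != CPA_STATUS_SUCCESS) {
            return 0;
        }
        nodes[written++] = info.numaNode;
    }
    return written;
}
#endif
```

With POLICY=0 the process would call `collect_device_nodes`, then `filter_devices_by_node` with the node it is pinned to, and use only the devices that come back. Mapping those devices to the compression instances the application actually submits work to is not shown here.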

Microcarft commented 1 month ago

Very happy to receive your reply! In fact, I only plan to run the application on a specific CPU, so it should run only on the NUMA node that CPU belongs to and use only the QAT hardware attached to that CPU. In the scenario described above, I bind the application to CPU1 through cgroup, with the memory node set to match, but the QAT hardware in use is still the one on CPU0.

I know that if I want to run multiple applications in the future, sooner or later I will have to use all the PFs on the system, and at that point setting POLICY to 0 or a higher value will be better suited to multi-process performance. For my current needs, however, I want the application to use only one VF (that is, POLICY=1), so that I can easily determine whether qatmgr allocates a VF from QAT#0 or QAT#1 to it.

In addition, I use pcm-accel -qat to view the usage of the two physical QAT engines. pcm is a very useful performance monitoring tool, and I find it more intuitive and convenient. Looking forward to your reply, thank you~

Microcarft commented 1 month ago

Oops, sorry, I clicked the wrong button. This question has not been answered yet.

fionatrahe commented 1 month ago

If you only want to try out QATs on socket 1 and you're not concerned with wasting the QAT resources on socket 0, then you could set POLICY=5. The first 4 VF devices the process gets will be from socket 0; the 5th should be on socket 1.

Unfortunately, at the moment we don’t have a configuration option that lets processes pick up devices only from the socket they are running on.
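Until such an option exists, an application that wants this behaviour would have to work out its own NUMA node and compare it with the node reported for each device. A minimal sketch of the first half, assuming Linux and using the raw `getcpu` system call, might look like this. `current_numa_node` is a hypothetical helper name, not part of qatlib.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Return the NUMA node of the CPU the calling thread is running on at the
 * moment of the call, or -1 if the system call fails. Linux-only.
 * The answer is only stable if the thread is pinned (for example by a
 * cpuset cgroup); otherwise the scheduler may move it to another node.
 */
int current_numa_node(void)
{
    unsigned int cpu = 0;
    unsigned int node = 0;

    /* The third argument is unused by current kernels and may be NULL. */
    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0) {
        return -1;
    }
    return (int)node;
}
```

The returned value could then be compared against the NUMA node of each QAT device the process was given, keeping only the devices that match.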