intel / QATzip

Compression Library accelerated by Intel® QuickAssist Technology
https://developer.intel.com/quickassist

Performance test falls back to SW using 4940 devices #75

Closed by iomartin 1 year ago

iomartin commented 1 year ago

Running perf_test on an Intel Xeon Platinum 8480+, which has 2x 4940 QAT devices, I see that it tries to start 48 processes. However, htop shows 36 cores at 100% while all the others are fairly idle.

Furthermore, when I inspect result_comp I see that 12 processes achieved "good" compression throughput (5.5-6.3 Gbps), while the other 36 are very slow (~0.5 Gbps):

[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=621766, bytes=524288, 6.282508 Gbps, input_len=524288, comp_len=52302, ratio=9.975815%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=624495, bytes=524288, 6.255054 Gbps, input_len=524288, comp_len=52078, ratio=9.933090%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=624697, bytes=524288, 6.253031 Gbps, input_len=524288, comp_len=52770, ratio=10.065079%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=628878, bytes=524288, 6.211459 Gbps, input_len=524288, comp_len=52100, ratio=9.937286%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=696983, bytes=524288, 5.604513 Gbps, input_len=524288, comp_len=52171, ratio=9.950829%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=698047, bytes=524288, 5.595970 Gbps, input_len=524288, comp_len=52337, ratio=9.982491%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=696948, bytes=524288, 5.604794 Gbps, input_len=524288, comp_len=52529, ratio=10.019112%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=701191, bytes=524288, 5.570879 Gbps, input_len=524288, comp_len=53304, ratio=10.166931%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=701246, bytes=524288, 5.570442 Gbps, input_len=524288, comp_len=53694, ratio=10.241318%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=702151, bytes=524288, 5.563262 Gbps, input_len=524288, comp_len=51527, ratio=9.827995%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=702426, bytes=524288, 5.561084 Gbps, input_len=524288, comp_len=52672, ratio=10.046387%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=703488, bytes=524288, 5.552689 Gbps, input_len=524288, comp_len=52335, ratio=9.982109%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7602288, bytes=524288, 0.513826 Gbps, input_len=524288, comp_len=29569, ratio=5.639839%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7604811, bytes=524288, 0.513655 Gbps, input_len=524288, comp_len=29619, ratio=5.649376%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7610756, bytes=524288, 0.513254 Gbps, input_len=524288, comp_len=29629, ratio=5.651283%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7612705, bytes=524288, 0.513122 Gbps, input_len=524288, comp_len=29737, ratio=5.671883%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7609414, bytes=524288, 0.513344 Gbps, input_len=524288, comp_len=29511, ratio=5.628777%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7609823, bytes=524288, 0.513317 Gbps, input_len=524288, comp_len=29867, ratio=5.696678%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7600894, bytes=524288, 0.513920 Gbps, input_len=524288, comp_len=29272, ratio=5.583191%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7624332, bytes=524288, 0.512340 Gbps, input_len=524288, comp_len=29416, ratio=5.610657%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7624465, bytes=524288, 0.512331 Gbps, input_len=524288, comp_len=29566, ratio=5.639267%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7633926, bytes=524288, 0.511696 Gbps, input_len=524288, comp_len=29548, ratio=5.635834%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7633467, bytes=524288, 0.511727 Gbps, input_len=524288, comp_len=29617, ratio=5.648994%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7613688, bytes=524288, 0.513056 Gbps, input_len=524288, comp_len=29564, ratio=5.638885%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7615827, bytes=524288, 0.512912 Gbps, input_len=524288, comp_len=29444, ratio=5.615997%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7611253, bytes=524288, 0.513220 Gbps, input_len=524288, comp_len=29592, ratio=5.644226%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7650274, bytes=524288, 0.510603 Gbps, input_len=524288, comp_len=29867, ratio=5.696678%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7651132, bytes=524288, 0.510545 Gbps, input_len=524288, comp_len=29621, ratio=5.649757%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7630613, bytes=524288, 0.511918 Gbps, input_len=524288, comp_len=29566, ratio=5.639267%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7649640, bytes=524288, 0.510645 Gbps, input_len=524288, comp_len=29780, ratio=5.680084%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7637110, bytes=524288, 0.511483 Gbps, input_len=524288, comp_len=29623, ratio=5.650139%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7636562, bytes=524288, 0.511519 Gbps, input_len=524288, comp_len=29693, ratio=5.663490%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7635874, bytes=524288, 0.511566 Gbps, input_len=524288, comp_len=29768, ratio=5.677795%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7641798, bytes=524288, 0.511169 Gbps, input_len=524288, comp_len=29598, ratio=5.645370%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7642474, bytes=524288, 0.511124 Gbps, input_len=524288, comp_len=29550, ratio=5.636215%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7644509, bytes=524288, 0.510988 Gbps, input_len=524288, comp_len=29749, ratio=5.674171%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7670231, bytes=524288, 0.509274 Gbps, input_len=524288, comp_len=29758, ratio=5.675888%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7655016, bytes=524288, 0.510286 Gbps, input_len=524288, comp_len=29836, ratio=5.690765%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7692523, bytes=524288, 0.507798 Gbps, input_len=524288, comp_len=29656, ratio=5.656433%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7708907, bytes=524288, 0.506719 Gbps, input_len=524288, comp_len=29603, ratio=5.646324%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7711090, bytes=524288, 0.506576 Gbps, input_len=524288, comp_len=29753, ratio=5.674934%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7707617, bytes=524288, 0.506804 Gbps, input_len=524288, comp_len=29778, ratio=5.679703%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7727654, bytes=524288, 0.505490 Gbps, input_len=524288, comp_len=29714, ratio=5.667496%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7726971, bytes=524288, 0.505534 Gbps, input_len=524288, comp_len=29757, ratio=5.675697%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7726827, bytes=524288, 0.505544 Gbps, input_len=524288, comp_len=29925, ratio=5.707741%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7734837, bytes=524288, 0.505020 Gbps, input_len=524288, comp_len=29678, ratio=5.660629%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7737837, bytes=524288, 0.504825 Gbps, input_len=524288, comp_len=29821, ratio=5.687904%
[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7757345, bytes=524288, 0.503555 Gbps, input_len=524288, comp_len=29861, ratio=5.695534%
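
For anyone who wants to check their own result_comp the same way, here is a minimal sketch that splits the per-process lines into a fast and a slow group. The 2.0 Gbps cutoff is an assumption picked to sit between the two clusters in this log; it is not a QATzip constant, and the function name is made up for illustration. The slow group also shows a different compression ratio (about 5.6% versus about 10%), which is consistent with a different code path producing the output.

```python
import re

# Matches the throughput and length fields of the [INFO] lines shown above.
LINE_RE = re.compile(r"([\d.]+) Gbps, input_len=(\d+), comp_len=(\d+)")


def classify(lines, hw_min_gbps=2.0):
    """Split per-process results into likely-HW and likely-SW groups.

    hw_min_gbps is an assumed cutoff between the two clusters seen in
    this log (about 0.5 Gbps and 5.5-6.3 Gbps); adjust it for other setups.
    Each entry is a (gbps, comp_len / input_len) tuple.
    """
    hw, sw = [], []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip anything that is not a result line
        gbps = float(m.group(1))
        ratio = int(m.group(3)) / int(m.group(2))
        (hw if gbps >= hw_min_gbps else sw).append((gbps, ratio))
    return hw, sw


# Two lines copied from the output above, one from each cluster.
sample = [
    "[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=621766, bytes=524288, "
    "6.282508 Gbps, input_len=524288, comp_len=52302, ratio=9.975815%",
    "[INFO] srv=COMP, tid=0, verify=0, count=1000, msec=7602288, bytes=524288, "
    "0.513826 Gbps, input_len=524288, comp_len=29569, ratio=5.639839%",
]
hw, sw = classify(sample)
print(f"likely HW: {len(hw)}, likely SW: {len(sw)}")
```

To run it on a real file, pass `open("result_comp")` instead of `sample`.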

Inspecting result_comp_stderr, I see a bunch of messages indicating that it fell back to SW (which explains why 36 cores are at 100%):

Error userStarMultiProcess(-1), switch to SW if permitted
g_process.qz_init_status = QZ_NO_HW

This seems to be because NumProcesses = 6 is set in the conf files: increasing it to 24 makes all processes run on HW (but then each process is much slower, at about 1.4 Gbps).
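
A quick back-of-the-envelope comparison of the two setups, as a sketch only: it sums the 12 fast per-process figures from result_comp above and compares them with 48 processes at the approximate 1.4 Gbps reported for NumProcesses = 24. The 1.4 figure is a rough observation, so the second total is only indicative.

```python
# Per-process throughput (Gbps) of the 12 fast processes, copied from result_comp above.
fast = [
    6.282508, 6.255054, 6.253031, 6.211459,
    5.604513, 5.595970, 5.604794, 5.570879,
    5.570442, 5.563262, 5.561084, 5.552689,
]

# Aggregate throughput with NumProcesses = 6 (12 processes on hardware).
total_12 = sum(fast)

# Aggregate with NumProcesses = 24: 48 processes at roughly 1.4 Gbps each.
# 1.4 is the approximate per-process figure mentioned above, not a measured list.
total_48 = 48 * 1.4

print(f"12 processes: {total_12:.1f} Gbps, 48 processes: {total_48:.1f} Gbps")
```

Both totals land around 70 Gbps, which suggests (but does not prove) that the two devices are already saturated with 12 processes and that spreading the work over 48 just divides the same hardware throughput.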

Does the configuration or the test script need to be adjusted?

cfzhu commented 1 year ago

Hi iomartin, the test script needs to be changed for your platform. It was written for and tested on an Intel(R) Xeon(R) Platinum 8488C, which has 8 QAT devices. We will update this script in the next release.

iomartin commented 1 year ago

Thanks, I'll just change mine to use 12 processes in the meantime. It might also be a good idea to add -B 0 to the test call so that it doesn't silently fall back to software.

guoanwu commented 1 year ago

The number of processes should match NumProcesses in the configuration file; then all the processes will use the hardware. In this case, you are using NumProcesses=6 in the config files and 2 devices, so 12 processes will use the hardware.
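
The arithmetic behind this can be sketched as follows. It assumes each device's conf file contributes NumProcesses hardware slots, which is what the comment above implies; the function name is made up for illustration and is not part of QATzip.

```python
def split_processes(launched, num_devices, num_processes):
    """Return (on_hw, on_sw) for a given number of launched processes.

    Assumes total hardware slots = num_devices * num_processes, where
    num_processes is the NumProcesses value in each device's conf file.
    """
    slots = num_devices * num_processes
    on_hw = min(launched, slots)
    return on_hw, launched - on_hw


# This issue: 48 processes, 2 devices, NumProcesses = 6.
print(split_processes(48, 2, 6))

# The platform the script was written for: 8 devices, NumProcesses = 6.
print(split_processes(48, 8, 6))

# The workaround tried in this issue: 2 devices, NumProcesses = 24.
print(split_processes(48, 2, 24))
```

The first case gives 12 processes on hardware and 36 falling back to software, matching the 36 busy cores. With 8 devices, the same 48 processes all fit, which is why the script works unchanged on the 8488C.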