Closed Xeroxxx closed 3 months ago
I am of the opinion that running QAT on NUMA node 0 or 1 should not make a difference in performance: https://www.intel.com/content/www/us/en/developer/articles/technical/use-intel-quickassist-technology-efficiently-with-numa-awareness.html
Having said that, can you give more info on your NUMA topology? Please share the output of the following commands:
lstopo --ignore PU --merge --no-caches
(you need to install hwloc or an equivalent package)
lscpu
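If hwloc is unavailable, Linux also exposes each PCI device's NUMA affinity directly in sysfs; this generic sketch (not part of the QAT tooling) lists every device with its reported node, where -1 means no affinity was reported:

```shell
# Print each PCI device's NUMA node straight from sysfs.
for dev in /sys/bus/pci/devices/*; do
  node=$(cat "$dev/numa_node" 2>/dev/null || echo "?")
  printf '%s -> NUMA node %s\n' "${dev##*/}" "$node"
done
```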
Hello @venkatesh6911, thanks for your reply. Please don't be confused: it is currently running in VF mode. That does not, of course, change the PCIe info of the physical card.
EDIT: Not sure why it doesn't keep the formatting. https://pastebin.com/cdGbSPB8
Machine (220GB total)
Package L#0
NUMANode L#0 (P#0 110GB)
Core L#0
Core L#1
Core L#2
Core L#3
Core L#4
Core L#5
Core L#6
Core L#7
Core L#8
Core L#9
Core L#10
Core L#11
Core L#12
Core L#13
HostBridge
PCIBridge
PCI 01:00.0 (SAS)
Block(Disk) "sdf"
Block(Disk) "sdd"
Block(Disk) "sdb"
Block(Disk) "sdg"
Block(Disk) "sde"
Block(Disk) "sdc"
Block(Disk) "sda"
Block(Disk) "sdh"
PCIBridge
PCI 02:00.0 (Ethernet)
Net "enp2s0f0"
PCI 02:00.1 (Ethernet)
Net "enp2s0f1"
PCIBridge
PCIBridge
PCIBridge
PCI 08:00.0 (VGA)
PCIBridge
PCI 09:00.0 (VGA)
PCI 00:11.4 (SATA)
PCIBridge
PCI 0b:00.0 (VGA)
PCI 00:1f.2 (SATA)
Package L#1
NUMANode L#1 (P#1 110GB)
Core L#14
Core L#15
Core L#16
Core L#17
Core L#18
Core L#19
Core L#20
Core L#21
Core L#22
Core L#23
Core L#24
Core L#25
Core L#26
Core L#27
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCI 83:00.0 (Co-Processor)
16 x { PCI 83:01.0-02.7 (Co-Processor) }
PCIBridge
PCI 85:00.0 (Co-Processor)
16 x { PCI 85:01.0-02.7 (Co-Processor) }
PCIBridge
PCI 87:00.0 (Co-Processor)
16 x { PCI 87:01.0-02.7 (Co-Processor) }
PCIBridge
PCI 89:00.0 (3D)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
BIOS Model name: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz CPU @ 2.0GHz
BIOS CPU family: 179
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
Stepping: 1
CPU(s) scaling MHz: 98%
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 3990.91
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 896 KiB (28 instances)
L1i: 896 KiB (28 instances)
L2: 7 MiB (28 instances)
L3: 70 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Vulnerable
L1tf: Mitigation; PTE Inversion; VMX vulnerable
Mds: Vulnerable; SMT vulnerable
Meltdown: Vulnerable
Mmio stale data: Vulnerable
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Srbds: Not affected
Tsx async abort: Vulnerable
Apologies for the late reply. Your NUMA layout looks fine to me. As you had said that QAT is slow, can you share the comparison test data? Since the QAT device is in NUMA node 1, please make sure that you are running your test application on cores from node 1 only.
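One way to pin the benchmark to node-1 cores is to read the node's CPU list straight from sysfs (a sketch; the 14-27,42-55 fallback matches the lscpu output above and is only a guess for other machines):

```shell
# Read node 1's CPU list from sysfs; fall back to the range shown by
# lscpu above if node1 is absent on this machine.
cpus=$(cat /sys/devices/system/node/node1/cpulist 2>/dev/null || echo 14-27,42-55)
echo "binding to CPUs: $cpus"
taskset -c "$cpus" cryptsetup benchmark
```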
No worries. Thank you for looking into this. Here is some comparison data:
QAT (no NUMA preference):
root@xxx:~# grep qat /proc/crypto
driver : pkcs1pad(qat-rsa,sha512)
driver : qat-rsa
module : intel_qat
driver : qat_aes_gcm
module : intel_qat
driver : qat_aes_cbc_hmac_sha512
module : intel_qat
driver : qat_aes_cbc_hmac_sha256
module : intel_qat
driver : qat_aes_xts
module : intel_qat
driver : qat_aes_ctr
module : intel_qat
driver : qat_aes_cbc
module : intel_qat
root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1 1081006 iterations per second for 256-bit key
PBKDF2-sha256 1466539 iterations per second for 256-bit key
PBKDF2-sha512 1094546 iterations per second for 256-bit key
PBKDF2-ripemd160 634731 iterations per second for 256-bit key
PBKDF2-whirlpool 485451 iterations per second for 256-bit key
argon2i 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 342.2 MiB/s 339.6 MiB/s
serpent-cbc 128b 79.3 MiB/s 509.7 MiB/s
twofish-cbc 128b 180.1 MiB/s 322.9 MiB/s
aes-cbc 256b 351.1 MiB/s 333.5 MiB/s
serpent-cbc 256b 81.7 MiB/s 510.7 MiB/s
twofish-cbc 256b 181.9 MiB/s 324.9 MiB/s
aes-xts 256b 374.1 MiB/s 351.6 MiB/s
serpent-xts 256b 456.1 MiB/s 455.3 MiB/s
twofish-xts 256b 300.7 MiB/s 303.0 MiB/s
aes-xts 512b 368.9 MiB/s 353.3 MiB/s
serpent-xts 512b 465.1 MiB/s 456.3 MiB/s
twofish-xts 512b 305.4 MiB/s 303.6 MiB/s
Running on E5-2660v4:
root@xxx:~# /etc/init.d/qat_service status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
qat_dev0 - type: c6xx, inst_id: 0, node_id: 1, bsf: 0000:83:00.0, #accel: 5 #engines: 10 state: down
qat_dev1 - type: c6xx, inst_id: 1, node_id: 1, bsf: 0000:85:00.0, #accel: 5 #engines: 10 state: down
qat_dev2 - type: c6xx, inst_id: 2, node_id: 1, bsf: 0000:87:00.0, #accel: 5 #engines: 10 state: down
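The node_id values above can be cross-checked against what the kernel reports in sysfs (the BDFs are taken from the qat_service output; on a machine without these devices the loop simply prints "not present"):

```shell
# Cross-check the NUMA node of each QAT endpoint from the listing above.
for bdf in 0000:83:00.0 0000:85:00.0 0000:87:00.0; do
  f=/sys/bus/pci/devices/$bdf/numa_node
  if [ -r "$f" ]; then
    echo "$bdf: NUMA node $(cat "$f")"
  else
    echo "$bdf: not present"
  fi
done
```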
root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1 1068884 iterations per second for 256-bit key
PBKDF2-sha256 1474790 iterations per second for 256-bit key
PBKDF2-sha512 1093405 iterations per second for 256-bit key
PBKDF2-ripemd160 634731 iterations per second for 256-bit key
PBKDF2-whirlpool 483660 iterations per second for 256-bit key
argon2i 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 589.8 MiB/s 2339.8 MiB/s
serpent-cbc 128b 82.4 MiB/s 513.4 MiB/s
twofish-cbc 128b 183.2 MiB/s 326.0 MiB/s
aes-cbc 256b 444.1 MiB/s 1841.0 MiB/s
serpent-cbc 256b 81.8 MiB/s 512.3 MiB/s
twofish-cbc 256b 183.4 MiB/s 327.8 MiB/s
aes-xts 256b 2011.0 MiB/s 2028.8 MiB/s
serpent-xts 256b 470.7 MiB/s 457.1 MiB/s
twofish-xts 256b 305.9 MiB/s 304.0 MiB/s
aes-xts 512b 1607.0 MiB/s 1610.4 MiB/s
serpent-xts 512b 467.4 MiB/s 457.8 MiB/s
twofish-xts 512b 305.6 MiB/s 304.1 MiB/s
Will retry: switch QAT from VF to PF, disable the IOMMU, and run cryptsetup on NUMA node 1 cores (as this is the current configuration; tests were made in PF mode).
@venkatesh6911 So, running on NUMA node 1, which is where the QAT card is attached PCIe-wise, I set the specific cores via taskset to 20-27 (see the lscpu output above).
Without taskset I get the message:
[ 4462.271528] QAT: Device found on remote node 1 different from application node 0
With taskset the message disappears, as cores 20-27 are on NUMA node 1. Great!
taskset -c 20-27 cryptsetup benchmark
Result on NUMA node 1 in PF mode:
root@xxx:~# taskset -c 20-27 cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1 1082121 iterations per second for 256-bit key
PBKDF2-sha256 1474790 iterations per second for 256-bit key
PBKDF2-sha512 1093405 iterations per second for 256-bit key
PBKDF2-ripemd160 633963 iterations per second for 256-bit key
PBKDF2-whirlpool 481882 iterations per second for 256-bit key
argon2i 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 379.8 MiB/s 366.3 MiB/s
serpent-cbc 128b 82.5 MiB/s 513.9 MiB/s
twofish-cbc 128b 184.1 MiB/s 327.6 MiB/s
aes-cbc 256b 407.2 MiB/s 408.0 MiB/s
serpent-cbc 256b 82.6 MiB/s 513.8 MiB/s
twofish-cbc 256b 184.5 MiB/s 329.4 MiB/s
aes-xts 256b 413.3 MiB/s 412.7 MiB/s
serpent-xts 256b 471.9 MiB/s 460.2 MiB/s
twofish-xts 256b 305.6 MiB/s 307.5 MiB/s
aes-xts 512b 410.1 MiB/s 409.6 MiB/s
serpent-xts 512b 471.4 MiB/s 459.4 MiB/s
twofish-xts 512b 308.2 MiB/s 307.3 MiB/s
root@xxx:~# grep qat /proc/crypto
driver : pkcs1pad(qat-rsa,sha512)
driver : qat-rsa
module : intel_qat
driver : qat_aes_gcm
module : intel_qat
driver : qat_aes_cbc_hmac_sha512
module : intel_qat
driver : qat_aes_cbc_hmac_sha256
module : intel_qat
driver : qat_aes_xts
module : intel_qat
driver : qat_aes_ctr
module : intel_qat
driver : qat_aes_cbc
module : intel_qat
Running on E5-2660v4:
root@xxx:~# /etc/init.d/qat_service status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
qat_dev0 - type: c6xx, inst_id: 0, node_id: 1, bsf: 0000:83:00.0, #accel: 5 #engines: 10 state: down
qat_dev1 - type: c6xx, inst_id: 1, node_id: 1, bsf: 0000:85:00.0, #accel: 5 #engines: 10 state: down
qat_dev2 - type: c6xx, inst_id: 2, node_id: 1, bsf: 0000:87:00.0, #accel: 5 #engines: 10 state: down
root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1 1068884 iterations per second for 256-bit key
PBKDF2-sha256 1474790 iterations per second for 256-bit key
PBKDF2-sha512 1093405 iterations per second for 256-bit key
PBKDF2-ripemd160 634731 iterations per second for 256-bit key
PBKDF2-whirlpool 483660 iterations per second for 256-bit key
argon2i 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id 5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 589.8 MiB/s 2339.8 MiB/s
serpent-cbc 128b 82.4 MiB/s 513.4 MiB/s
twofish-cbc 128b 183.2 MiB/s 326.0 MiB/s
aes-cbc 256b 444.1 MiB/s 1841.0 MiB/s
serpent-cbc 256b 81.8 MiB/s 512.3 MiB/s
twofish-cbc 256b 183.4 MiB/s 327.8 MiB/s
aes-xts 256b 2011.0 MiB/s 2028.8 MiB/s
serpent-xts 256b 470.7 MiB/s 457.1 MiB/s
twofish-xts 256b 305.9 MiB/s 304.0 MiB/s
aes-xts 512b 1607.0 MiB/s 1610.4 MiB/s
serpent-xts 512b 467.4 MiB/s 457.8 MiB/s
twofish-xts 512b 305.6 MiB/s 304.1 MiB/s
Expectation: QAT acceleration is faster than the CPU. Reality: QAT is slower than the CPU.
However, when using a VF within a VM, performance is great!
I see that in the specification, AES-CBC reaches around 103 Gbps for 4KB packet size: https://cdrdv2-public.intel.com/691474/intel-quickassist-adapter-8960-8970.pdf
I will check on this and get back to you soon.
We just came across a similar issue reported internally.
Quoting from the issue, "On a multi-socket platform, there can be a performance degradation on the remote sockets. This can arise when either the threads are not affinitised to the core on the socket the device is on and/or the memory is not allocated on the appropriate NUMA node."
So, QAT performance on NUMA node 1 will be low compared to node 0 in PF mode. In VF mode, the memory allocations are more likely to be done on the remote socket, with minimal performance impact. -> This explains the data you got with VF mode.
This issue will be fixed in a future release.
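Until then, the quoted explanation suggests binding both the threads and the memory allocations to the device's node; numactl can do both at once (a sketch of that workaround; numactl is a separate package, and cryptsetup benchmark is just the test used in this thread):

```shell
# Bind execution AND memory allocation to node 1, matching the QAT
# device's socket, instead of pinning CPUs alone with taskset.
numactl --cpunodebind=1 --membind=1 cryptsetup benchmark
```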
So, QAT performance on NUMA node 1 will be low compared to node 0 in PF mode. In VF mode, the memory allocations are more likely to be done on the remote socket, with minimal performance impact. -> This explains the data you got with VF mode.
Even when QAT is attached to NUMA node 1? Remote is technically NUMA node 0 then. I can't get good performance in PF mode no matter what.
However, I'm running in VF mode, but great to hear you are on it!
Feel free to close this issue or keep it open until it's fixed :)
Thank you @venkatesh6911
I'm running an Intel QAT 8970 on Debian in PF mode.
Besides being painfully slow (the E5-2660 v4's AES-NI seems enormously faster), my dmesg log gets spammed with the message below. The QAT accelerator is in a PCIe slot that belongs to NUMA node 1 (of 0, 1).
Is this intended? Do I need to run the card on NUMA node 0?
[ 4566.872879] QAT: Device found on remote node 1 different from application node 0
Any suggestions?