intel / QAT_Engine

Intel QuickAssist Technology (QAT) OpenSSL Engine (an OpenSSL Plug-In Engine) which provides cryptographic acceleration for both hardware and optimized software using Intel QuickAssist Technology enabled Intel platforms. https://developer.intel.com/quickassist
BSD 3-Clause "New" or "Revised" License

NUMA awareness #303

Closed Xeroxxx closed 3 months ago

Xeroxxx commented 8 months ago

I'm running an Intel QAT 8970 on Debian in PF Mode.

Besides being painfully slow (AES-NI on the E5-2660 v4 seems enormously faster), I get my dmesg log spammed with the message below. The QAT accelerator is in a PCIe slot that belongs to NUMA node 1 (the system has nodes 0 and 1).

Is this intended? Do I need to run the card on NUMA node 0?

[ 4566.872879] QAT: Device found on remote node 1 different from application node 0
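As a quick check of where the card actually sits (a minimal sketch; the sysfs path is built from the BDF 0000:83:00.0 shown by adf_ctl below):

  # NUMA node the kernel reports for the QAT endpoint (expected to print 1 here)
  cat /sys/bus/pci/devices/0000:83:00.0/numa_node
  # The warning above fires whenever the calling thread runs on a different node than this value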

root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1081006 iterations per second for 256-bit key
PBKDF2-sha256    1476867 iterations per second for 256-bit key
PBKDF2-sha512    1089995 iterations per second for 256-bit key
PBKDF2-ripemd160  629397 iterations per second for 256-bit key
PBKDF2-whirlpool  484554 iterations per second for 256-bit key
argon2i       5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       345.4 MiB/s       333.7 MiB/s
    serpent-cbc        128b        81.9 MiB/s       503.2 MiB/s
    twofish-cbc        128b       183.1 MiB/s       325.2 MiB/s
        aes-cbc        256b       343.8 MiB/s       330.6 MiB/s
    serpent-cbc        256b        81.7 MiB/s       509.3 MiB/s
    twofish-cbc        256b       182.6 MiB/s       326.8 MiB/s
        aes-xts        256b       369.1 MiB/s       355.6 MiB/s
    serpent-xts        256b       469.2 MiB/s       457.7 MiB/s
    twofish-xts        256b       305.2 MiB/s       304.8 MiB/s
        aes-xts        512b       361.3 MiB/s       341.4 MiB/s
    serpent-xts        512b       467.0 MiB/s       455.1 MiB/s
    twofish-xts        512b       304.8 MiB/s       303.1 MiB/s
root@xxx:~# adf_ctl status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
 qat_dev0 - type: c6xx,  inst_id: 0,  node_id: 1,  bsf: 0000:83:00.0,  #accel: 5 #engines: 10 state: up
 qat_dev1 - type: c6xx,  inst_id: 1,  node_id: 1,  bsf: 0000:85:00.0,  #accel: 5 #engines: 10 state: up
 qat_dev2 - type: c6xx,  inst_id: 2,  node_id: 1,  bsf: 0000:87:00.0,  #accel: 5 #engines: 10 state: up
root@xxx:~# cat /sys/kernel/debug/qat_c6xx_0000\:83\:00.0/fw_counters
+------------------------------------------------+
| FW Statistics for Qat Device                   |
+------------------------------------------------+
| Firmware Requests [AE  0]:               24416 |
| Firmware Responses[AE  0]:               24416 |
| RAS Events        [AE  0]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  1]:               24416 |
| Firmware Responses[AE  1]:               24416 |
| RAS Events        [AE  1]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  2]:               24416 |
| Firmware Responses[AE  2]:               24416 |
| RAS Events        [AE  2]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  3]:               24416 |
| Firmware Responses[AE  3]:               24416 |
| RAS Events        [AE  3]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  4]:               24416 |
| Firmware Responses[AE  4]:               24416 |
| RAS Events        [AE  4]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  5]:               24416 |
| Firmware Responses[AE  5]:               24416 |
| RAS Events        [AE  5]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  6]:               24416 |
| Firmware Responses[AE  6]:               24416 |
| RAS Events        [AE  6]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  7]:               24416 |
| Firmware Responses[AE  7]:               24416 |
| RAS Events        [AE  7]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  8]:               24416 |
| Firmware Responses[AE  8]:               24416 |
| RAS Events        [AE  8]:                   0 |
+------------------------------------------------+
| Firmware Requests [AE  9]:               24416 |
| Firmware Responses[AE  9]:               24416 |
| RAS Events        [AE  9]:                   0 |
+------------------------------------------------+

Any suggestions?

venkatesh6911 commented 8 months ago

I am of the opinion that running QAT on NUMA node 0 or 1 should not make a difference in performance. https://www.intel.com/content/www/us/en/developer/articles/technical/use-intel-quickassist-technology-efficiently-with-numa-awareness.html

Having said that, can you give more info on your NUMA topology? Can you give the output of the following commands:

  1. lstopo --ignore PU --merge --no-caches (you need to install hwloc or an equivalent package)
  2. lscpu
Xeroxxx commented 8 months ago

Hello @venkatesh6911, thanks for your reply. Please don't be confused that it is currently running in VF mode; that of course does not change the PCIe placement of the physical card.

EDIT: Not sure why it doesn't keep the formatting. https://pastebin.com/cdGbSPB8

Machine (220GB total)
  Package L#0
    NUMANode L#0 (P#0 110GB)
    Core L#0
    Core L#1
    Core L#2
    Core L#3
    Core L#4
    Core L#5
    Core L#6
    Core L#7
    Core L#8
    Core L#9
    Core L#10
    Core L#11
    Core L#12
    Core L#13
    HostBridge
      PCIBridge
        PCI 01:00.0 (SAS)
          Block(Disk) "sdf"
          Block(Disk) "sdd"
          Block(Disk) "sdb"
          Block(Disk) "sdg"
          Block(Disk) "sde"
          Block(Disk) "sdc"
          Block(Disk) "sda"
          Block(Disk) "sdh"
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0f0"
        PCI 02:00.1 (Ethernet)
          Net "enp2s0f1"
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 08:00.0 (VGA)
          PCIBridge
            PCI 09:00.0 (VGA)
      PCI 00:11.4 (SATA)
      PCIBridge
        PCI 0b:00.0 (VGA)
      PCI 00:1f.2 (SATA)
  Package L#1
    NUMANode L#1 (P#1 110GB)
    Core L#14
    Core L#15
    Core L#16
    Core L#17
    Core L#18
    Core L#19
    Core L#20
    Core L#21
    Core L#22
    Core L#23
    Core L#24
    Core L#25
    Core L#26
    Core L#27
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 83:00.0 (Co-Processor)
            16 x { PCI 83:01.0-02.7 (Co-Processor) }
          PCIBridge
            PCI 85:00.0 (Co-Processor)
            16 x { PCI 85:01.0-02.7 (Co-Processor) }
          PCIBridge
            PCI 87:00.0 (Co-Processor)
            16 x { PCI 87:01.0-02.7 (Co-Processor) }
      PCIBridge
        PCI 89:00.0 (3D)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)

https://pastebin.com/QyX2Yc2v

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  56
  On-line CPU(s) list:   0-55
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel(R) Corporation
  Model name:            Intel(R) Xeon(R) CPU E5-2660 v4@ 2.00GHz
    BIOS Model name:     Intel(R) Xeon(R) CPU E5-2660 v4@ 2.00GHz  CPU @ 2.0GHz
    BIOS CPU family:     179
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           2
    Stepping:            1
    CPU(s) scaling MHz:  98%
    CPU max MHz:         3200.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            3990.91
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
                         nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes x
                         save avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle
                         avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   896 KiB (28 instances)
  L1i:                   896 KiB (28 instances)
  L2:                    7 MiB (28 instances)
  L3:                    70 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-13,28-41
  NUMA node1 CPU(s):     14-27,42-55
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Vulnerable
  L1tf:                  Mitigation; PTE Inversion; VMX vulnerable
  Mds:                   Vulnerable; SMT vulnerable
  Meltdown:              Vulnerable
  Mmio stale data:       Vulnerable
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable
venkatesh6911 commented 8 months ago

Apologies for the late reply. Your NUMA layout looks fine to me. As you had said that QAT is slow, can you share the comparison test data? Since the QAT device is in NUMA node 1, please make sure that you are running your test application on cores from node 1 only.
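For example (a hedged sketch; the node-1 CPU IDs come from the lscpu output above, and numactl, if installed, additionally keeps memory allocations on that node, which taskset alone does not):

  # CPU-only pinning to node-1 cores
  taskset -c 14-27,42-55 cryptsetup benchmark
  # CPU and memory pinning to node 1
  numactl --cpunodebind=1 --membind=1 cryptsetup benchmark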

Xeroxxx commented 7 months ago

> Apologies for the late reply. Your NUMA layout looks fine to me. As you had said that QAT is slow, can you share the comparison test data? Since the QAT device is in NUMA node 1, please make sure that you are running your test application on cores from node 1 only.

No worries. Thank you for looking into this. Here is some comparison data:

QAT (no NUMA preference):

root@xxx:~# grep qat /proc/crypto
driver       : pkcs1pad(qat-rsa,sha512)
driver       : qat-rsa
module       : intel_qat
driver       : qat_aes_gcm
module       : intel_qat
driver       : qat_aes_cbc_hmac_sha512
module       : intel_qat
driver       : qat_aes_cbc_hmac_sha256
module       : intel_qat
driver       : qat_aes_xts
module       : intel_qat
driver       : qat_aes_ctr
module       : intel_qat
driver       : qat_aes_cbc
module       : intel_qat

root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1081006 iterations per second for 256-bit key
PBKDF2-sha256    1466539 iterations per second for 256-bit key
PBKDF2-sha512    1094546 iterations per second for 256-bit key
PBKDF2-ripemd160  634731 iterations per second for 256-bit key
PBKDF2-whirlpool  485451 iterations per second for 256-bit key
argon2i       5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       342.2 MiB/s       339.6 MiB/s
    serpent-cbc        128b        79.3 MiB/s       509.7 MiB/s
    twofish-cbc        128b       180.1 MiB/s       322.9 MiB/s
        aes-cbc        256b       351.1 MiB/s       333.5 MiB/s
    serpent-cbc        256b        81.7 MiB/s       510.7 MiB/s
    twofish-cbc        256b       181.9 MiB/s       324.9 MiB/s
        aes-xts        256b       374.1 MiB/s       351.6 MiB/s
    serpent-xts        256b       456.1 MiB/s       455.3 MiB/s
    twofish-xts        256b       300.7 MiB/s       303.0 MiB/s
        aes-xts        512b       368.9 MiB/s       353.3 MiB/s
    serpent-xts        512b       465.1 MiB/s       456.3 MiB/s
    twofish-xts        512b       305.4 MiB/s       303.6 MiB/s

Running on E5-2660v4:

root@xxx:~# /etc/init.d/qat_service status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
 qat_dev0 - type: c6xx,  inst_id: 0,  node_id: 1,  bsf: 0000:83:00.0,  #accel: 5 #engines: 10 state: down
 qat_dev1 - type: c6xx,  inst_id: 1,  node_id: 1,  bsf: 0000:85:00.0,  #accel: 5 #engines: 10 state: down
 qat_dev2 - type: c6xx,  inst_id: 2,  node_id: 1,  bsf: 0000:87:00.0,  #accel: 5 #engines: 10 state: down
root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1068884 iterations per second for 256-bit key
PBKDF2-sha256    1474790 iterations per second for 256-bit key
PBKDF2-sha512    1093405 iterations per second for 256-bit key
PBKDF2-ripemd160  634731 iterations per second for 256-bit key
PBKDF2-whirlpool  483660 iterations per second for 256-bit key
argon2i       5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       589.8 MiB/s      2339.8 MiB/s
    serpent-cbc        128b        82.4 MiB/s       513.4 MiB/s
    twofish-cbc        128b       183.2 MiB/s       326.0 MiB/s
        aes-cbc        256b       444.1 MiB/s      1841.0 MiB/s
    serpent-cbc        256b        81.8 MiB/s       512.3 MiB/s
    twofish-cbc        256b       183.4 MiB/s       327.8 MiB/s
        aes-xts        256b      2011.0 MiB/s      2028.8 MiB/s
    serpent-xts        256b       470.7 MiB/s       457.1 MiB/s
    twofish-xts        256b       305.9 MiB/s       304.0 MiB/s
        aes-xts        512b      1607.0 MiB/s      1610.4 MiB/s
    serpent-xts        512b       467.4 MiB/s       457.8 MiB/s
    twofish-xts        512b       305.6 MiB/s       304.1 MiB/s

Will retry: switch QAT from VF to PF mode, disable the IOMMU, and run cryptsetup on NUMA node 1 cores (the current configuration is VF, but the tests above were made in PF mode).

Xeroxxx commented 7 months ago

@venkatesh6911 So, running on NUMA node 1, where the QAT card is attached PCIe-wise, with the process pinned to cores 20-27 via taskset (see the lscpu output above).

Without taskset I get the message: [ 4462.271528] QAT: Device found on remote node 1 different from application node 0. With taskset the message disappears, as cores 20-27 are on NUMA node 1. Great!

taskset -c 20-27 cryptsetup benchmark

Result on NUMA node 1 in PF mode:

root@xxx:~# taskset -c 20-27 cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1082121 iterations per second for 256-bit key
PBKDF2-sha256    1474790 iterations per second for 256-bit key
PBKDF2-sha512    1093405 iterations per second for 256-bit key
PBKDF2-ripemd160  633963 iterations per second for 256-bit key
PBKDF2-whirlpool  481882 iterations per second for 256-bit key
argon2i       5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       379.8 MiB/s       366.3 MiB/s
    serpent-cbc        128b        82.5 MiB/s       513.9 MiB/s
    twofish-cbc        128b       184.1 MiB/s       327.6 MiB/s
        aes-cbc        256b       407.2 MiB/s       408.0 MiB/s
    serpent-cbc        256b        82.6 MiB/s       513.8 MiB/s
    twofish-cbc        256b       184.5 MiB/s       329.4 MiB/s
        aes-xts        256b       413.3 MiB/s       412.7 MiB/s
    serpent-xts        256b       471.9 MiB/s       460.2 MiB/s
    twofish-xts        256b       305.6 MiB/s       307.5 MiB/s
        aes-xts        512b       410.1 MiB/s       409.6 MiB/s
    serpent-xts        512b       471.4 MiB/s       459.4 MiB/s
    twofish-xts        512b       308.2 MiB/s       307.3 MiB/s
root@xxx:~# grep qat /proc/crypto
driver       : pkcs1pad(qat-rsa,sha512)
driver       : qat-rsa
module       : intel_qat
driver       : qat_aes_gcm
module       : intel_qat
driver       : qat_aes_cbc_hmac_sha512
module       : intel_qat
driver       : qat_aes_cbc_hmac_sha256
module       : intel_qat
driver       : qat_aes_xts
module       : intel_qat
driver       : qat_aes_ctr
module       : intel_qat
driver       : qat_aes_cbc
module       : intel_qat

Running on E5-2660v4:

root@xxx:~# /etc/init.d/qat_service status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
 qat_dev0 - type: c6xx,  inst_id: 0,  node_id: 1,  bsf: 0000:83:00.0,  #accel: 5 #engines: 10 state: down
 qat_dev1 - type: c6xx,  inst_id: 1,  node_id: 1,  bsf: 0000:85:00.0,  #accel: 5 #engines: 10 state: down
 qat_dev2 - type: c6xx,  inst_id: 2,  node_id: 1,  bsf: 0000:87:00.0,  #accel: 5 #engines: 10 state: down
root@xxx:~# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1068884 iterations per second for 256-bit key
PBKDF2-sha256    1474790 iterations per second for 256-bit key
PBKDF2-sha512    1093405 iterations per second for 256-bit key
PBKDF2-ripemd160  634731 iterations per second for 256-bit key
PBKDF2-whirlpool  483660 iterations per second for 256-bit key
argon2i       5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      5 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       589.8 MiB/s      2339.8 MiB/s
    serpent-cbc        128b        82.4 MiB/s       513.4 MiB/s
    twofish-cbc        128b       183.2 MiB/s       326.0 MiB/s
        aes-cbc        256b       444.1 MiB/s      1841.0 MiB/s
    serpent-cbc        256b        81.8 MiB/s       512.3 MiB/s
    twofish-cbc        256b       183.4 MiB/s       327.8 MiB/s
        aes-xts        256b      2011.0 MiB/s      2028.8 MiB/s
    serpent-xts        256b       470.7 MiB/s       457.1 MiB/s
    twofish-xts        256b       305.9 MiB/s       304.0 MiB/s
        aes-xts        512b      1607.0 MiB/s      1610.4 MiB/s
    serpent-xts        512b       467.4 MiB/s       457.8 MiB/s
    twofish-xts        512b       305.6 MiB/s       304.1 MiB/s

Expectation: QAT acceleration is faster than the CPU. Reality: QAT is slower than the CPU.

However, when using a VF within a VM, performance is great!

venkatesh6911 commented 7 months ago

I see that in the specification, AES-CBC gets around 103 Gbps for a 4 KB packet size. https://cdrdv2-public.intel.com/691474/intel-quickassist-adapter-8960-8970.pdf

I will check on this and get back to you soon.
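Note that cryptsetup benchmark drives the cipher synchronously from a single thread, which is not how the datasheet numbers are produced. For an engine-level measurement closer to that, something like the following could be tried (a hedged sketch, assuming this repo's qatengine engine id and an OpenSSL build with async support; 72 is just an example concurrency value):

  # Keep many requests in flight; QAT needs high concurrency to approach its rated throughput
  openssl speed -engine qatengine -elapsed -async_jobs 72 -evp aes-128-cbc-hmac-sha1
  openssl speed -engine qatengine -elapsed -async_jobs 72 rsa2048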

venkatesh6911 commented 7 months ago

We just came across a similar issue reported internally.

Quoting from the issue, "On a multi-socket platform, there can be a performance degradation on the remote sockets. This can arise when either the threads are not affinitised to the core on the socket the device is on and/or the memory is not allocated on the appropriate NUMA node."

So, QAT performance on NUMA node 1 will be low compared to node 0 in PF mode. In VF mode, the memory allocations are more likely to be done on the remote socket, with minimal performance impact. -> This explains the data you got with VF mode.
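To verify the memory-allocation side of this, a rough sketch (assuming numastat from the numactl package, with <pid> standing in for your QAT-using process):

  # Per-node breakdown of where the process's pages actually landed
  numastat -p <pid>
  # Raw per-mapping counters (N0=/N1= pages per node)
  cat /proc/<pid>/numa_maps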

This issue will be fixed in a future release.

Xeroxxx commented 7 months ago

> So, QAT performance on NUMA node 1 will be low compared to node 0 in PF mode. In VF mode, the memory allocations are more likely to be done on the remote socket, with minimal performance impact. -> This explains the data you got with VF mode.

Even when QAT is attached to NUMA node 1? The remote node would technically be NUMA node 0. I can't get good performance in PF mode no matter what.

However, I'm running in VF mode anyway, but it's great to hear you are on it!

Feel free to close this issue or keep it open until fixed :)

Thank you @venkatesh6911