tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0

Migrate to a Linux 6.8 kernel #1808

Open krizhanovsky opened 1 year ago

krizhanovsky commented 1 year ago

We have lived with 5.10 for too long; it's time to migrate to the 6.1 longterm kernel.

Please update all the https://github.com/tempesta-tech/tempesta/wiki pages referencing the old kernel.

Please also grep and fix all TODO #1808 comments.

osevan commented 1 year ago

Are the sources already out?

Can I get a link to the source?

By the way, where can I meet the Tempesta Tech authors in a chat? The Telegram channel doesn't exist.

IRC, maybe?

The newest LTS kernel will be 6.1, according to a Linux kernel maintainer.

Thanks and

Best regards

krizhanovsky commented 1 year ago

Hi @osevan ,

so far we have only https://github.com/tempesta-tech/linux-5.10.35-tfw , which should be replaced with a newer longterm kernel in the next release.

Unfortunately, we don't have a public chat yet.

osevan commented 1 year ago

Can we create one on Telegram? t.me/tempestatech We can bring the community together.

krizhanovsky commented 1 year ago

Hi @osevan ,

At one point we had a public chat in Slack, but it faded away for lack of traction. It still makes sense to create a chat (BTW, Telegram looks like a good platform) and a Reddit group, but not before we reach GA.

krizhanovsky commented 7 months ago

It probably makes sense to migrate to 6.8 or later, even if it is not a longterm release yet, to get the TCP performance optimizations.

kingluo commented 4 months ago

> Probably it makes sense to migrate to 6.8 or later, even not stable yet, to get the TCP performance optimizations

I have some questions:

  1. 6.8 is not an LTS release, so I want to double-check whether it is worth targeting this release.
  2. Do we need to create a new repository like https://github.com/tempesta-tech/linux-5.10.35-tfw, or just a new patch file?
  3. Which commit is the cut-off point from which we migrate to the new kernel version? The master branch is growing every day.

krizhanovsky commented 4 months ago

I created a new repo for the kernel: https://github.com/tempesta-tech/linux-6.8.9-tfw , so please

kingluo commented 4 months ago

I divided this issue into 7 parts:


1. fpu invalid opcode: 0000

The ldmxcsr instruction executed by kernel_fpu_begin_mask() raises an invalid-opcode exception in interrupt context when the system boots at the initrd stage.

Unlike the old kernel, the new one runs softirqs even during the boot phase, when FPU-related state (registers, per-process FPU state) is not yet ready to be manipulated; touching it raises an exception. The FPU should therefore be enabled on the first kernel_fpu_begin() (i.e. the first crypto API call) rather than in __do_softirq().

[    0.000000] Linux version 6.8.9+ (kingluo@test) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #50 SMP PREEMPT_DYNAMIC Thu May 23 17:29:03 CST 2024
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.9+ root=UUID=af9a86ba-8a14-4ad9-b7d2-78508d3c3f1e ro recovery nomodeset dis_ucode_ldr console=ttyS0,38400
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Hygon HygonGenuine
[    0.000000]   Centaur CentaurHauls
[    0.000000]   zhaoxin   Shanghai
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] APIC: Static calls initialized
[    0.000000] SMBIOS 3.0.0 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.000000] last_pfn = 0x140000 max_arch_pfn = 0x400000000
[    0.000000] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
[    0.000000] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
[    0.000000] last_pfn = 0xbffe0 max_arch_pfn = 0x400000000
[    0.000000] found SMP MP-table at [mem 0x000f5480-0x000f548f]
[    0.000000] RAMDISK: [mem 0x33b89000-0x35dbbfff]
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x00000000000F5270 000014 (v00 BOCHS )
[    0.000000] ACPI: RSDT 0x00000000BFFE1D57 000034 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.000000] ACPI: FACP 0x00000000BFFE1BF3 000074 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 0x00000000BFFE0040 001BB3 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.000000] ACPI: FACS 0x00000000BFFE0000 000040
[    0.000000] ACPI: APIC 0x00000000BFFE1C67 000090 (v03 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.000000] ACPI: HPET 0x00000000BFFE1CF7 000038 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.000000] ACPI: WAET 0x00000000BFFE1D2F 000028 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.000000] ACPI: Reserving FACP table memory at [mem 0xbffe1bf3-0xbffe1c66]
[    0.000000] ACPI: Reserving DSDT table memory at [mem 0xbffe0040-0xbffe1bf2]
[    0.000000] ACPI: Reserving FACS table memory at [mem 0xbffe0000-0xbffe003f]
[    0.000000] ACPI: Reserving APIC table memory at [mem 0xbffe1c67-0xbffe1cf6]
[    0.000000] ACPI: Reserving HPET table memory at [mem 0xbffe1cf7-0xbffe1d2e]
[    0.000000] ACPI: Reserving WAET table memory at [mem 0xbffe1d2f-0xbffe1d56]
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000013fffffff]
[    0.000000] NODE_DATA(0) allocated [mem 0x13ffd2000-0x13fffcfff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000013fffffff]
[    0.000000]   Device   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x000000013fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
[    0.000000] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.000000] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.000000] On node 0, zone Normal: 32 pages in unavailable ranges
[    0.000000] ACPI: PM-Timer IO Port: 0x608
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.000000] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.000000] ACPI: Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.000000] TSC deadline timer available
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff]
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fffff]
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0xbffe0000-0xbfffffff]
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0xc0000000-0xfffbffff]
[    0.000000] PM: hibernation: Registered nosave memory: [mem 0xfffc0000-0xffffffff]
[    0.000000] [mem 0xc0000000-0xfffbffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on bare hardware
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.000000] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:1
[    0.000000] percpu: Embedded 64 pages/cpu s225280 r8192 d28672 u524288
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.9+ root=UUID=af9a86ba-8a14-4ad9-b7d2-78508d3c3f1e ro recovery nomodeset dis_ucode_ldr console=ttyS0,38400
[    0.000000] Booted with the nomodeset parameter. Only the system framebuffer will be available
[    0.000000] Unknown kernel command line parameters "recovery dis_ucode_ldr BOOT_IMAGE=/boot/vmlinuz-6.8.9+", will be passed to user space.
[    0.000000] random: crng init done
[    0.000000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[    0.000000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    0.000000] Fallback order for Node 0: 0
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 1031904
[    0.000000] Policy zone: Normal
[    0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
[    0.000000] software IO TLB: area num 4.
[    0.000000] Memory: 3971448K/4193784K available (18432K kernel code, 3189K rwdata, 7024K rodata, 12736K init, 4384K bss, 222076K reserved, 0K cma-reserved)
[    0.000000] Tempesta: allocated huge pages space (____ptrval____) 512MB at node 0
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] ftrace: allocating 51518 entries in 202 pages
[    0.000000] ftrace: allocated 202 pages with 4 groups
[    0.000000] Dynamic Preempt: voluntary
[    0.000000] rcu: Preemptible hierarchical RCU implementation.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=4.
[    0.000000]  Trampoline variant of Tasks RCU enabled.
[    0.000000]  Rude variant of Tasks RCU enabled.
[    0.000000]  Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.000000] NR_IRQS: 524544, nr_irqs: 456, preallocated irqs: 16
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] Console: colour VGA+ 80x25
[    0.000000] printk: legacy console [ttyS0] enabled
[    0.000000] ACPI: Core revision 20230628
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.004000] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    0.004000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.9+ #50
[    0.004000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.004000] RIP: 0010:__kernel_fpu_begin_mask+0x55/0xa0
[    0.004000] Code: 65 48 8b 3c 25 c0 36 03 00 f7 47 2c 00 40 20 00 74 3a 65 48 c7 05 7f 4a dc 58 00 00 00 00 f6 c3 02 74 0b c7 45 ec 80 1f 00 00 <0f> ae 55 ec 83 e3 01 75 34 48 8b 45 f0 65 48 2b 04 25 28 00 00 00
[    0.004000] RSP: 0000:ffffba7580003f58 EFLAGS: 00010002
[    0.004000] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
[    0.004000] RDX: 0000000000000001 RSI: 0000000004200000 RDI: ffffffffa8c0a980
[    0.004000] RBP: ffffba7580003f70 R08: 00000036e863fce8 R09: c8656ec1deb9f021
[    0.004000] R10: ffffba7580003dc0 R11: ffffffffa8d50728 R12: ffffffffa8c03d68
[    0.004000] R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000000
[    0.004000] FS:  0000000000000000(0000) GS:ffff9ddb3bc00000(0000) knlGS:0000000000000000
[    0.004000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.004000] CR2: ffff9ddb3ffff000 CR3: 0000000006a2e000 CR4: 00000000000000b0
[    0.004000] Call Trace:
[    0.004000]  <IRQ>
[    0.004000]  ? show_regs+0x6e/0x80
[    0.004000]  ? die+0x3c/0xa0
[    0.004000]  ? do_trap+0xd4/0xf0
[    0.004000]  ? do_error_trap+0x75/0xa0
[    0.004000]  ? __kernel_fpu_begin_mask+0x55/0xa0
[    0.004000]  ? exc_invalid_op+0x57/0x80
[    0.004000]  ? __kernel_fpu_begin_mask+0x55/0xa0
[    0.004000]  ? asm_exc_invalid_op+0x1f/0x30
[    0.004000]  ? __kernel_fpu_begin_mask+0x55/0xa0
[    0.004000]  __do_softirq+0x65/0x2af
[    0.004000]  __irq_exit_rcu+0x6b/0x90
[    0.004000]  irq_exit_rcu+0x12/0x20
[    0.004000]  common_interrupt+0x92/0xa0
[    0.004000]  </IRQ>
[    0.004000]  <TASK>
[    0.004000]  asm_common_interrupt+0x2b/0x40
[    0.004000] RIP: 0010:__x86_return_thunk+0x0/0x10
[    0.004000] Code: e8 01 00 00 00 cc e8 01 00 00 00 cc 48 81 c4 80 00 00 00 65 48 c7 04 25 d0 36 03 00 ff ff ff ff c3 cc 0f 1f 84 00 00 00 00 00 <c3> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc e9 db 7d 10 ff 0f
[    0.004000] RSP: 0000:ffffffffa8c03e18 EFLAGS: 00000206
[    0.004000] RAX: 0000000000000001 RBX: ffff9ddb001a2360 RCX: 0000000000015a00
[    0.004000] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff9ddb001a22a4
[    0.004000] RBP: ffffffffa8c03e20 R08: 00000000ffffffea R09: 0000000000000000
[    0.004000] R10: 0000000000000246 R11: ffff9ddb001a22a4 R12: ffff9ddb001a2200
[    0.004000] R13: ffff9ddb00197e00 R14: 0000000000000000 R15: 0000000000000000
[    0.004000]  ? _raw_spin_unlock_irqrestore+0x21/0x40
[    0.004000]  __setup_irq+0x4ce/0x7b0
[    0.004000]  request_threaded_irq+0x116/0x180
[    0.004000]  hpet_time_init+0x3e/0x60
[    0.004000]  x86_late_time_init+0x1f/0x40
[    0.004000]  start_kernel+0x442/0x780
[    0.004000]  x86_64_start_reservations+0x1c/0x30
[    0.004000]  x86_64_start_kernel+0x80/0x80
[    0.004000]  secondary_startup_64_no_verify+0x175/0x17b
[    0.004000]  </TASK>
[    0.004000] Modules linked in:
[    0.004000] ---[ end trace 0000000000000000 ]---
[    0.004000] RIP: 0010:__kernel_fpu_begin_mask+0x55/0xa0
[    0.004000] Code: 65 48 8b 3c 25 c0 36 03 00 f7 47 2c 00 40 20 00 74 3a 65 48 c7 05 7f 4a dc 58 00 00 00 00 f6 c3 02 74 0b c7 45 ec 80 1f 00 00 <0f> ae 55 ec 83 e3 01 75 34 48 8b 45 f0 65 48 2b 04 25 28 00 00 00
[    0.004000] RSP: 0000:ffffba7580003f58 EFLAGS: 00010002
[    0.004000] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
[    0.004000] RDX: 0000000000000001 RSI: 0000000004200000 RDI: ffffffffa8c0a980
[    0.004000] RBP: ffffba7580003f70 R08: 00000036e863fce8 R09: c8656ec1deb9f021
[    0.004000] R10: ffffba7580003dc0 R11: ffffffffa8d50728 R12: ffffffffa8c03d68
[    0.004000] R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000000
[    0.004000] FS:  0000000000000000(0000) GS:ffff9ddb3bc00000(0000) knlGS:0000000000000000
[    0.004000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.004000] CR2: ffff9ddb3ffff000 CR3: 0000000006a2e000 CR4: 00000000000000b0
[    0.004000] Kernel panic - not syncing: Fatal exception in interrupt
[    0.004000] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
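
A hedged reading of the register dump above: the faulting bytes `<0f> ae 55 ec` decode to ldmxcsr with an rbp-relative operand, and the dump shows CR4 = 0xb0. In the Intel SDM, bit 9 of CR4 is OSFXSR, and ldmxcsr raises #UD while that bit is clear, which would explain the invalid opcode. The small C helper below is only an illustration (`cr4_bit`, `sse_usable` and the macro names are made up here, not kernel identifiers); it can be used to check the CR4 values printed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* CR4 bit positions as documented in the Intel SDM. */
#define CR4_OSFXSR_BIT   9   /* OS supports FXSAVE/FXRSTOR and SSE */
#define CR4_OSXSAVE_BIT 18   /* OS supports XSAVE and extended states */
#define CR4_CET_BIT     23   /* Control-flow Enforcement Technology */

/* Return the value (0 or 1) of one bit of a CR4 dump. */
static inline int cr4_bit(uint64_t cr4, unsigned int bit)
{
    assert(bit < 64);
    return (int)((cr4 >> bit) & 1);
}

/* SSE instructions such as ldmxcsr raise #UD unless CR4.OSFXSR is set. */
static inline int sse_usable(uint64_t cr4)
{
    return cr4_bit(cr4, CR4_OSFXSR_BIT);
}
```

Applied to the boot-time value 0xb0, `sse_usable()` returns 0; applied to the value 0xb50ef0 from the later oopses in this thread, it returns 1, and the CET bit is set as well.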

2. paged-skb-patch

The skb code has been refactored heavily in recent kernels. For example, the kernel now allocates skb data with kmalloc_reserve(), which also returns the actual allocated size, so a straightforward port of this part of the patch causes memory errors, e.g. faults in memcpy().
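
The size pitfall can be sketched in user space. This is a simplified model, not the kernel code: `alloc_reserve`, `copy_payload` and the size classes are hypothetical stand-ins for an allocator that takes the requested size by pointer and writes back the size it really reserved. The point is that the caller has to lay out the buffer from the reported size, not from the size it asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size classes standing in for the slab's kmalloc buckets. */
static const size_t size_classes[] = { 64, 128, 256, 512, 1024, 2048, 4096 };

/*
 * Model allocator: rounds *size up to the next size class, stores the
 * rounded value back into *size and returns the buffer. Returns NULL if
 * the request fits no class or malloc() fails.
 */
static inline void *alloc_reserve(size_t *size)
{
    size_t i;

    assert(size != NULL);
    for (i = 0; i < sizeof(size_classes) / sizeof(size_classes[0]); i++) {
        if (*size <= size_classes[i]) {
            *size = size_classes[i];
            return malloc(*size);
        }
    }
    return NULL;
}

/*
 * Model caller: a trailer of 'trailer' bytes lives at the END of the
 * buffer, so its offset must be derived from the size reported by the
 * allocator. Returns the room left in front of the trailer, 0 on failure.
 */
static inline size_t copy_payload(const void *src, size_t len, size_t trailer)
{
    size_t size = len + trailer;
    unsigned char *buf = alloc_reserve(&size);
    size_t room;

    if (!buf)
        return 0;
    room = size - trailer;      /* 'size' may now exceed len + trailer */
    memcpy(buf, src, len);      /* len <= room always holds here */
    free(buf);
    return room;
}
```

A port that keeps computing offsets from the originally requested size would place the trailer in the wrong spot once the allocator rounds up, which is one way a memcpy() can end up outside the intended region.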

kingluo commented 4 months ago

3. Assembly problem: endbr64-disallow-indirect-jump

We have the following assembly functions:

fw/str_avx2.S:SYM_FUNC_START(__tfw_strtolower_avx2)
fw/str_avx2.S:SYM_FUNC_START(__tfw_stricmp_avx2)
fw/str_avx2.S:SYM_FUNC_START(__tfw_stricmp_avx2_2lc)
fw/str_avx2.S:SYM_FUNC_START(__tfw_match_custom)
fw/str_avx2.S:SYM_FUNC_START(__tfw_match_ctext_vchar)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_uri)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_token)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_token_lc)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_qetoken)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_nctl)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_xff)
fw/str_avx2.S:SYM_FUNC_START(tfw_match_cookie)
fw/str_avx2.S:SYM_FUNC_START(__tfw_match_etag)
lib/str_simd.S:SYM_FUNC_START(__memcpy_fast)
lib/str_simd.S:SYM_FUNC_START(__memcmp_fast)
lib/str_simd.S:SYM_FUNC_START(__bzero_fast)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_cmp_x86_64_4)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_add_x86_64)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_add_mod_p256_x86_64)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_sub_x86_64_5_4)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_sub_x86_64_4_4)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_sub_mod_p256_x86_64)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_sub_x86_64_3_3)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_sub_x86_64_2_2)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_shift_l_x86_64)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_shift_l_x86_64_4)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_shift_l1_mod_p256_x86_64)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_shift_r_x86_64)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_shift_r_x86_64_4)
tls/bignum_x86-64.S:SYM_FUNC_START(mpi_div2_x86_64_4)
  1. In the new kernel, assembly functions are expected to return through __x86_return_thunk. Our assembly code still uses a plain ret instruction, so objtool reports each one as a 'naked' return at build time (the kernel's own assembly uses the RET macro instead).
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_cmp_x86_64_4+0x35: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_add_x86_64+0x4a: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_add_mod_p256_x86_64+0x63: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sub_x86_64+0xc9: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sub_x86_64_5_4+0x3d: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sub_x86_64_4_4+0x31: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sub_mod_p256_x86_64+0x63: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sub_x86_64_3_3+0x25: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sub_x86_64_2_2+0x19: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_shift_l_x86_64+0x38: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_shift_l_x86_64_4+0x3f: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_shift_l1_mod_p256_x86_64+0x60: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_shift_r_x86_64+0x25: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_shift_r_x86_64_4+0x25: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_div2_x86_64_4+0x72: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_tpl_mod_p256_x86_64+0x9d: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_mul_x86_64_4+0x17d: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sqr_x86_64_4+0x112: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: ecp_mod_p256_x86_64+0x1f9: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_mul_int_x86_64_4+0x3f: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_mul_mod_p256_x86_64_4+0x351: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sqr_mod_p256_x86_64_4+0x2e6: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_from_mont_p256_x86_64+0xf8: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_mul_mont_mod_p256_x86_64+0x227: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: mpi_sqr_mont_mod_p256_x86_64+0x1c1: 'naked' return found in RETHUNK build
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: .rodata+0x10c0: data relocation to !ENDBR: mpi_sub_x86_64+0x9f
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: .rodata+0x10c8: data relocation to !ENDBR: mpi_sub_x86_64+0x90
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: .rodata+0x10d0: data relocation to !ENDBR: mpi_sub_x86_64+0x81
/home/kingluo/tempesta/tls/tempesta_tls.o: warning: objtool: .rodata+0x10d8: data relocation to !ENDBR: mpi_sub_x86_64+0x72
  2. SYM_FUNC_START in the new kernel places endbr64 at the head of each assembly function. With IBT enabled, an indirect jump to any target that does not start with ENDBR, such as a local label inside the same function, fails. Our assembly functions use jump tables to perform exactly such indirect jumps, so they raise a CET control-protection exception.

When IBT is enabled, an indirect branch (call or jump) to any instruction that is not an ENDBR32/64 instruction will cause a #CP exception.

[  680.737353] Missing ENDBR: .sic2lc_len1+0x0/0x4 [tempesta_fw]
[  680.737396] ------------[ cut here ]------------
[  680.737405] kernel BUG at arch/x86/kernel/cet.c:102!
[  680.737417] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  680.737427] CPU: 3 PID: 2676 Comm: curl Kdump: loaded Tainted: G        W  OE      6.8.9+ #84
[  680.737438] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[  680.737450] RIP: 0010:exc_control_protection+0xbf/0xd0
[  680.737463] Code: 4c 89 e7 e8 d3 2d 1c ff 44 89 f6 4c 89 e7 e8 18 08 00 00 41 5c 41 5d 41 5e 5d c3 cc cc cc cc 49 c7 44 24 50 00 00 00 00 eb de <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90
[  680.737481] RSP: 0018:ffffb302401605b0 EFLAGS: 00010002
[  680.737491] RAX: 0000000000000031 RBX: 0000000000000000 RCX: 0000000000000000
[  680.737500] RDX: 0000000000000000 RSI: ffff94ed3bda1840 RDI: ffff94ed3bda1840
[  680.737509] RBP: ffffb302401605c8 R08: 0000000000000000 R09: ffffb30240160438
[  680.737518] R10: ffffb30240160430 R11: ffffffff9b150728 R12: ffffb302401605d8
[  680.737527] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  680.737535] FS:  00007f324b92d740(0000) GS:ffff94ed3bd80000(0000) knlGS:0000000000000000
[  680.737544] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  680.737553] CR2: 0000560795bd9a08 CR3: 0000000005d6e000 CR4: 0000000000b50ef0
[  680.737563] Call Trace:
[  680.737571]  <IRQ>
[  680.737578]  ? show_regs+0x6e/0x80
[  680.737589]  ? die+0x3c/0xa0
[  680.737596]  ? do_trap+0xd4/0xf0
[  680.737605]  ? do_error_trap+0x75/0xa0
[  680.737613]  ? exc_control_protection+0xbf/0xd0
[  680.737622]  ? exc_invalid_op+0x57/0x80
[  680.737632]  ? exc_control_protection+0xbf/0xd0
[  680.737640]  ? asm_exc_invalid_op+0x1f/0x30
[  680.737650]  ? exc_control_protection+0xbf/0xd0
[  680.737659]  asm_exc_control_protection+0x2b/0x30
[  680.737667] RIP: 0010:.sic2lc_len1+0x0/0x4 [tempesta_fw]
[  680.737691] Code: 88 40 62 ba c2 32 4e 01 0f b6 c1 09 c2 0f b6 07 0f b6 80 40 62 ba c2 32 06 0f b6 c0 09 d0 c3 cc cc cc cc 31 c0 c3 cc cc cc cc <31> d2 eb df 31 d2 eb c8 31 c0 eb af 31 c0 eb 98 31 d2 eb 81 31 d2
[  680.737883] RSP: 0018:ffffb30240160680 EFLAGS: 00010297
[  680.738036] RAX: ffffffffc0b59ade RBX: 0000000000000001 RCX: 0000000000000003
[  680.738183] RDX: 0000000000000001 RSI: ffffffffc2baa1ef RDI: ffff94ec1e690061
[  680.738379] RBP: ffffb302401606a8 R08: ffffffffc2baa1ef R09: 0000000000000001
[  680.738585] R10: 0000000000000001 R11: ffff94ec1bc5f198 R12: ffff94ec1bc5f1b8
[  680.738788] R13: ffff94ec1bc5f198 R14: ffffffffc2baa1ef R15: ffff94ec1e690061
[  680.738995]  ? .sic2lc_len0+0x7/0x7 [tempesta_fw]
[  680.739214]  ? __try_str+0x50/0xb0 [tempesta_fw]
[  680.739488]  __h2_req_parse_accept+0x31f/0xdb0 [tempesta_fw]
[  680.739705]  tfw_h2_parse_req_hdr_val+0x330/0x76e0 [tempesta_fw]
[  680.739921]  ? tfw_huffman_decode+0x414/0x6d0 [tempesta_fw]
[  680.740135]  tfw_hpack_decode+0x738/0x1f70 [tempesta_fw]
[  680.740345]  tfw_h2_parse_req+0x164/0x270 [tempesta_fw]
[  680.740556]  ss_skb_process+0xf9/0x140 [tempesta_fw]
[  680.740765]  ? __pfx_tfw_h2_parse_req+0x10/0x10 [tempesta_fw]
[  680.740973]  tfw_http_req_process+0x8d/0xaa0 [tempesta_fw]
[  680.741181]  ? alloc_pages_mpol+0x95/0x1f0
[  680.741377]  ? alloc_pages+0x58/0xa0
[  680.741569]  ? __get_free_pages+0x15/0x40
[  680.741761]  ? tfw_pool_alloc_pages+0x7c/0x90 [tempesta_fw]
[  680.741972]  ? __tfw_pool_new+0x31/0x80 [tempesta_fw]
[  680.742173]  tfw_http_msg_process_generic+0x18d/0x6d0 [tempesta_fw]
[  680.742371]  ? ss_skb_chop_head_tail+0xc9/0x1e0 [tempesta_fw]
[  680.742567]  tfw_h2_frame_process+0x4d6/0x6e0 [tempesta_fw]
[  680.742756]  tfw_http_msg_process+0x50/0x60 [tempesta_fw]
[  680.742935]  tfw_connection_recv+0xc3/0x140 [tempesta_fw]
[  680.743112]  tfw_tls_connection_recv+0x31b/0x430 [tempesta_fw]
[  680.743290]  ss_tcp_process_data+0x1e1/0x470 [tempesta_fw]
[  680.743470]  ss_tcp_data_ready+0x83/0x170 [tempesta_fw]
[  680.743646]  tcp_data_ready+0x35/0xe0
[  680.743813]  tcp_data_queue+0x8d5/0xe20
[  680.743976]  tcp_rcv_established+0x244/0x790
[  680.744137]  tcp_v4_do_rcv+0x16a/0x2a0
[  680.744295]  tcp_v4_rcv+0xf01/0xf70
[  680.744451]  ? raw_local_deliver+0xcd/0x240
[  680.744611]  ip_protocol_deliver_rcu+0x37/0x180
[  680.744766]  ip_local_deliver_finish+0x8a/0xb0
[  680.744921]  ip_local_deliver+0x73/0x120
[  680.745075]  ? nf_nat_setup_info+0xb81/0xc20 [nf_nat]
[  680.745235]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[  680.745392]  ip_rcv+0x18f/0x1b0
[  680.745549]  ? __pfx_ip_rcv_finish+0x10/0x10
[  680.745706]  __netif_receive_skb_one_core+0x8a/0xa0
[  680.745715] Missing ENDBR: .sic2lc_len3+0x0/0x4 [tempesta_fw]
[  680.745866]  __netif_receive_skb+0x15/0x60
[  680.746034] ------------[ cut here ]------------
[  680.746194]  process_backlog+0x9a/0x140
[  680.746359] kernel BUG at arch/x86/kernel/cet.c:102!
[  680.746519]  __napi_poll+0x31/0x1d0
[  680.746820]  net_rx_action+0x29d/0x310
[  680.746963]  __do_softirq+0xcd/0x2a5
[  680.747105]  do_softirq.part.0+0x41/0x60
[  680.747246]  </IRQ>
[  680.747382]  <TASK>
[  680.747514]  __local_bh_enable_ip+0x6e/0x70
[  680.747648]  __dev_queue_xmit+0x33d/0xde0
[  680.747801]  ? hash_conntrack_raw+0x6b/0xe0 [nf_conntrack]
[  680.747943]  ip_finish_output2+0x2dc/0x550
[  680.748075]  ? nf_conntrack_in+0xeb/0x6c0 [nf_conntrack]
[  680.748213]  __ip_finish_output+0xb7/0x190
[  680.748344]  ip_finish_output+0x2d/0xe0
[  680.748474]  ip_output+0x63/0xf0
[  680.748603]  ? __pfx_ip_finish_output+0x10/0x10
[  680.748732]  ip_local_out+0x62/0x70
[  680.748860]  __ip_queue_xmit+0x19b/0x4f0
[  680.748988]  ip_queue_xmit+0x19/0x20
[  680.749113]  __tcp_transmit_skb+0xada/0xc90
[  680.749240]  tcp_write_xmit+0x5d0/0x1420
[  680.749367]  __tcp_push_pending_frames+0x3b/0x110
[  680.749494]  tcp_push+0x10c/0x120
[  680.749620]  tcp_sendmsg_locked+0x91a/0xdd0
[  680.749745]  tcp_sendmsg+0x31/0x50
[  680.749868]  inet_sendmsg+0x47/0x80
[  680.749989]  sock_write_iter+0x163/0x190
[  680.750112]  vfs_write+0x38a/0x430
[  680.750235]  ksys_write+0xb9/0xf0
[  680.750354]  __x64_sys_write+0x1d/0x30
[  680.750473]  x64_sys_call+0x1681/0x20c0
[  680.750593]  do_syscall_64+0x54/0x120
[  680.750709]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[  680.750828] RIP: 0033:0x7f324bb14887
[  680.750946] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  680.751213] RSP: 002b:00007ffd5bc22778 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  680.751354] RAX: ffffffffffffffda RBX: 000000000000004a RCX: 00007f324bb14887
[  680.751495] RDX: 000000000000004a RSI: 0000560795bd5953 RDI: 0000000000000005
[  680.751638] RBP: 0000560795ba4ae0 R08: 0000000000000001 R09: 00007f324b6eb340
[  680.751783] R10: 0000000000000068 R11: 0000000000000246 R12: 0000560795bd5953
[  680.751927] R13: 000000000000004a R14: 00007ffd5bc22820 R15: 0000560795ba5c50
[  680.751965] Missing ENDBR: .sic2lc_len3+0x0/0x4 [tempesta_fw]
[  680.752072]  </TASK>
[  680.752186] ------------[ cut here ]------------

Solution

The relevant mechanism is the notrack prefix: "Prefix used with indirect CALL/JMP near instructions (opcodes FF /2 and FF /4) to indicate that the branch target is not required to start with an ENDBR32/64 instruction. Prefix only honored when NO_TRACK_EN flag is set."

As an aside: if a user-mode C program contains a switch statement that qualifies for a jump table (gcc uses -fcf-protection=full by default), the generated code jumps through the table with a notrack-prefixed jmp, and IBT is marked as enabled in the .note.gnu.property section of the compiled ELF file, so that NO_TRACK_EN in the MSR is set when the kernel loads and executes the program. User-mode code can therefore use notrack to bypass CET without having to set NO_TRACK_EN itself.
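
For anyone who wants to look at this in user space, here is a minimal sketch of a switch that is a typical jump-table candidate. The function and enum names are made up for illustration. Whether a jump table is actually emitted depends on the compiler version and optimization level; compiling with `gcc -O2 -fcf-protection=full -S` and searching the output for `notrack jmp` is one way to check.

```c
/* Hypothetical opcodes; dense, consecutive values favor a jump table. */
enum op { OP_ADD, OP_SUB, OP_MUL, OP_AND, OP_OR, OP_XOR, OP_SHL, OP_SHR };

/*
 * Each case has a distinct body, so the compiler cannot simply replace the
 * switch with a data lookup; it may instead dispatch with one indirect jmp
 * through a table of code addresses.
 */
unsigned int apply(enum op op, unsigned int a, unsigned int b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    case OP_XOR: return a ^ b;
    case OP_SHL: return a << (b & 31);  /* mask keeps the shift defined */
    case OP_SHR: return a >> (b & 31);
    default:     return 0;
    }
}
```

The case labels inside `apply()` are not function entry points and normally carry no endbr64, which is the same situation as the local labels targeted by the jump tables in our assembly.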

kingluo commented 3 months ago

4. kernel crash triggered by test cases

4.1 cryptd_queue_worker

This issue is related to the FPU. Because FPU save/restore in softirq context does not work yet, some assumptions in our code are broken:

  1. crypto_aead_encrypt() falls back to asynchronous mode, but our code expects synchronous completion: as soon as the call returns it releases the request, so cryptd may later pick up an already released request.
[  145.592329]  cryptd_enqueue_request+0x2f/0xb0 [cryptd]
[  145.592484]  cryptd_aead_encrypt_enqueue+0x40/0x50 [cryptd]
[  145.592642]  crypto_aead_encrypt+0x32/0x60
[  145.592797]  simd_aead_encrypt+0xac/0xd0 [crypto_simd]
[  145.592951]  crypto_aead_encrypt+0x32/0x60
[  145.593105]  ttls_encrypt+0x133/0x230 [tempesta_tls]
[  145.593270]  tfw_tls_encrypt+0x646/0x8c0 [tempesta_fw]
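
The lifetime problem can be modelled in user space. This is a sketch under assumptions, not Tempesta or kernel code: `struct req`, `model_encrypt`, `model_worker` and `submit` are hypothetical stand-ins for the AEAD request, the encrypt call, the cryptd worker and our caller. The one real convention used is that the crypto API reports a queued request with -EINPROGRESS, in which case the request still belongs to the worker until the completion callback runs.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request type standing in for an AEAD request. */
struct req {
    int  done;                       /* set once encryption has finished */
    void (*complete)(struct req *);  /* completion callback */
};

static struct req *pending;  /* one-slot queue standing in for cryptd */
static int freed;            /* number of requests released so far */

/* Release a request and count it. */
static void release(struct req *r)
{
    free(r);
    freed++;
}

/* Model encrypt call: synchronous if SIMD is usable, queued otherwise. */
static inline int model_encrypt(struct req *r, int simd_usable)
{
    if (simd_usable) {
        r->done = 1;
        return 0;
    }
    pending = r;
    return -EINPROGRESS;
}

/* Model worker: finishes the queued request and runs its callback. */
static inline void model_worker(void)
{
    struct req *r = pending;

    if (!r)
        return;
    pending = NULL;
    r->done = 1;
    r->complete(r);
}

/* Caller that releases the request only when it is really finished. */
static inline int submit(int simd_usable)
{
    struct req *r = calloc(1, sizeof(*r));
    int rc;

    if (!r)
        return -ENOMEM;
    r->complete = release;
    rc = model_encrypt(r, simd_usable);
    if (rc == 0)
        release(r);  /* synchronous completion: safe to free now */
    /* On -EINPROGRESS the worker still owns r and frees it via callback. */
    return rc;
}
```

A caller that frees the request unconditionally after the call returns would leave `pending` pointing at released memory, which matches the symptom described above.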
  2. Recursive kernel_fpu_begin() and kernel_fpu_end() calls:
[  467.122454] WARNING: CPU: 1 PID: 250 at arch/x86/kernel/fpu/core.c:428 __kernel_fpu_begin_mask+0xaa/0xc0
[  467.122643] Modules linked in: tempesta_fw(OE) tempesta_db(OE) tempesta_tls(OE) tempesta_lib(OE) tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 rapl ppdev e1000 i2c_piix4 parport_pc floppy parport qemu_fw_cfg binfmt_misc sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 bochs drm_vram_helper drm_ttm_helper ttm input_leds drm_kms_helper psmouse drm serio_raw pata_acpi mac_hid aesni_intel crypto_simd cryptd [last unloaded: tempesta_lib(OE)]
[  467.123487] CPU: 1 PID: 250 Comm: jbd2/sda2-8 Kdump: loaded Tainted: G           OE      6.8.9+ #83
[  467.123680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[  467.123824] RIP: 0010:__kernel_fpu_begin_mask+0xaa/0xc0
[  467.123973] Code: 34 48 8b 5d f8 c9 c3 cc cc cc cc 48 8b 07 f6 c4 40 75 be f0 80 4f 01 40 48 81 c7 c0 15 00 00 e8 ac fc ff ff eb ab db e3 eb c8 <0f> 0b e9 7b ff ff ff 0f 0b eb 82 e8 46 70 e8 00 66 0f 1f 44 00 00
[  467.124293] RSP: 0018:ffffa883c00f8718 EFLAGS: 00010246
[  467.124462] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
[  467.124661] RDX: 0000000080000104 RSI: ffff97185233b850 RDI: 0000000000000002
[  467.124895] RBP: ffffa883c00f8730 R08: ffff97185594b5b3 R09: ffff97185038c440
[  467.125155] R10: 000000000000000d R11: 0000000000000000 R12: ffff97185038c440
[  467.125322] R13: ffffa883c00f8830 R14: 0000000000000000 R15: 0000000000000001
[  467.125491] FS:  0000000000000000(0000) GS:ffff97193bc80000(0000) knlGS:0000000000000000
[  467.125665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  467.125868] CR2: 00007ff3a4a5e000 CR3: 0000000051ae6000 CR4: 0000000000b50ef0
[  467.126049] Call Trace:
[  467.126229]  <IRQ>
[  467.126409]  ? show_regs+0x6e/0x80
[  467.126594]  ? __kernel_fpu_begin_mask+0xaa/0xc0
[  467.126780]  ? __warn+0x91/0x150
[  467.126963]  ? __kernel_fpu_begin_mask+0xaa/0xc0
[  467.127141]  ? report_bug+0x19d/0x1b0
[  467.127323]  ? handle_bug+0x46/0x80
[  467.127510]  ? exc_invalid_op+0x1d/0x80
[  467.127691]  ? asm_exc_invalid_op+0x1f/0x30
[  467.127874]  ? __kernel_fpu_begin_mask+0xaa/0xc0
[  467.128057]  ? __kernel_fpu_begin_mask+0x28/0xc0
[  467.128240]  kernel_fpu_begin_mask+0x19/0x20
[  467.128428]  gcmaes_crypt_by_sg+0xf9/0x390 [aesni_intel]
[  467.128621]  ? get_page_from_freelist+0x137b/0x1580
[  467.128814]  ? scatterwalk_map_and_copy+0x55/0x80
[  467.129007]  gcmaes_encrypt+0x4d/0xa0 [aesni_intel]
[  467.129198]  generic_gcmaes_encrypt+0x55/0x70 [aesni_intel]
[  467.129394]  crypto_aead_encrypt+0x32/0x60
[  467.129585]  simd_aead_encrypt+0x86/0x90 [crypto_simd]
[  467.129778]  crypto_aead_encrypt+0x32/0x60
[  467.130217]  ttls_encrypt+0x133/0x230 [tempesta_tls]
[  467.130423]  tfw_tls_encrypt+0x646/0x8c0 [tempesta_fw]
[  467.130639]  ? __wake_up_common+0x7b/0xa0
[  467.130828]  ? sock_def_readable+0x76/0xd0
[  467.131114]  ? tcp_data_ready+0x35/0xe0
[  467.131297]  ? tcp_rcv_established+0x61c/0x790
[  467.131478]  ? tcp_v4_do_rcv+0x16a/0x2a0
[  467.131655]  ? tfw_h2_stream_fsm+0x79/0x5c0 [tempesta_fw]
[  467.131850]  ? tfw_h2_stream_send_process+0xb0/0x120 [tempesta_fw]
[  467.132036]  ? tfw_sk_prepare_xmit+0x5c9/0x870 [tempesta_fw]
[  467.132215]  tfw_sk_write_xmit+0x5f/0x90 [tempesta_fw]
[  467.132387]  tcp_tfw_sk_write_xmit+0x33/0x80
[  467.132544]  tcp_write_xmit+0x5ad/0x1420
[  467.132693]  __tcp_push_pending_frames+0x3b/0x110
[  467.132835]  tcp_push+0x10c/0x120
[  467.132976]  ss_tx_action+0x512/0x6e0 [tempesta_fw]
[  467.133127]  net_tx_action+0xa1/0x2d0
[  467.133262]  __do_softirq+0xcd/0x2a0
[  467.133389]  __irq_exit_rcu+0x6b/0x90
[  467.133508]  irq_exit_rcu+0x12/0x20
[  467.133625]  sysvec_call_function_single+0x84/0x90
[  467.133741]  </IRQ>
[  467.133880]  <TASK>
[  467.133992]  asm_sysvec_call_function_single+0x1f/0x30
[  467.134106] RIP: 0010:crc_pcl+0x82c/0x12e0
[  467.134221] Code: 0f 38 f1 8a b8 fd ff ff f2 4d 0f 38 f1 93 b8 fd ff ff f3 0f 1e fa f2 4c 0f 38 f1 81 c0 fd ff ff f2 4c 0f 38 f1 8a c0 fd ff ff <f2> 4d 0f 38 f1 93 c0 fd ff ff f3 0f 1e fa f2 4c 0f 38 f1 81 c8 fd
[  467.134464] RSP: 0018:ffffa883c02bbc90 EFLAGS: 00000246
[  467.134587] RAX: 0000000000000080 RBX: 0000000000001000 RCX: ffff97192f424400
[  467.134709] RDX: ffff97192f424800 RSI: 0000000000001000 RDI: 0000000000000000
[  467.134832] RBP: ffffa883c02bbcc8 R08: 00000000587f01d9 R09: 0000000036a3e694
[  467.134956] R10: 000000009a02b44f R11: ffff97192f424c00 R12: ffff97192f424000
[  467.135082] R13: 0000000000001000 R14: ffff97185291d958 R15: 0000000000000000
[  467.135210]  ? crc32c_pcl_intel_update+0xb0/0xd0
[  467.135339]  crypto_shash_update+0x25/0x40
[  467.135468]  jbd2_journal_commit_transaction+0xa3e/0x1980
[  467.135601]  ? lock_timer_base+0x72/0xa0
[  467.135734]  kjournald2+0xaf/0x270
[  467.135864]  ? __pfx_autoremove_wake_function+0x10/0x10
[  467.135996]  ? __pfx_kjournald2+0x10/0x10
[  467.136129]  kthread+0xfb/0x130
[  467.136261]  ? __pfx_kthread+0x10/0x10
[  467.136390]  ret_from_fork+0x40/0x60
[  467.136520]  ? __pfx_kthread+0x10/0x10
[  467.136648]  ret_from_fork_asm+0x1b/0x30
[  467.136779]  </TASK>
0xffffffff810bd740 is in crc32c_pcl_intel_update (arch/x86/crypto/crc32c-intel_glue.c:162).
157              * use faster PCL version if datasize is large enough to
158              * overcome kernel fpu state save/restore overhead
159              */
160             if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
161                     kernel_fpu_begin();
162                     *crcp = crc_pcl(data, len, *crcp);
163                     kernel_fpu_end();
164             } else
165                     *crcp = crc32c_intel_le_hw(*crcp, data, len);
166             return 0;
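
To make the nesting in the trace above easier to follow, here is a minimal userspace sketch of it. This is a model under stated assumptions, not kernel code: the `model_*` names are hypothetical, and the single boolean stands in for whatever per-CPU state the kernel keeps for "this CPU is inside a kernel FPU section". The task-context CRC code opens an FPU section, the simulated softirq arrives in the middle of it, and the encryption path has to check usability first and fall back instead of opening a nested section.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU "inside a kernel FPU section" state. */
static bool model_in_kernel_fpu;
/* Counts nested begin calls, i.e. what shows up as a WARNING in the kernel log. */
static int model_nested_warnings;

/* Model of a usability check: SIMD must not be used inside an open section. */
static bool model_fpu_usable(void)
{
	return !model_in_kernel_fpu;
}

static void model_fpu_begin(void)
{
	if (model_in_kernel_fpu)
		model_nested_warnings++;	/* recursive begin */
	model_in_kernel_fpu = true;
}

static void model_fpu_end(void)
{
	assert(model_in_kernel_fpu);
	model_in_kernel_fpu = false;
}

enum model_path { MODEL_PATH_SIMD, MODEL_PATH_FALLBACK };

/* Softirq-side encryption: check first, never nest. */
static enum model_path model_softirq_encrypt(void)
{
	if (!model_fpu_usable())
		return MODEL_PATH_FALLBACK;	/* non-SIMD or deferred path */
	model_fpu_begin();
	/* SIMD work would run here. */
	model_fpu_end();
	return MODEL_PATH_SIMD;
}

/* Task-side CRC work that is "interrupted" while its FPU section is open. */
static enum model_path model_task_crc_interrupted(void)
{
	enum model_path path;

	model_fpu_begin();
	path = model_softirq_encrypt();	/* simulated interrupt */
	model_fpu_end();
	return path;
}
```

In this model the interrupted case takes the fallback path and no nested begin is recorded. Calling `model_fpu_begin()` unconditionally from the softirq side would bump `model_nested_warnings`, which corresponds to the warning shown above.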

workaround

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
modified: crypto/simd.c
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
@@ -317,10 +317,10 @@ static int simd_aead_encrypt(struct aead_request *req)
    subreq = aead_request_ctx(req);
    *subreq = *req;

-   if (!crypto_simd_usable() ||
-       (in_atomic() && cryptd_aead_queued(ctx->cryptd_tfm)))
-       child = &ctx->cryptd_tfm->base;
-   else
+   //if (!crypto_simd_usable() ||
+   //    (in_atomic() && cryptd_aead_queued(ctx->cryptd_tfm)))
+   //  child = &ctx->cryptd_tfm->base;
+   //else
        child = cryptd_aead_child(ctx->cryptd_tfm);

    aead_request_set_tfm(subreq, child);

log

The cryptd worker queue contains invalid list items whose next and prev point to invalid addresses.

[  843.567809] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  843.567949] #PF: supervisor write access in kernel mode
[  843.568058] #PF: error_code(0x0002) - not-present page
[  843.568163] PGD 0 P4D 0
[  843.568302] Oops: 0002 [#1] PREEMPT SMP NOPTI
[  843.568453] CPU: 2 PID: 7190 Comm: kworker/2:0 Kdump: loaded Tainted: G           OE      6.8.9+ #45
[  843.568611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[  843.568777] Workqueue: cryptd cryptd_queue_worker [cryptd]
[  843.568952] RIP: 0010:crypto_dequeue_request+0x3e/0x60
[  843.569126] Code: 83 e8 01 89 47 18 48 8b 47 10 48 39 f8 74 07 48 8b 00 48 89 47 10 48 be 00 01 00 00 00 00 ad de 48 8b 07 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 89 30 48 83 c6 22 48 89 70 08 5d c3 cc cc
[  843.569489] RSP: 0018:ffffb1724440be20 EFLAGS: 00010246
[  843.569668] RAX: ffff99288c27e850 RBX: ffff9928a4b24780 RCX: 0000000000000000
[  843.569851] RDX: 0000000000000000 RSI: dead000000000100 RDI: ffffd1723fd014e0
[  843.570037] RBP: ffffb1724440be20 R08: ffff9929a0cc48b0 R09: ffff9928a4b24800
[  843.570226] R10: 0000000000000007 R11: 0000000000000007 R12: ffffd1723fd01500
[  843.570418] R13: ffffd1723fd014e0 R14: ffff9929bbd33880 R15: ffff9929a0cc8c05
[  843.570613] FS:  0000000000000000(0000) GS:ffff9929bbd00000(0000) knlGS:0000000000000000
[  843.570815] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  843.571013] CR2: 0000000000000008 CR3: 000000002785a000 CR4: 0000000000b50ef0
[  843.571218] Call Trace:
[  843.571476]  <TASK>
[  843.571685]  ? show_regs+0x6e/0x80
[  843.571895]  ? __die+0x29/0x70
[  843.572099]  ? page_fault_oops+0x160/0x460
[  843.572299]  ? finish_task_switch.isra.0+0x85/0x280
[  843.572498]  ? __schedule+0x37d/0xb30
[  843.572695]  ? do_user_addr_fault+0x2f2/0x6a0
[  843.572889]  ? update_load_avg+0x82/0x7c0
[  843.573080]  ? exc_page_fault+0x7d/0x190
[  843.573267]  ? asm_exc_page_fault+0x2b/0x30
[  843.573454]  ? crypto_dequeue_request+0x3e/0x60
[  843.573638]  cryptd_queue_worker+0xac/0xd0 [cryptd]
[  843.573827]  process_one_work+0x179/0x350
[  843.574009]  ? __pfx_worker_thread+0x10/0x10
[  843.574189]  worker_thread+0x2f7/0x420
[  843.574368]  ? __pfx_worker_thread+0x10/0x10
[  843.574545]  kthread+0xfb/0x130
[  843.574720]  ? __pfx_kthread+0x10/0x10
[  843.574892]  ret_from_fork+0x40/0x60
[  843.575063]  ? __pfx_kthread+0x10/0x10
[  843.575232]  ret_from_fork_asm+0x1b/0x30
[  843.575400]  </TASK>
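
The request lifetime problem behind this oops can be sketched in userspace. The code below is only a model under stated assumptions: every `model_*` name is hypothetical, the linked list stands in for the cryptd queue, and `-EINPROGRESS` is used the way the kernel crypto API uses it to report that a request was queued rather than finished. The point it illustrates is that the caller may free the request only when the call completed synchronously; otherwise the completion callback owns the release.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an AEAD request. */
struct model_req {
	void (*complete)(struct model_req *req, int err);
	struct model_req *next;		/* link used while the request is queued */
};

static struct model_req *model_queue;	/* stand-in for the cryptd queue */
static int model_freed;			/* number of released requests */

/* The only place a request is released. */
static void model_release(struct model_req *req, int err)
{
	(void)err;
	model_freed++;
	free(req);
}

/* Finish inline (return 0) or queue the request (return -EINPROGRESS). */
static int model_encrypt(struct model_req *req, bool force_async)
{
	if (!force_async)
		return 0;
	req->next = model_queue;
	model_queue = req;
	return -EINPROGRESS;
}

/* Worker: dequeue each pending request and run its completion callback. */
static int model_worker(void)
{
	int done = 0;

	while (model_queue) {
		struct model_req *req = model_queue;

		model_queue = req->next;
		assert(req->complete != NULL);
		req->complete(req, 0);
		done++;
	}
	return done;
}

/* Caller: release immediately only if the call was synchronous. */
static int model_send(bool force_async)
{
	struct model_req *req = calloc(1, sizeof(*req));
	int r;

	if (!req)
		return -ENOMEM;
	req->complete = model_release;
	r = model_encrypt(req, force_async);
	if (r == -EINPROGRESS)
		return r;	/* still queued, the worker releases it later */
	model_release(req, r);
	return r;
}
```

If `model_send()` released the request unconditionally, `model_worker()` would later walk a queue containing freed memory, which is the situation the oops above shows in crypto_dequeue_request().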

4.2 ipv6_dup_options

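rcu_dereference(tcp_inet6_sk(sk)->opt) returns an invalid pointer: in the logs below it holds small non-NULL values (0xf in RSI for ipv6_dup_options, 0x10 in RBX for ip6_xmit), so the `if (opt)` check passes and the first field access faults. The cause is not yet known.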

workaround

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
modified: net/ipv6/tcp_ipv6.c
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
@@ -564,8 +564,8 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,

        rcu_read_lock();
        opt = ireq->ipv6_opt;
-       if (!opt)
-           opt = rcu_dereference(np->opt);
+       //if (!opt)
+       //  opt = rcu_dereference(np->opt);
        err = ip6_xmit(sk, skb, fl6, skb->mark ? : READ_ONCE(sk->sk_mark),
                   opt, tclass, READ_ONCE(sk->sk_priority));
        rcu_read_unlock();
@@ -1489,8 +1489,8 @@ static struct sock *tcp_v6_syn_recv_sock(const struct sock *sk, struct sk_buff *
       to newsk.
     */
    opt = ireq->ipv6_opt;
-   if (!opt)
-       opt = rcu_dereference(np->opt);
+   //if (!opt)
+   //  opt = rcu_dereference(np->opt);
    if (opt) {
        opt = ipv6_dup_options(newsk, opt);
        RCU_INIT_POINTER(newnp->opt, opt);

log

[ 3011.280247] BUG: kernel NULL pointer dereference, address: 0000000000000013
[ 3011.280329] #PF: supervisor read access in kernel mode
[ 3011.280397] #PF: error_code(0x0000) - not-present page
[ 3011.280466] PGD 0 P4D 0
[ 3011.280538] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 3011.280611] CPU: 2 PID: 45682 Comm: curl Kdump: loaded Tainted: G           OE      6.8.9+ #45
[ 3011.280692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 3011.280778] RIP: 0010:ipv6_dup_options+0x18/0xa0
[ 3011.280872] Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 ba 20 08 00 00 48 89 e5 41 54 49 89 f4 53 <8b> 76 04 e8 d0 75 e2 ff 49 89 c0 48 85 c0 74 60 49 63 54 24 04 4c
[ 3011.281079] RSP: 0018:ffffa8c64012cb28 EFLAGS: 00010206
[ 3011.281189] RAX: 0000000000000000 RBX: ffff8c555b0a1480 RCX: 0000000000000000
[ 3011.281321] RDX: 0000000000000820 RSI: 000000000000000f RDI: ffff8c55d2a20000
[ 3011.281439] RBP: ffffa8c64012cb38 R08: 0000000000100000 R09: ffff8c55d2a20070
[ 3011.281555] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000f
[ 3011.281672] R13: ffff8c557378ed68 R14: ffff8c5661fee200 R15: ffff8c55d2a20000
[ 3011.281791] FS:  00007f95a3c32740(0000) GS:ffff8c567bd00000(0000) knlGS:0000000000000000
[ 3011.281915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3011.282039] CR2: 0000000000000013 CR3: 000000008c8d2000 CR4: 0000000000b50ef0
[ 3011.282170] Call Trace:
[ 3011.282300]  <IRQ>
[ 3011.282431]  ? show_regs+0x6e/0x80
[ 3011.282571]  ? __die+0x29/0x70
[ 3011.282703]  ? page_fault_oops+0x160/0x460
[ 3011.282839]  ? ip6_pol_route_output+0x1d/0x30
[ 3011.282975]  ? fib6_rule_lookup+0x12e/0x270
[ 3011.283112]  ? tcp_v6_send_synack+0x15e/0x2a0
[ 3011.283248]  ? do_user_addr_fault+0x2f2/0x6a0
[ 3011.283386]  ? exc_page_fault+0x7d/0x190
[ 3011.283526]  ? asm_exc_page_fault+0x2b/0x30
[ 3011.283668]  ? ipv6_dup_options+0x18/0xa0
[ 3011.283809]  tcp_v6_syn_recv_sock+0x28e/0x840
[ 3011.283951]  tcp_check_req+0x143/0x5b0
[ 3011.284095]  tcp_v6_rcv+0xb8c/0xf70
[ 3011.284238]  ? raw6_local_deliver+0x90/0x250
[ 3011.284384]  ip6_protocol_deliver_rcu+0x72/0x4b0
[ 3011.284532]  ip6_input_finish+0x49/0x70
[ 3011.284680]  ip6_input+0x43/0xe0
[ 3011.284828]  ? nf_hook_slow+0x48/0x100
[ 3011.284978]  ipv6_rcv+0x16e/0x1a0
[ 3011.285128]  ? __pfx_ip6_rcv_finish+0x10/0x10
[ 3011.285281]  __netif_receive_skb_one_core+0x64/0xa0
[ 3011.285437]  __netif_receive_skb+0x15/0x60
[ 3011.285592]  process_backlog+0x9a/0x140
[ 3011.285748]  __napi_poll+0x31/0x1d0
[ 3011.285904]  net_rx_action+0x29d/0x310
[ 3011.286062]  __do_softirq+0xcd/0x2a0
[ 3011.286221]  do_softirq.part.0+0x41/0x60
[ 3011.286381]  </IRQ>
[ 3011.286537]  <TASK>
[ 3011.286689]  __local_bh_enable_ip+0x6e/0x70
[ 3011.286843]  __dev_queue_xmit+0x33d/0xde0
[ 3011.286997]  ip6_finish_output2+0x310/0x720
[ 3011.287150]  ? ip6_output+0x74/0x140
[ 3011.287301]  ? chacha_block_generic+0x71/0xb0
[ 3011.287453]  ip6_finish_output+0x1fb/0x320
[ 3011.287598]  ip6_output+0x74/0x140
[ 3011.287737]  ? try_to_wake_up+0x81/0x630
[ 3011.287870]  ? ip6_dst_check+0xcd/0xf0
[ 3011.287993]  ip6_xmit+0x42d/0x6a0
[ 3011.288115]  ? ip6_dst_check+0xcd/0xf0
[ 3011.288233]  ? __sk_dst_check+0x41/0xa0
[ 3011.288353]  ? inet6_csk_route_socket+0x123/0x210
[ 3011.288472]  ? lock_timer_base+0x72/0xa0
[ 3011.288592]  inet6_csk_xmit+0xdb/0x140
[ 3011.288708]  __tcp_transmit_skb+0x57e/0xc90
[ 3011.288823]  ? __alloc_skb+0xdd/0x1a0
[ 3011.288934]  __tcp_send_ack.part.0+0xc6/0x1a0
[ 3011.289044]  tcp_send_ack+0x20/0x30
[ 3011.289151]  tcp_rcv_state_process+0x39b/0x1060
[ 3011.289262]  tcp_v6_do_rcv+0x1d6/0x510
[ 3011.289371]  __release_sock+0x72/0xd0
[ 3011.289481]  release_sock+0x34/0xb0
[ 3011.289590]  inet_stream_connect+0x4b/0x60
[ 3011.289700]  __sys_connect_file+0x6a/0x80
[ 3011.289809]  __sys_connect+0xaa/0xe0
[ 3011.289917]  ? do_fcntl+0x1ec/0x680
[ 3011.290027]  __x64_sys_connect+0x1c/0x30
[ 3011.290135]  x64_sys_call+0x1e1d/0x20c0
[ 3011.290246]  do_syscall_64+0x54/0x120
[ 3011.290355]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   85.586771] BUG: kernel NULL pointer dereference, address: 000000000000001a
[   85.586939] #PF: supervisor read access in kernel mode
[   85.587071] #PF: error_code(0x0000) - not-present page
[   85.587195] PGD 0 P4D 0
[   85.587356] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   85.587481] CPU: 3 PID: 2774 Comm: curl Kdump: loaded Tainted: G           OE      6.8.9+ #70
[   85.587609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   85.587737] RIP: 0010:ip6_xmit+0xc8/0x6a0
[   85.587868] Code: 08 88 44 24 37 0f b7 47 3c 01 c6 49 8b 84 24 d0 00 00 00 49 2b 84 24 c8 00 00 00 83 e6 f0 83 c6 40 48 85 db 0f 84 46 04 00 00 <0f> b7 53 0a 44 0f b7 43 08 89 d1 44 01 c2 44 89 c7 01 d6 39 c6 0f
[   85.588136] RSP: 0018:ffff9e03001608b8 EFLAGS: 00010202
[   85.588297] RAX: 0000000000000118 RBX: 0000000000000010 RCX: 000000000000000a
[   85.588432] RDX: ffff9e0300160b10 RSI: 0000000000000040 RDI: ffff8e60e11a8000
[   85.588568] RBP: ffff9e03001609a8 R08: 0000000000000010 R09: ffff8e5ffdc465a0
[   85.588701] R10: ffff9e0300160a58 R11: 0000000000000028 R12: ffff8e60ea051d00
[   85.588964] R13: ffff8e5ffdc45c40 R14: ffff8e60e877cd00 R15: ffff9e0300160b10
[   85.589144] FS:  00007fe2c7346740(0000) GS:ffff8e60fbd80000(0000) knlGS:0000000000000000
[   85.589277] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   85.589409] CR2: 000000000000001a CR3: 0000000010d30000 CR4: 0000000000b50ef0
[   85.589544] Call Trace:
[   85.589687]  <IRQ>
[   85.589894]  ? show_regs+0x6e/0x80
[   85.590082]  ? __die+0x29/0x70
[   85.590266]  ? page_fault_oops+0x160/0x460
[   85.590453]  ? fib6_table_lookup+0x17a/0x2d0
[   85.590640]  ? do_user_addr_fault+0x2f2/0x6a0
[   85.590836]  ? exc_page_fault+0x7d/0x190
[   85.591022]  ? asm_exc_page_fault+0x2b/0x30
[   85.591210]  ? ip6_xmit+0xc8/0x6a0
[   85.591396]  ? __alloc_skb+0xdd/0x1a0
[   85.591581]  ? tcp_make_synack+0x3cf/0x5f0
[   85.591782]  tcp_v6_send_synack+0x156/0x2a0
[   85.591957]  tcp_conn_request+0xae9/0xd60
[   85.592130]  ? kfree_skbmem+0x52/0xa0
[   85.592300]  tcp_v6_conn_request+0x7b/0xb0
[   85.592466]  ? tempesta_sock_tcp_rcv+0x50/0x60
[   85.592634]  ? tcp_v6_conn_request+0x7b/0xb0
[   85.592856]  tcp_rcv_state_process+0x44d/0x1060
[   85.593034]  ? security_sock_rcv_skb+0x33/0x50
[   85.593207]  ? sk_filter_trim_cap+0x123/0x260
[   85.593368]  tcp_v6_do_rcv+0x1d6/0x510
[   85.593526]  tcp_v6_rcv+0xf2f/0xf70
[   85.593682]  ? raw6_local_deliver+0x12f/0x250
[   85.593855]  ip6_protocol_deliver_rcu+0x72/0x4b0
[   85.594013]  ip6_input_finish+0x49/0x70
[   85.594171]  ip6_input+0x43/0xe0
[   85.594329]  ? nf_hook_slow+0x48/0x100
[   85.594487]  ipv6_rcv+0x16e/0x1a0
[   85.594646]  ? __pfx_ip6_rcv_finish+0x10/0x10
[   85.594805]  __netif_receive_skb_one_core+0x64/0xa0
[   85.594964]  __netif_receive_skb+0x15/0x60
[   85.595120]  process_backlog+0x9a/0x140
[   85.595275]  __napi_poll+0x31/0x1d0
[   85.595429]  net_rx_action+0x29d/0x310
[   85.595583]  __do_softirq+0xcd/0x2a0
[   85.595735]  do_softirq.part.0+0x41/0x60
[   85.595882]  </IRQ>
[   85.596021]  <TASK>
[   85.596149]  __local_bh_enable_ip+0x6e/0x70
[   85.596273]  __dev_queue_xmit+0x33d/0xde0
[   85.596395]  ? mas_spanning_rebalance.isra.0+0x9e8/0x1160
[   85.596516]  ip6_finish_output2+0x310/0x720
[   85.596638]  ip6_finish_output+0x1fb/0x320
[   85.596757]  ip6_output+0x74/0x140
[   85.596899]  ip6_xmit+0x42d/0x6a0
[   85.597016]  ? __find_rr_leaf+0x1f4/0x2a0
[   85.597130]  ? ip6_dst_check+0xcd/0xf0
[   85.597241]  ? __sk_dst_check+0x41/0xa0
[   85.597351]  ? inet6_csk_route_socket+0x123/0x210
[   85.597461]  inet6_csk_xmit+0xdb/0x140
[   85.597570]  __tcp_transmit_skb+0x57e/0xc90
[   85.597680]  ? __mod_memcg_state+0x79/0x100
[   85.597791]  tcp_connect+0xbd4/0x1020
[   85.597900]  ? __pfx_read_tsc+0x10/0x10
[   85.598010]  ? ktime_get_with_offset+0x57/0xd0
[   85.598121]  tcp_v6_connect+0x477/0x660
[   85.598231]  __inet_stream_connect+0xe0/0x3e0
[   85.598342]  ? tomoyo_socket_connect_permission+0x96/0xd0
[   85.598454]  ? tcp_disconnect+0x59f/0x690
[   85.598567]  inet_stream_connect+0x3f/0x60
[   85.598679]  __sys_connect_file+0x6a/0x80
[   85.598791]  __sys_connect+0xaa/0xe0
[   85.598901]  ? do_fcntl+0x1ec/0x680
[   85.599013]  __x64_sys_connect+0x1c/0x30
[   85.599123]  x64_sys_call+0x1e1d/0x20c0
[   85.599235]  do_syscall_64+0x54/0x120
[   85.599345]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   85.599456] RIP: 0033:0x7fe2c71274f7
[   85.599568] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
[   85.599818] RSP: 002b:00007ffd4f0fcfe8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
[   85.599950] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe2c71274f7
[   85.600083] RDX: 000000000000001c RSI: 00007ffd4f0fd170 RDI: 0000000000000005
[   85.600218] RBP: 00005629778dd160 R08: 0000000000000055 R09: 0000000000000007
[   85.600353] R10: 00007fe2c70075c0 R11: 0000000000000246 R12: 00005629778d9b60
[   85.600489] R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
[   85.600626]  </TASK>

5. userspace segmentation fault triggered by test cases

Running h2load against the deproxy backend crashes the Python interpreter that runs the test framework with a segfault; this does not happen with the nginx backend. The cause is still unknown.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005598a4937ac4 in _PyObject_GC_UNTRACK (op=0x7ff93d13f7f0) at ../Include/internal/pycore_object.h:124
124     ../Include/internal/pycore_object.h: No such file or directory.
[Current thread is 1 (Thread 0x7ff93ee00640 (LWP 18093))]
(gdb) bt
#0  0x00005598a4937ac4 in _PyObject_GC_UNTRACK (op=(' ',)) at ../Include/internal/pycore_object.h:124
#1  PyObject_GC_UnTrack (op_raw=0x7ff93d13f7f0) at ../Modules/gcmodule.c:2255
#2  tupledealloc (op=0x7ff93d13f7f0) at ../Objects/tupleobject.c:271
#3  0x00005598a4940c17 in _Py_Dealloc (op=(' ',)) at ../Objects/object.c:2301
#4  _Py_DECREF (op=(' ',)) at ../Include/object.h:500
#5  method_vectorcall_VARARGS (func=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/descrobject.c:312
#6  0x00005598a494545c in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ff9400e01e8, callable=<method_descriptor at remote 0x7ff942d8c900>,
    tstate=0x5598a6336a40) at ../Include/cpython/abstract.h:114
#7  PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ff9400e01e8, callable=<method_descriptor at remote 0x7ff942d8c900>)
    at ../Include/cpython/abstract.h:123
#8  call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7ff93edfe2e0, tstate=<optimized out>) at ../Python/ceval.c:5893
#9  _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:4198
#10 0x00005598a495c9fc in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7ff9400e0040, for file /home/kingluo/tempesta-test/helpers/deproxy.py, line 181, in from_stream (rfile=<_io.StringIO at remote 0x7ff93f051750>, no_crlf=False, is_h2=False, headers=<HeaderCollection(headers=[('host', '127.0.0.1'), ('user-agent', 'h2load nghttp2/1.43.0'), ('x-forwarded-for', '127.0.0.1')], is_expected=False, expected_time_delta=0) at remote 0x7ff93cbcdc30>, line='\r\n', name='via', value='1.1 tempesta_fw (Tempesta FW 0.8.0)'), tstate=0x5598a6336a40)
    at ../Include/internal/pycore_ceval.h:46

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005599ed6643ad in dictiter_iternextitem (di=0x7f9f007310d0) at ../Objects/dictobject.c:3940
3940    ../Objects/dictobject.c: No such file or directory.
[Current thread is 1 (Thread 0x7f9f00200640 (LWP 18231))]
(gdb) bt
#0  0x00005599ed6643ad in dictiter_iternextitem (di=0x7f9f007310d0) at ../Objects/dictobject.c:3940
#1  0x00005599ed648650 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:4001
#2  0x00005599ed654c14 in _PyEval_EvalFrame (throwflag=0, f=<error reading variable: Cannot access memory at address 0xffff000000000020>, tstate=0x5599ef287920)
    at ../Include/internal/pycore_ceval.h:46
#3  _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7f9f02204dd0, tstate=0x5599ef287920) at ../Python/ceval.c:5067
#4  _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=<optimized out>, func=<function at remote 0x7f9f02204dc0>) at ../Objects/call.c:342
#5  _PyObject_FastCallDictTstate (tstate=0x5599ef287920, callable=<function at remote 0x7f9f02204dc0>, args=<optimized out>, nargsf=<optimized out>,
    kwargs=<optimized out>) at ../Objects/call.c:142
#6  0x00005599ed669a64 in _PyObject_Call_Prepend (kwargs=0x0, args=(),
    obj=<HeaderCollection(headers=[], is_expected=False, expected_time_delta=0) at remote 0x7f9efce5c130>, callable=<function at remote 0x7f9f02204dc0>,
    tstate=0x5599ef287920) at ../Objects/call.c:431
#7  slot_tp_init (self=self@entry=<HeaderCollection(headers=[], is_expected=False, expected_time_delta=0) at remote 0x7f9efce5c130>, args=args@entry=(),
    kwds=kwds@entry=0x0) at ../Objects/typeobject.c:7734
#8  0x00005599ed655a1c in type_call (kwds=0x0, args=(), type=<optimized out>) at ../Objects/typeobject.c:1135
#9  _PyObject_MakeTpCall (tstate=0x5599ef287920, callable=<type at remote 0x5599ee9aabe0>, args=<optimized out>, nargs=<optimized out>, keywords=0x0)
    at ../Objects/call.c:215
#10 0x00005599ed64e096 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f9f014dc1e8, callable=<optimized out>, tstate=<optimized out>)
    at ../Include/cpython/abstract.h:112
#11 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f9f014dc1e8, callable=<optimized out>, tstate=<optimized out>)
    at ../Include/cpython/abstract.h:99
#12 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f9f014dc1e8, callable=<optimized out>) at ../Include/cpython/abstract.h:123
#13 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7f9f001fe2e0, tstate=<optimized out>) at ../Python/ceval.c:5893
#14 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:4213
#15 0x00005599ed65f9fc in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7f9f014dc040, for file /home/kingluo/tempesta-test/helpers/deproxy.py, line 159, in from_stream (rfile=<_io.StringIO at remote 0x7f9f0044d750>, no_crlf=False, is_h2=False), tstate=0x5599ef287920) at ../Include/internal/pycore_ceval.h:46

Program terminated with signal SIGSEGV, Segmentation fault.
#0  BaseException_set_tb (_unused_ignored=0x0, tb=0x7fa4a8534e80, self=0x0) at ../Include/object.h:472
472     ../Include/object.h: No such file or directory.
[Current thread is 1 (Thread 0x7fa4a8400640 (LWP 18321))]
(gdb) bt
#0  BaseException_set_tb (_unused_ignored=0x0, tb=<traceback at remote 0x7fa4a8534e80>, self=0x0) at ../Include/object.h:472
#1  PyException_SetTraceback (tb=<traceback at remote 0x7fa4a8534e80>, self=0x0) at ../Objects/exceptions.c:338
#2  _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:4489
#3  0x0000556039dc3afe in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7fa4ab9589a0, for file /usr/lib/python3.10/collections/__init__.py, line 983, in __getitem__ (self=<ChainMap(maps=[{}, {'ip': '127.0.0.1', 'ipv6': '::1', 'verbose': '1', 'workdir': '/tmp/host', 'duration': '10', 'concurrent_connections': '10', 'log_file': 'tests_log.log', 'stress_threads': '2', 'stress_large_content_length': '65536', 'stress_requests_count': '100', 'stress_mtu': '1500', 'long_body_size': '500'}, {}]) at remote 0x7fa4a8524220>, key='verbose', mapping={...}),
    tstate=0x55603d21f510) at ../Include/internal/pycore_ceval.h:46
#4  _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7fa4abda39b0, tstate=0x55603d21f510) at ../Python/ceval.c:5067
#5  _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=<optimized out>, func=<function at remote 0x7fa4abda39a0>) at ../Objects/call.c:342
#6  _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=<function at remote 0x7fa4abda39a0>, tstate=0x55603d21f510)
    at ../Include/cpython/abstract.h:114
#7  vectorcall_unbound (nargs=<optimized out>, args=<optimized out>, func=<optimized out>, unbound=<optimized out>, tstate=<optimized out>)
    at ../Objects/typeobject.c:1629
#8  vectorcall_method (name=<optimized out>, args=<optimized out>, nargs=<optimized out>) at ../Objects/typeobject.c:1661
#9  0x0000556039dc392e in slot_mp_subscript (self=<optimized out>, arg1=<optimized out>) at ../Objects/typeobject.c:7258
#10 0x0000556039d447ae in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:2109
#11 0x0000556039d697f1 in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7fa4a00057d0, for file /usr/lib/python3.10/configparser.py, line 791, in get (self=<ConfigParser(_dict=<type at remote 0x55603a1755c0>, _sections={'General': {'ip': '127.0.0.1', 'ipv6': '::1', 'verbose': '1', 'workdir': '/tmp/host', 'duration': '10', 'concurrent_connections': '10', 'log_file': 'tests_log.log', 'stress_threads': '2', 'stress_large_content_length': '65536', 'stress_requests_count': '100', 'stress_mtu': '1500', 'long_body_size': '500'}, 'Client': {'ip': '127.0.0.2', 'ipv6': '::1', 'hostname': 'localhost', 'ab': 'ab', 'wrk': 'wrk', 'h2load': 'h2load', 'tls-perf': 'tls-perf', 'workdir': '/tmp/client', 'unavaliable_timeout': '300'}, 'Tempesta': {'ip': '127.0.0.1', 'ipv6': '::1', 'hostname': 'localhost', 'user': 'root', 'port': '22', 'srcdir': '/home/kingluo/tempesta', 'workdir': '/tmp/tempesta', 'config': 'tempesta.conf', 'tmp_config': 'tempesta_tmp.conf', 'unavaliable_timeout': '300'}, 'Server': {'ip': '127.0.0.3', 'ipv6': '::1', 'hostname': 'localhost', 'user': 'root', 'port': '22', 'ngin...(truncated), tstate=0x55603d21f510) at ../Include/internal/pycore_ceval.h:46
kingluo commented 3 months ago

6. test failures

~6.1 ss_tcp_data_ready(): sk->sk_error_queue not empty~

https://github.com/tempesta-tech/tempesta/blob/de0a8a38027e28095b4bd98f2078c1ed598f66b7/fw/sock.c#L942

test_cached_data_equal_to_original (cache.test_cache.TestChunkedResponse) ... b"tempesta_lib: loading out-of-tree module taints kernel.\ntempesta_lib: module verification failed: signature and/or required key missing - tainting kernel\n[tdb] Start Tempesta DB\n[tempesta fw] Initializing Tempesta FW kernel module...\n[tempesta fw] Warning: Vhost default doesn't have certificate with matching SAN/CN.\n    Maybe that's fine, but it's worth checking the\n    config - if there is no relations between the\n    names, then host name confusion attack is possible.\n[tempesta fw] Configuration processing is completed.\n[tdb] Opened table /opt/tempesta/db/filter0.tdb: size=16777216 rec_size=20 base=00000000f2e6a053\n[tdb] Opened table /opt/tempesta/db/cache0.tdb: size=268435456 rec_size=0 base=00000000356bc22b\n[tdb] Opened table /opt/tempesta/db/sessions0.tdb: size=16777216 rec_size=312 base=0000000087b8ee1a\n[tdb] Opened table /opt/tempesta/db/client0.tdb: size=16777216 rec_size=624 base=00000000b49ee93e\n[tempesta fw] Open listen socket on: 0.0.0.0:443\n[tempesta fw] Open listen socket on: 0.0.0.0\n[tempesta fw] Tempesta FW is ready\n[tempesta fw] ERROR: error data in socket 00000000b7242a90\n[tdb] Close table 'client0.tdb'\n[tdb] Close table 'sessions0.tdb'\n[tdb] Close table 'cache0.tdb'\n[tdb] Close table 'filter0.tdb'\n[tempesta fw] modules are stopped\n[tempesta fw] exiting...\n[tdb] Shutdown Tempesta DB\n"
ERROR
test_h2_cached_data_equal_to_original (cache.test_cache.TestChunkedResponse) ... ok

======================================================================
ERROR: test_cached_data_equal_to_original (cache.test_cache.TestChunkedResponse)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/framework/tester.py", line 411, in cleanup_check_dmesg
    raise Exception(f"{err} happened during test on Tempesta")
Exception: ERROR happened during test on Tempesta

solution

In the new kernel, the TX timestamp is looped back to the socket's error queue together with a copy of the original packet. Our code treats any skb found in sk_error_queue as an error and prints an error message, which fails the test case even though it passes all of its assertions.

We should filter out such non-error skbs even though they are appended to sk_error_queue.

[  276.898889] Call Trace:
[  276.898893]  <IRQ>
[  276.898898]  dump_stack_lvl+0x70/0x90
[  276.898908]  dump_stack+0x14/0x20
[  276.898914]  ss_tcp_data_ready+0xfe/0x160 [tempesta_fw]
[  276.898937]  tcp_data_ready+0x35/0xe0
[  276.899097]  tcp_data_queue+0x8d5/0xe20
[  276.899235]  tcp_rcv_established+0x244/0x790
[  276.899366]  ? tcp_inbound_hash.constprop.0+0x4e/0x3e0
[  276.899493]  tcp_v4_do_rcv+0x16a/0x2a0
[  276.899613]  tcp_v4_rcv+0xf01/0xf70
[  276.899730]  ? raw_local_deliver+0xcd/0x240
[  276.899847]  ip_protocol_deliver_rcu+0x37/0x180
[  276.899962]  ip_local_deliver_finish+0x8a/0xb0
[  276.900073]  ip_local_deliver+0x73/0x120
[  276.900184]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[  276.900295]  ip_rcv+0x18f/0x1b0
[  276.900408]  ? __pfx_ip_rcv_finish+0x10/0x10
[  276.900518]  __netif_receive_skb_one_core+0x8a/0xa0
[  276.900629]  __netif_receive_skb+0x15/0x60
[  276.900739]  process_backlog+0x9a/0x140
[  276.900843]  __napi_poll+0x31/0x1d0
[  276.900945]  net_rx_action+0x29d/0x310
[  276.901048]  __do_softirq+0xcd/0x2a0
[  276.901151]  do_softirq.part.0+0x41/0x60
[  276.901307]  </IRQ>
[  276.901454]  <TASK>
[  276.901601]  __local_bh_enable_ip+0x6e/0x70
[  276.901751]  __dev_queue_xmit+0x33d/0xde0
[  276.901897]  ? mas_alloc_nodes+0x16a/0x200
[  276.902045]  ? hash_conntrack_raw+0x6b/0xe0 [nf_conntrack]
[  276.902202]  ? __pte_offset_map+0x20/0x190
[  276.902348]  ip_finish_output2+0x2dc/0x550
[  276.902495]  ? nf_conntrack_in+0xeb/0x6c0 [nf_conntrack]
[  276.902644]  __ip_finish_output+0xb7/0x190
[  276.902786]  ip_finish_output+0x2d/0xe0
[  276.902926]  ip_output+0x63/0xf0
[  276.903062]  ? __pfx_ip_finish_output+0x10/0x10
[  276.903199]  ip_local_out+0x62/0x70
[  276.903334]  __ip_queue_xmit+0x19b/0x4f0
[  276.903471]  ? set_ptes.constprop.0+0x2b/0x90
[  276.903605]  ip_queue_xmit+0x19/0x20
[  276.903739]  __tcp_transmit_skb+0xada/0xc90
[  276.903867]  tcp_write_xmit+0x5d0/0x1420
[  276.903992]  __tcp_push_pending_frames+0x3b/0x110
[  276.904114]  tcp_send_fin+0x52/0x190
[  276.904237]  __tcp_close+0x2eb/0x3f0
[  276.904360]  tcp_close+0x29/0xa0
[  276.904482]  inet_release+0x4c/0x90
[  276.904601]  __sock_release+0x40/0xc0
[  276.904716]  sock_close+0x19/0x30
[  276.904827]  __fput+0xa8/0x2f0
[  276.904940]  __fput_sync+0x1e/0x30
[  276.905051]  __x64_sys_close+0x42/0x90
[  276.905162]  x64_sys_call+0x18ea/0x20c0
[  276.905276]  do_syscall_64+0x54/0x120
[  276.905388]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[  276.905503] RIP: 0033:0x7f0e86914f67
[  276.905616] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  276.905871] RSP: 002b:00007ffea09c97f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  276.906003] RAX: ffffffffffffffda RBX: 0000562c8829ab60 RCX: 00007f0e86914f67
[  276.906136] RDX: 0000000000000006 RSI: 0000000000000006 RDI: 0000000000000006
[  276.906270] RBP: 0000000000000006 R08: 0000000000000000 R09: 0000000000000000
[  276.906515] R10: 00007f0e8680fb40 R11: 0000000000000246 R12: 0000562c8829b950
[  276.906657] R13: 0000000000000000 R14: 00007ffea09c9e70 R15: 0000000000000000
[  276.906797]  </TASK>

~6.2 frang_resp_fwd_process() not called~

Bug: type mismatch

When the code is built with the new kernel's compile options, an int value of -1 becomes 255 once it is stored in a char, which produces an invalid frang index.

https://github.com/tempesta-tech/tempesta/blob/de0a8a38027e28095b4bd98f2078c1ed598f66b7/fw/gfsm.h#L167

https://github.com/tempesta-tech/tempesta/blob/de0a8a38027e28095b4bd98f2078c1ed598f66b7/fw/gfsm.c#L137
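The conversion can be reproduced in user space with ctypes. This is only a sketch: `INVALID_INDEX` is a hypothetical sentinel name, and I am assuming the relevant compile option is `-funsigned-char` (which, to my knowledge, newer kernels are built with); the comment above does not name the option.

```python
import ctypes

INVALID_INDEX = -1  # hypothetical "no index" sentinel kept as an int

# c_byte models a signed char, c_ubyte an unsigned char.
# ctypes does no overflow checking, so the value simply wraps.
as_signed = ctypes.c_byte(INVALID_INDEX).value
as_unsigned = ctypes.c_ubyte(INVALID_INDEX).value

print(as_signed, as_unsigned)

# A later comparison against the sentinel only holds in the signed case.
print(as_signed == INVALID_INDEX, as_unsigned == INVALID_INDEX)
```

Declaring the field explicitly as a signed type (or comparing against an explicit unsigned sentinel) would make the behaviour independent of the compiler flag.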

test log

======================================================================
FAIL: test_block_action_attack_reply_not_on_req_rcv_event (http_general.test_block_action.BlockActionReply)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/http_general/test_block_action.py", line 204, in test_block_action_attack_reply_not_on_req_rcv_event
    self.check_last_error_response(client, expected_status_code="403")
  File "/home/kingluo/tempesta-test/http_general/test_block_action.py", line 98, in check_last_error_response
    self.assertEqual(client.last_response.status, expected_status_code)
AssertionError: '200' != '403'
- 200
+ 403

======================================================================
FAIL: test_reaching_the_limit_2 (t_frang.test_http_resp_code_block.HttpRespCodeBlockOneClientHttp)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_frang/test_http_resp_code_block.py", line 146, in test_reaching_the_limit_2
    self.assertTrue(client.wait_for_connection_close())
AssertionError: False is not true

======================================================================
FAIL: test_timeout_invalid (t_frang.test_client_body_and_header_timeout.ClientBodyTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_frang/test_client_body_and_header_timeout.py", line 58, in test_timeout_invalid
    self.check_last_response(self.get_client("deproxy-1"), "403", self.error)
  File "/home/kingluo/tempesta-test/t_frang/frang_test_case.py", line 122, in check_last_response
    self.assertEqual(
AssertionError: '200' != '403'
- 200
+ 403
 : HTTP response status codes mismatch.

======================================================================
FAIL: test_timeout_invalid (t_frang.test_client_body_and_header_timeout.ClientBodyTimeoutH2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_frang/test_client_body_and_header_timeout.py", line 58, in test_timeout_invalid
    self.check_last_response(self.get_client("deproxy-1"), "403", self.error)
  File "/home/kingluo/tempesta-test/t_frang/frang_test_case.py", line 122, in check_last_response
    self.assertEqual(
AssertionError: '200' != '403'
- 200
+ 403
 : HTTP response status codes mismatch.

======================================================================
FAIL: test_body_len (t_frang.test_length.FrangLengthH2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_frang/test_length.py", line 283, in test_body_len
    self.check_response(
  File "/home/kingluo/tempesta-test/t_frang/frang_test_case.py", line 136, in check_response
    self.assertEqual(
AssertionError: '200' != '403'
- 200
+ 403
 : HTTP response status codes mismatch.

======================================================================
FAIL: test_two_clients_two_ip (t_frang.test_request_rate_burst.FrangRequestRateH2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_frang/test_request_rate_burst.py", line 109, in test_two_clients_two_ip
    self.assert_reset_socks(self.sniffer.packets)
  File "/home/kingluo/tempesta-test/helpers/asserts.py", line 40, in assert_reset_socks
    self.assertTrue(
AssertionError: False is not true : Ports must be reset: {39561}, but the actual state is: set()

======================================================================
FAIL: test_chunk_cnt_invalid (t_frang.test_http_body_and_header_chunk_cnt.HttpHeaderChunkCnt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_frang/test_http_body_and_header_chunk_cnt.py", line 63, in test_chunk_cnt_invalid
    self.check_response(client, "403", self.error)
  File "/home/kingluo/tempesta-test/t_frang/frang_test_case.py", line 136, in check_response
    self.assertEqual(
AssertionError: '200' != '403'
- 200
+ 403
 : HTTP response status codes mismatch.
kingluo commented 2 months ago

7. TSO

| | local http curl | local https curl | remote http curl | remote https curl |
|---|---|---|---|---|
| remote nginx | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| local nginx | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |

~7.1 TLS Alert: Bad Record MAC~

This failure exists for both HTTP/1 and HTTP/2:

  1. It only happens with parallel HTTP/2 streams or parallel HTTP/1 requests.
  2. All requests are answered correctly at the socket level, but the MAC check fails:
    • Wireshark shows the MAC failure: the last part of the decrypted HTTP response (maybe a few hundred bytes) is corrupted.
    • OpenSSL (or another user-space SSL library) returns the DECRYPTION_FAILED_OR_BAD_RECORD_MAC error to curl.
  3. It is specific to Tempesta: it does not occur if the proxy is switched to nginx.
  4. It occurs with all TLS clients, such as Python httpx, and even with other SSL implementations such as wolfSSL.
  5. There is no problem with linux-5.10.35-tfw.
  6. If the nginx backend does not enable sendfile, everything works.
  7. When the backend is placed on a different VM, Tempesta crashes.

The likely cause is skb patching, which needs further fixing.

httpx HTTP/1 client example:

import asyncio
import ssl

import httpx

async def main():
    # Certificate verification is disabled: the server under test is local.
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=100), verify=False
    ) as client:
        n = 20
        tasks = [client.get(f"https://127.0.0.1/{i}") for i in range(1, n + 1)]
        # return_exceptions=True collects failures instead of raising them.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        assert len(results) == n
        for res in results:
            if isinstance(res, ssl.SSLError):
                # The bad record MAC surfaces here as an SSL error.
                print(res)
                continue
            assert res.status_code == 200
            assert len(res.content) == 65536

if __name__ == "__main__":
    asyncio.run(main())

test log:

======================================================================
FAIL: test_concurrent_requests (t_stress.test_stress.H2CurlStress)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_stress/test_stress.py", line 392, in test_concurrent_requests
    self.make_requests("concurrent")
  File "/home/kingluo/tempesta-test/t_stress/test_stress.py", line 356, in make_requests
    self.assertFalse(client.last_response.stderr)
AssertionError: 'curl: (56) OpenSSL SSL_read: error:1C800066:Provider routines::cipher operation failed, errno 0\ncurl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 0

======================================================================
FAIL: test_concurrent_requests (t_stress.test_stress.TlsCurlStress)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kingluo/tempesta-test/t_stress/test_stress.py", line 392, in test_concurrent_requests
    self.make_requests("concurrent")
  File "/home/kingluo/tempesta-test/t_stress/test_stress.py", line 356, in make_requests
    self.assertFalse(client.last_response.stderr)
AssertionError: 'curl: (56) OpenSSL SSL_read: error:1C800066:Provider routines::cipher operation failed, errno 0\ncurl: (56) OpenSSL SSL_read: error:1C800066:Provider routines::cipher operation failed, errno 0

tcpdump sample:

tcpdump.7z.zip


The TLS MAC error occurs because in the new kernel the shared flag lives in the flags field, not in the tx_flags field. With sendfile, some skbs received by Tempesta carry the shared flag, i.e. the page holding the file content is shared by the TLS encryption of concurrent responses. The resulting write race corrupts the data and makes decryption fail on the client. We should fix all tx_flags field references in our code, e.g.

@@ -1569,7 +1572,7 @@ ss_skb_to_sgvec_with_new_pages(struct sk_buff *skb, struct scatterlist *sgl,
        int i;

        /* TODO: process of SKBTX_ZEROCOPY_FRAG for MSG_ZEROCOPY */
-       if (skb_shinfo(skb)->tx_flags & SKBFL_SHARED_FRAG) {
+       if (skb_shinfo(skb)->flags & SKBFL_ALL_ZEROCOPY) {
                if (head_data_len) {
                        sg_set_buf(sgl + out_frags, skb->data, head_data_len);
                        out_frags++;
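A toy illustration of why encrypting a shared page in place breaks concurrent responses, and why copying to new pages first (as ss_skb_to_sgvec_with_new_pages does for flagged skbs) avoids it. The XOR "cipher" and the helper names are made up for the example, and the two encryptions run sequentially here rather than racing, which is enough to show the corruption.

```python
def encrypt_in_place(buf: bytearray, key: int) -> None:
    """Toy XOR 'cipher' that overwrites the buffer it is given."""
    for i in range(len(buf)):
        buf[i] ^= key


def encrypt_copy(buf: bytearray, key: int) -> bytearray:
    """Same toy cipher, but writing into freshly allocated memory."""
    out = bytearray(buf)
    encrypt_in_place(out, key)
    return out


plaintext = b"file content served via sendfile"

# Two responses whose fragments point at the same page-cache page.
shared_page = bytearray(plaintext)
resp_a = shared_page
resp_b = shared_page
encrypt_in_place(resp_a, 0x11)
encrypt_in_place(resp_b, 0x22)  # re-encrypts bytes A already encrypted

# Client A can no longer recover the plaintext with its own key.
decrypted_a = bytes(b ^ 0x11 for b in resp_a)
print(decrypted_a == plaintext)

# With private copies each response decrypts correctly and the
# original page stays untouched.
page = bytearray(plaintext)
copy_a = encrypt_copy(page, 0x11)
copy_b = encrypt_copy(page, 0x22)
print(bytes(b ^ 0x11 for b in copy_a) == plaintext)
print(bytes(page) == plaintext)
```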

~7.2 remote backend issue~

crash log:

[164267.897887] skbuff: skb_over_panic: text:ffffffff94e71aaa len:34 put:34 head:ffff8e1a0edbe680 data:ffff8e1a0edbe7c0 tail:0x162 end:0x140 dev:<NULL>
[164267.898645] ------------[ cut here ]------------
[164267.898895] kernel BUG at net/core/skbuff.c:195!
[164267.899163] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[164267.899339] CPU: 2 PID: 452489 Comm: curl Kdump: loaded Tainted: G           OE      6.8.9+ #102
[164267.899499] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[164267.899645] RIP: 0010:skb_panic+0x5a/0x60
[164267.900411] Code: c7 c7 c0 38 a0 95 51 8b 88 c4 00 00 00 51 8b 88 c0 00 00 00 51 44 89 d1 ff b0 d0 00 00 00 4c 8b 88 c8 00 00 00 e8 96 aa 4f ff <0f> 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[164267.900690] RSP: 0000:ffffb307c21f7a00 EFLAGS: 00010246
[164267.900954] RAX: 0000000000000087 RBX: ffff8e1a02d05f00 RCX: 0000000000000000
[164267.901158] RDX: 0000000000000000 RSI: ffff8e1b3bd21840 RDI: ffff8e1b3bd21840
[164267.901384] RBP: ffffb307c21f7a20 R08: 0000000000000000 R09: ffffb307c21f7888
[164267.901595] R10: ffffb307c21f7880 R11: ffffffff95d3fc88 R12: ffff8e1a0edbe480
[164267.901807] R13: 0000000000000586 R14: 0000000000000022 R15: 0000000000000000
[164267.902024] FS:  00007fa8b7496740(0000) GS:ffff8e1b3bd00000(0000) knlGS:0000000000000000
[164267.902234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[164267.902444] CR2: 0000560e87547f28 CR3: 0000000053eec000 CR4: 0000000000b50ef0
[164267.902656] Call Trace:
[164267.902877]  <TASK>
[164267.903083]  ? show_regs+0x6e/0x80
[164267.903654]  ? die+0x3c/0xa0
[164267.903856]  ? do_trap+0xd4/0xf0
[164267.904080]  ? do_error_trap+0x75/0xa0
[164267.904279]  ? skb_panic+0x5a/0x60
[164267.904475]  ? exc_invalid_op+0x57/0x80
[164267.904879]  ? skb_panic+0x5a/0x60
[164267.905072]  ? asm_exc_invalid_op+0x1f/0x30
[164267.905280]  ? skb_panic+0x5a/0x60
[164267.905477]  skb_put+0x57/0x60
[164267.905668]  skb_split+0x8a/0x330
[164267.905870]  tso_fragment+0x129/0x200
[164267.906174]  tcp_write_xmit+0x569/0x1420
[164267.906361]  __tcp_push_pending_frames+0x3b/0x110
[164267.906550]  tcp_rcv_established+0x677/0x790
[164267.906738]  ? tcp_inbound_hash.constprop.0+0x4e/0x3e0
[164267.906938]  tcp_v4_do_rcv+0x16a/0x2a0
[164267.907125]  tcp_v4_rcv+0xf01/0xf70
[164267.907327]  ? raw_local_deliver+0xcd/0x240
[164267.907522]  ip_protocol_deliver_rcu+0x37/0x180
[164267.907715]  ip_local_deliver_finish+0x8a/0xb0
[164267.907892]  ip_local_deliver+0x73/0x120
[164267.908084]  ? xfrm_alloc_userspi+0x151/0x280 [xfrm_user]
[164267.908298]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[164267.908473]  ip_rcv+0x18f/0x1b0
[164267.908649]  ? __pfx_ip_rcv_finish+0x10/0x10
[164267.908823]  __netif_receive_skb_one_core+0x8a/0xa0
[164267.909005]  __netif_receive_skb+0x15/0x60
[164267.909170]  process_backlog+0x9a/0x140
[164267.909335]  __napi_poll+0x31/0x1d0
[164267.909493]  net_rx_action+0x29d/0x310
[164267.909646]  __do_softirq+0xcd/0x2a5
[164267.909811]  __irq_exit_rcu+0x6b/0x90
[164267.910092]  irq_exit_rcu+0x12/0x20
[164267.910241]  sysvec_call_function_single+0x47/0x90
[164267.910398]  asm_sysvec_call_function_single+0x1f/0x30
[164267.910539] RIP: 0033:0x7fa8b6db6970
[164267.910672] Code: 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 48 85 ff 74 0f 80 27 fe e9 0f ff ff ff 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 <f3> 0f 1e fa 41 57 48 8d 15 03 4a 24 00 be 04 00 00 00 41 56 41 55
[164267.910954] RSP: 002b:00007fffd1fe1bb8 EFLAGS: 00000206
[164267.911093] RAX: 0000560e873407e0 RBX: 00007fa8b7521f56 RCX: 0000000000000000
[164267.911234] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[164267.911374] RBP: 0000000000000000 R08: 0000560e873099a0 R09: 0000000000000000
[164267.911513] R10: 00000000e7b82203 R11: fbe77e4b322fe13a R12: 0000560e873407e0
[164267.911652] R13: 0000000000000004 R14: 00007fffd1fe1c80 R15: 00007fa8b6d9bd20
[164267.911790]  </TASK>

The two-machine test crashes because tcp_write_xmit expects a paged-only skb, so skb_split in tso_fragment panics when the skb contains linear data. In the new kernel, tcp_sendmsg copies data only to frags, never to skb->head, which is a breaking change. Why does our skb (the response) contain linear data? Because our responses come from the backend (via net_rx_action, not tcp_sendmsg), so a packet may contain enough linear data to trigger tso_fragment. Moreover, our tfw_tls_encrypt assumes that the skb contains linear data, so the two assumptions conflict.

The solution is to handle linear data in the send path, as the old kernel did.
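A simplified model of the failure, to make the mechanism concrete. `Skb`, `split` and `SkbOverPanic` are toy names, the sizes are invented (only the 34-byte put matches the crash log), and the assumption that the skb allocated in tso_fragment has no usable linear room is my reading of the log, not something verified against the kernel source.

```python
from dataclasses import dataclass, field


class SkbOverPanic(Exception):
    """Stands in for the kernel's skb_over_panic BUG."""


@dataclass
class Skb:
    """Toy skb: a linear area with fixed capacity plus paged fragments."""
    head_capacity: int = 0
    linear: bytes = b""
    frags: list = field(default_factory=list)

    def put(self, data: bytes) -> None:
        # Like skb_put(): growing past the linear capacity is fatal.
        if len(self.linear) + len(data) > self.head_capacity:
            raise SkbOverPanic(f"put:{len(data)} capacity:{self.head_capacity}")
        self.linear += data


def split(skb: Skb, new: Skb, at: int) -> None:
    """Move everything after byte offset `at` from skb into new."""
    if at < len(skb.linear):
        # Split point inside the linear area: the linear tail has to be
        # copied into the new skb's linear area, then all frags move over.
        new.put(skb.linear[at:])
        skb.linear = skb.linear[:at]
        new.frags, skb.frags = skb.frags, []
        return
    # Split point inside the frags: no linear copy is needed.
    pos, keep, move = len(skb.linear), [], []
    for frag in skb.frags:
        cut = min(max(at - pos, 0), len(frag))
        if cut:
            keep.append(frag[:cut])
        if cut < len(frag):
            move.append(frag[cut:])
        pos += len(frag)
    skb.frags, new.frags = keep, move


# Paged-only skb, as the local send path builds it: splitting works even
# though the new skb has no linear room.
paged = Skb(frags=[b"x" * 3000])
tail = Skb(head_capacity=0)
split(paged, tail, 1448)
print(len(paged.frags[0]), len(tail.frags[0]))

# skb received from a backend: the payload sits in the linear area.
received = Skb(head_capacity=2048, linear=b"y" * 1414)
try:
    split(received, Skb(head_capacity=0), 1380)
except SkbOverPanic as exc:
    print("panic:", exc)
```

In this model the crash goes away either when the new skb gets linear room or when the linear payload is moved to frags before the skb reaches the send path; which of the two fits Tempesta better is a separate question.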