tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0
616 stars 103 forks source link

Crash in APM #752

Closed krizhanovsky closed 7 years ago

krizhanovsky commented 7 years ago

From https://github.com/tempesta-tech/tempesta/issues/692#issuecomment-308987475 . The problem seems happen at https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/apm.c#L223 , i.e. cnt isn't incremented in the loop above.

I got the oops at the below using workload ./wrk -t 16 -c 4096 -d 3600s http://centos:80/ in VM which also runs Apache HTTPD with connection reset each 100 requests.

    [ 4512.704257] divide error: 0000 [#1] SMP
    [ 4512.705113] Modules linked in: tfw_sched_rr(O) tfw_sched_http(O) tfw_sched_hash(O) tempesta_fw(O) tempesta_db(O) tempesta_tls(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_vs nf_conntrack dm_mirror dm_region_hash dm_log dm_mod xfs libcrc32c ppdev ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd input_leds led_class pcspkr i2c_piix4 i2c_core parport_pc parport acpi_cpufreq uinput ip_tables btrfs xor raid6_pq ata_generic pata_acpi crc32c_intel serio_raw e1000 ata_piix floppy ipv6 crc_ccitt autofs4
    [ 4512.711182] CPU: 0 PID: 12929 Comm: httpd Tainted: G           O    4.8.15-tfw #41
    [ 4512.711182] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    [ 4512.711182] task: ffff8800357cc500 task.stack: ffff880023b1c000
    [ 4512.711182] RIP: 0010:[<ffffffffa068aea7>]  [<ffffffffa068aea7>] __tfw_stats_adjust+0x67/0x250 [tempesta_fw]
    [ 4512.711182] RSP: 0018:ffff88007fc03898  EFLAGS: 00010246
    [ 4512.711182] RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff88005b964688
    [ 4512.711182] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88005b964608
    [ 4512.735428] RBP: ffff88007fc038c8 R08: 0000000000000010 R09: 0000000000000000
    [ 4512.735428] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [ 4512.735428] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [ 4512.735428] FS:  00007f07c8d3e880(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
    [ 4512.735428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 4512.735428] CR2: 00007f9a699ee640 CR3: 0000000026e40000 CR4: 00000000003406f0
    [ 4512.735428] Stack:
    [ 4512.735428]  0000000000000000 ffff88005b964608 0000000000000625 ffff88005b964008
    [ 4512.735428]  0000000000000000 0000000000000000 ffff88007fc038f8 ffffffffa068bddf
    [ 4512.735428]  ffff88001f78c608 ffff8800216ce020 ffff8800265da020 ffffffffa0693660
    [ 4512.735428] Call Trace:
    [ 4512.735428]  <IRQ> 
    [ 4512.735428]  [<ffffffffa068bddf>] tfw_apm_update+0x2cf/0x3c0 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa0693660>] ? tfw_http_send_500.isra.28+0xd0/0xd0 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06936b2>] tfw_http_resp_cache_cb+0x52/0x80 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa068e157>] tfw_cache_process+0x117/0x2a0 [tempesta_fw]
    [ 4512.735428]  [<ffffffff810f0ce4>] ? __getnstimeofday64+0x94/0x140
    [ 4512.735428]  [<ffffffffa0693e8d>] tfw_http_resp_cache+0x7d/0xc0 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06942bc>] tfw_http_msg_process+0x1fc/0x750 [tempesta_fw]
    [ 4512.735428]  [<ffffffff8154ab4d>] ? __alloc_skb+0x4d/0x2a0
    [ 4512.735428]  [<ffffffffa069161c>] __gfsm_fsm_exec+0x5c/0x90 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa0691844>] tfw_gfsm_dispatch+0x14/0x20 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa0690e3d>] tfw_connection_recv+0x1d/0x20 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06a84fc>] ss_tcp_process_data+0x17c/0x360 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06a8730>] ss_tcp_data_ready+0x50/0x90 [tempesta_fw]
    [ 4512.735428]  [<ffffffff815c68a6>] tcp_data_queue+0x6e6/0xd60
    [ 4512.735428]  [<ffffffff815c70fe>] tcp_rcv_established+0x1de/0x710
    [ 4512.735428]  [<ffffffff815d2bdc>] tcp_v4_do_rcv+0x10c/0x210
    [ 4512.735428]  [<ffffffff815d41bb>] tcp_v4_rcv+0xadb/0xcd0
    [ 4512.735428]  [<ffffffff815aa6b1>] ? ip_local_deliver_finish+0x41/0x3b0
    [ 4512.735428]  [<ffffffff815aa7a3>] ip_local_deliver_finish+0x133/0x3b0
    [ 4512.735428]  [<ffffffff815aa6b1>] ? ip_local_deliver_finish+0x41/0x3b0
    [ 4512.735428]  [<ffffffff815ab2d0>] ip_local_deliver+0xc0/0xd0
    [ 4512.735428]  [<ffffffff815aa670>] ? inet_del_offload+0x40/0x40
    [ 4512.735428]  [<ffffffff815aabcf>] ip_rcv_finish+0x1af/0x610
    [ 4512.735428]  [<ffffffff815ab5c9>] ip_rcv+0x2e9/0x410
    [ 4512.735428]  [<ffffffff815aaa20>] ? ip_local_deliver_finish+0x3b0/0x3b0
    [ 4512.735428]  [<ffffffff8155dd85>] __netif_receive_skb_core+0x3a5/0xb40
    [ 4512.735428]  [<ffffffff8155e660>] ? process_backlog+0xe0/0x230
    [ 4512.735428]  [<ffffffff8155e53d>] __netif_receive_skb+0x1d/0x60
    [ 4512.735428]  [<ffffffff8155e5f8>] process_backlog+0x78/0x230
    [ 4512.735428]  [<ffffffff8155e660>] ? process_backlog+0xe0/0x230
    [ 4512.735428]  [<ffffffff815607ce>] net_rx_action+0x14e/0x440
    [ 4512.735428]  [<ffffffff81653ca8>] __do_softirq+0xc8/0x473
    [ 4512.735428]  [<ffffffff815aeed7>] ? ip_finish_output2+0x217/0x5c0
    [ 4512.735428]  [<ffffffff8102066e>] __tempesta_do_softirq_fpusafe+0xe/0x10
    [ 4512.735428]  [<ffffffff816521ec>] do_softirq_own_stack+0x1c/0x30
    [ 4512.735428]  <EOI> 
    [ 4512.735428]  [<ffffffff8107065e>] do_softirq+0x4e/0x60
    [ 4512.735428]  [<ffffffff81070718>] __local_bh_enable_ip+0xa8/0xb0
    [ 4512.735428]  [<ffffffff815aef00>] ip_finish_output2+0x240/0x5c0
    [ 4512.735428]  [<ffffffff815afd46>] ? ip_finish_output+0x1c6/0x320
    [ 4512.735428]  [<ffffffff815afd46>] ip_finish_output+0x1c6/0x320
    [ 4512.735428]  [<ffffffff815afea9>] NF_HOOK_COND.part.33.constprop.47+0x9/0x10
    [ 4512.735428]  [<ffffffff815b1563>] ip_output+0xc3/0x110
    [ 4512.735428]  [<ffffffff815afb80>] ? ip_fragment.constprop.49+0x80/0x80
    [ 4512.735428]  [<ffffffff815b0a19>] ip_local_out+0x39/0x70
    [ 4512.735428]  [<ffffffff815b0e4b>] ip_queue_xmit+0x1eb/0x5a0
    [ 4512.735428]  [<ffffffff815b0c65>] ? ip_queue_xmit+0x5/0x5a0
    [ 4512.735428]  [<ffffffff815ca9f4>] tcp_transmit_skb+0x4a4/0x8b0
    [ 4512.735428]  [<ffffffff815caf61>] tcp_write_xmit+0x161/0xef0
    [ 4512.735428]  [<ffffffff815cbfd1>] __tcp_push_pending_frames+0x31/0xd0
    [ 4512.735428]  [<ffffffff815bd16b>] do_tcp_setsockopt.isra.37+0x6ab/0x860
    [ 4512.735428]  [<ffffffff815bd34c>] tcp_setsockopt+0x2c/0x30
    [ 4512.735428]  [<ffffffff815401a4>] sock_common_setsockopt+0x14/0x20
    [ 4512.735428]  [<ffffffff8153ef11>] SyS_setsockopt+0x71/0xc0
    [ 4512.735428]  [<ffffffff810029c8>] do_syscall_64+0x58/0x110
    [ 4512.735428]  [<ffffffff816513da>] entry_SYSCALL64_slow_path+0x25/0x25
    [ 4512.735428] Code: 48 63 11 49 83 c3 01 48 01 d0 4c 63 09 4d 39 d1 76 06 4c 63 11 4d 89 c5 49 83 c0 01 48 83 c1 04 49 83 f8 10 75 d2 48 01 c0 31 d2 <49> f7 f3 4c 39 d0 72 0f 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 
    [ 4512.735428] RIP  [<ffffffffa068aea7>] __tfw_stats_adjust+0x67/0x250 [tempesta_fw]
    [ 4512.735428]  RSP <ffff88007fc03898>
    [ 4512.852777] ---[ end trace 630a16268dd4e293 ]---
    [ 4512.854608] Kernel panic - not syncing: Fatal exception in interrupt
krizhanovsky commented 7 years ago

Got somewhat different crash in APM after running

    # ulimit -n 32768; ./wrk -t 16 -c 4096 -d 3600s http://centos:80/

I used Linux patch https://github.com/tempesta-tech/linux-4.8.15-tfw/commit/9cea1ec0156145217ac6320c40189b40b8160780 and #692 -mitigating patch

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3191,6 +3191,12 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
                if (!fully_acked)
                        break;

+               if (!skb->next) {
+                       pr_err("AK_DBG peer=%x:%x state=%u sk_refcnt=%d flags=%lx\n",
+                               sk->sk_daddr, sk->sk_dport, sk->sk_state,
+                               atomic_read(&sk->sk_refcnt), sk->sk_flags);
+                       continue;
+               }
                tcp_unlink_write_queue(skb, sk);
                sk_wmem_free_skb(sk, skb);
                if (unlikely(skb == tp->retransmit_skb_hint))

Tempesta master commit b60cc3d884f9d838362c6ce8bb64503d06fa9b22) is used. The crash trace is

    [ 4307.593595] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 44s! [swapper/1:0]
    [ 4307.593595] Modules linked in: tfw_sched_ratio(O) tfw_sched_http(O) tfw_sched_hash(O) tempesta_fw(O) tempesta_db(O) tempesta_tls(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_vs nf_conntrack dm_mirror dm_region_hash dm_log dm_mod ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw xfs libcrc32c ppdev gf128mul ablk_helper cryptd input_leds led_class pcspkr parport_pc parport i2c_piix4 i2c_core acpi_cpufreq uinput ip_tables btrfs xor raid6_pq ata_generic pata_acpi crc32c_intel serio_raw e1000 ata_piix floppy ipv6 crc_ccitt autofs4
    [ 4307.593595] irq event stamp: 8232141
    [ 4307.593595] hardirqs last  enabled at (8232140): [<ffffffff81651d92>] restore_regs_and_iret+0x0/0x1d
    [ 4307.593595] hardirqs last disabled at (8232141): [<ffffffff81652ce4>] apic_timer_interrupt+0x84/0x90
    [ 4307.593595] softirqs last  enabled at (8229960): [<ffffffff8106f631>] _local_bh_enable+0x21/0x50
    [ 4307.593595] softirqs last disabled at (8229961): [<ffffffff81070828>] irq_exit+0x98/0xb0
    [ 4307.593595] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.8.15-tfw #45
    [ 4307.593595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    [ 4307.593595] task: ffff88005c8e1700 task.stack: ffff88005c8e8000
    [ 4307.593595] RIP: 0010:[<ffffffffa068fb82>]  [<ffffffffa068fb82>] tfw_apm_prcntl_tmfn+0x7a2/0x970 [tempesta_fw]
    [ 4307.593595] RSP: 0018:ffff88007fd03dc0  EFLAGS: 00000216
    [ 4307.593595] RAX: fc25fb3500000004 RBX: ffff88005946d600 RCX: 000000009b2945f3
    [ 4307.593595] RDX: 000000000000ff52 RSI: 000000000000ff52 RDI: 00000000fc9dfb35
    [ 4307.593595] RBP: ffff88007fd03e30 R08: 000000009b2945f4 R09: 00000000fc25fb35
    [ 4307.593595] R10: 000000000000fb35 R11: 0000000000000001 R12: ffff88005a3cbd90
    [ 4307.593595] R13: 0000000000000000 R14: 0000000000000015 R15: ffffe8ffffd04930
    [ 4307.593595] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
    [ 4307.593595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 4307.593595] CR2: 00007f99b61de000 CR3: 00000000220d2000 CR4: 00000000003406e0
    [ 4307.593595] Stack:
    [ 4307.593595]  ffff88007fd03e30 ffff88005946d000 000000015c8e1700 ffff88005946d180
    [ 4307.593595]  ffffffff810be849 ffff88005c8e1700 0000000000000246 0000000000000101
    [ 4307.593595]  ffff88007fd03e20 ffff88005946d110 ffff88005946d110 0000000000000101
    [ 4307.593595] Call Trace:
    [ 4307.593595]  <IRQ> 
    [ 4307.593595]  [<ffffffff810be849>] ? __lock_is_held+0x49/0x70
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e677e>] call_timer_fn+0x9e/0x310
    [ 4307.593595]  [<ffffffff810e66e5>] ? call_timer_fn+0x5/0x310
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e76b9>] run_timer_softirq+0x1c9/0x620
    [ 4307.593595]  [<ffffffff810f07a6>] ? ktime_get+0x96/0x120
    [ 4307.593595]  [<ffffffff810f7454>] ? clockevents_program_event+0x44/0x110
    [ 4307.593595]  [<ffffffff81653d68>] __do_softirq+0xc8/0x473
    [ 4307.593595]  [<ffffffff81070828>] irq_exit+0x98/0xb0
    [ 4307.593595]  [<ffffffff81653a22>] smp_apic_timer_interrupt+0x42/0x50
    [ 4307.593595]  [<ffffffff81652ce9>] apic_timer_interrupt+0x89/0x90
    [ 4307.593595]  <EOI> 
    [ 4307.593595]  [<ffffffff81028085>] ? default_idle+0x25/0x180
    [ 4307.593595]  [<ffffffff81028083>] ? default_idle+0x23/0x180
    [ 4307.593595]  [<ffffffff810289ef>] arch_cpu_idle+0xf/0x20
    [ 4307.593595]  [<ffffffff810b5563>] default_idle_call+0x33/0x50
    [ 4307.593595]  [<ffffffff810b57ba>] cpu_startup_entry+0x23a/0x420
    [ 4307.593595]  [<ffffffff8104311b>] start_secondary+0x10b/0x130
    [ 4307.593595] Code: 05 da cd e0 48 8b 45 98 f0 ff 80 00 01 00 00 4c 89 e7 e8 f2 16 fc e0 e9 e1 fa ff ff 48 89 c7 8d 48 01 48 c1 ef 20 49 89 f9 eb 03 <44> 89 c1 bf 0f 00 00 00 44 8d 41 01 d3 e7 44 01 cf 44 0f b7 d7 
    [ 4307.593595] Kernel panic - not syncing: softlockup: hung tasks
    [ 4307.593595] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O L  4.8.15-tfw #45
    [ 4307.593595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    [ 4307.593595]  0000000000000000 ffff88007fd03b78 ffffffff8135ecbe ffff88005c8e1700
    [ 4307.593595]  ffffffff819dcb0e ffff88007fd03bf0 ffffffff8117783e 0000000000000008
    [ 4307.593595]  ffff88007fd03c00 ffff88007fd03ba0 0000000000000001 0000000000000000
    [ 4307.593595] Call Trace:
    [ 4307.593595]  <IRQ>  [<ffffffff8135ecbe>] dump_stack+0x67/0x99
    [ 4307.593595]  [<ffffffff8117783e>] panic+0xda/0x220
    [ 4307.593595]  [<ffffffff81131100>] watchdog_timer_fn+0x1d0/0x1d0
    [ 4307.593595]  [<ffffffff81130f30>] ? watchdog+0x40/0x40
    [ 4307.593595]  [<ffffffff810e9ba0>] __hrtimer_run_queues+0x100/0x500
    [ 4307.593595]  [<ffffffff810ea38b>] hrtimer_interrupt+0xab/0x1b0
    [ 4307.593595]  [<ffffffff81044ae5>] local_apic_timer_interrupt+0x35/0x60
    [ 4307.593595]  [<ffffffff81653a1d>] smp_apic_timer_interrupt+0x3d/0x50
    [ 4307.593595]  [<ffffffff81652ce9>] apic_timer_interrupt+0x89/0x90
    [ 4307.593595]  [<ffffffffa068fb82>] ? tfw_apm_prcntl_tmfn+0x7a2/0x970 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810be849>] ? __lock_is_held+0x49/0x70
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e677e>] call_timer_fn+0x9e/0x310
    [ 4307.593595]  [<ffffffff810e66e5>] ? call_timer_fn+0x5/0x310
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e76b9>] run_timer_softirq+0x1c9/0x620
    [ 4307.593595]  [<ffffffff810f07a6>] ? ktime_get+0x96/0x120
    [ 4307.593595]  [<ffffffff810f7454>] ? clockevents_program_event+0x44/0x110
    [ 4307.593595]  [<ffffffff81653d68>] __do_softirq+0xc8/0x473
    [ 4307.593595]  [<ffffffff81070828>] irq_exit+0x98/0xb0
    [ 4307.593595]  [<ffffffff81653a22>] smp_apic_timer_interrupt+0x42/0x50
    [ 4307.593595]  [<ffffffff81652ce9>] apic_timer_interrupt+0x89/0x90
    [ 4307.593595]  <EOI>  [<ffffffff81028085>] ? default_idle+0x25/0x180
    [ 4307.593595]  [<ffffffff81028083>] ? default_idle+0x23/0x180
    [ 4307.593595]  [<ffffffff810289ef>] arch_cpu_idle+0xf/0x20
    [ 4307.593595]  [<ffffffff810b5563>] default_idle_call+0x33/0x50
    [ 4307.593595]  [<ffffffff810b57ba>] cpu_startup_entry+0x23a/0x420
    [ 4307.593595]  [<ffffffff8104311b>] start_secondary+0x10b/0x130

The oops corresponds to following code in APM:

   # nm tempesta_fw.ko|grep tfw_apm_prcntl_tmfn
   00000000000013e0 t tfw_apm_prcntl_tmfn

   # addr2line -e tempesta_fw.ko -i `perl -e 'printf("%x", 0x13e0+0x7a2)'`
   /root/tempesta/tempesta_fw/apm.c:180
   /root/tempesta/tempesta_fw/apm.c:353
   /root/tempesta/tempesta_fw/apm.c:924
keshonok commented 7 years ago

APM has been reworked in #753. Bugs that caused the crashes were eliminated.