imq / linuximq

Pseudo-driver for the intermediate queue device.
https://imq.github.io/
GNU General Public License v2.0
99 stars 51 forks source link

System hangs #86

Closed KaloNK closed 4 years ago

KaloNK commented 4 years ago

We have replaced our core router with a new machine (+kernel and linux version) and now it hangs from time time. It happens mostly when the script updates the shapers i.e. when tc adds qdiscs and classes to the imq device. Usually there is nothing in the logs and the changes are quite a lot, so it was difficult to narrow it down, but this time we got a call trace in the log and looks like imq (related) kernel bug:

[2663308.555346] rcu: INFO: rcu_sched self-detected stall on CPU [2663308.555351] rcu: #0119-...!: (60000 ticks this GP) idle=02e/1/0x4000000000000002 softirq=1406852195/1406852195 fqs=0 last_accelerate: 86b2/7112, nonlazy_posted: 0, .. [2663308.555351] rcu: #011 (t=60000 jiffies g=2405964101 q=974) [2663308.555354] rcu: rcu_sched kthread starved for 60000 jiffies! g2405964101 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=8 [2663308.555354] rcu: RCU grace-period kthread stack dump: [2663308.555355] rcu_sched I 0 10 2 0x80000000 [2663308.555356] Call Trace: [2663308.555363] ? schedule+0x225/0x850 [2663308.555364] schedule+0x32/0x80 [2663308.555365] schedule_timeout+0x176/0x360 [2663308.555368] ? next_timer_interrupt+0xd0/0xd0 [2663308.555372] rcu_gp_kthread+0x404/0x830 [2663308.555373] ? sync_sched_exp_handler+0xa0/0xa0 [2663308.555376] kthread+0x117/0x130 [2663308.555377] ? kthread_create_worker_on_cpu+0x70/0x70 [2663308.555378] ret_from_fork+0x35/0x40 [2663308.555382] Sending NMI from CPU 9 to CPUs 1: [2663308.555451] NMI backtrace for cpu 1 [2663308.555452] CPU: 1 PID: 17 Comm: ksoftirqd/1 Tainted: G O 4.19.75 #1 [2663308.555452] Hardware name: Supermicro Super Server/X11SPM-TPF, BIOS 3.0c 03/27/2019 [2663308.555452] RIP: 0010:queued_spin_lock_slowpath+0x138/0x1a0 [2663308.555453] Code: 85 c9 74 37 c7 41 08 01 00 00 00 65 ff 0d 7c 4c f2 6b c3 89 c2 30 e6 a9 00 00 ff ff 75 6b 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f3 90 48 8b 0a 48 85 [2663308.555453] RSP: 0018:ffff9f138033ba80 EFLAGS: 00000202 [2663308.555453] RAX: 0000000000280101 RBX: 0000000000000001 RCX: 0000000000000000 [2663308.555454] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff920469443080 [2663308.555454] RBP: 0000000000000000 R08: ffff9204ce589990 R09: 000000000003d08f [2663308.555454] R10: 00000000fffffffe R11: 0000000000000001 R12: ffff92043a8f9900 [2663308.555454] R13: ffff920469443000 R14: ffff92043b60d000 R15: ffff920469445000 [2663308.555455] FS: 0000000000000000(0000) GS:ffff9204ef840000(0000) knlGS:0000000000000000 [2663308.555455] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2663308.555455] CR2: 00007f8159cb3900 CR3: 00000002d2a0a006 CR4: 00000000007606e0 [2663308.555455] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [2663308.555456] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [2663308.555456] PKRU: 55555554 [2663308.555456] Call Trace: [2663308.555456] imq_nf_queue+0x3ab/0x780 [imq] [2663308.555456] ? ipt_do_table+0x35c/0x620 [ip_tables] [2663308.555456] imq_nf_queue+0x1f3/0x30b [imq] [2663308.555457] ? kmem_cache_alloc_trace+0x43/0x1d0 [2663308.555457] nf_queue+0x158/0x2e0 [2663308.555457] nf_hook_slow+0x8a/0xf0 [2663308.555457] ip_output+0xd7/0xf0 [2663308.555457] ? __ip_flush_pending_frames.isra.0+0x80/0x80 [2663308.555458] ip_forward+0x3fe/0x460 [2663308.555458] ? ip_defrag.cold+0x37/0x37 [2663308.555458] ip_rcv+0xb5/0xc0 [2663308.555458] ? ip_rcv_finish_core.isra.0+0x380/0x380 [2663308.555458] netif_receive_skb_one_core+0x53/0x70 [2663308.555459] netif_receive_skb_internal+0x55/0x100 [2663308.555459] napi_gro_receive+0x5a/0xf0 [2663308.555459] +0x8dd/0x15d0 [i40e]

I have found this bug ( https://bugzilla.kernel.org/show_bug.cgi?id=205025 ) which looks similar, but it is for 5.3 while my kernel is 4.19.75 I have found this link ( https://stackoverflow.com/questions/51236711/is-it-safe-to-use-rcu-dereference-inside-local-bh-disable-local-bh-enable ), which states that when rcu_read_lock_bh is used rcu_dereference_bh should be used, while in imq there is "q = rcu_dereference(txq->qdisc);" ... can this be the cause for the hang?

davidkek commented 4 years ago

Hi KaloNK You will have other problem test shaper Up/Down from Customer Internet to Customer Shaper Work fine. But Customer to Customer Shaper work only on Download on Upload not work.

Test is : Cuscomer A : 2m/2m Customer B : 5m/5m

Download/Upload test From Customer A to Customer B Upload : 5mbit (Upload shape by shaper from Cuscormer B not use Shaper for Customer A) Download: 2Mbit

KaloNK commented 4 years ago

Hi davidkek, i know that. We use IMQ for several years and there is dedicated IMQ for the local traffic. When using IMQ for local traffic the speed is matching the speed of the initiator i.e. : if Customer A is downloading from Customer B - the speed is 2m if Customer B is downloading from Customer A - the speed is 5m The problem is not with the IMQ use, but with the system hanging

davidkek commented 4 years ago

but can you give an example with a local imq. For system hanging may be k0ste or MihaiC need to help

KaloNK commented 4 years ago

This is not related to the issue can we move this to mail

KaloNK commented 4 years ago

The freeze is triggered from UDP, with TCP it works fine. If the shapers are on bond0 (uncomment the line) it also works OK, so the problem is only with IMQ and UDP. Attached is a script to reproduce it. reproduce.txt

We have used iperf to generate traffic between the two hosts and noticed that even on bond0 if the UDP stream is bigger than the shaper all packets are dropped, so it is probably some kind of DOS protection from the kernel involved.

EDIT: The problem is present with TCP too, so it is not UDP only problem

KaloNK commented 4 years ago

Tested different kernel versions in addition to 4.19 for the problem: 5.4 - present 4.14 - present 4.4 - seems to work OK 4.9 - present so the braking change is between 4.4 and 4.9

h3lll1 commented 4 years ago

I think problem occurs some time after 4.9.65 version, because i used to have this kernel, and there were no system hangs. After that i have jumped to 4.9.105, and started to have problems.

KaloNK commented 4 years ago

Thanks!!! I have tested 4.9.206 and got the hangs ... 4.9.65 - 4.9.105 is just 6 or 7 compilations to pinpoint the exact version, so will try to get that info this year ;)

EDIT: unfortunately even 4.9.65 hangs EDIT2: 4.7.10 seems to work OK, but 4.8.17 (and 4.8.8) does not compile and stops with error in net/core ... checking for the last minor version in 4.8 that works

KaloNK commented 4 years ago

The problem starts with 4.8.0, while 4.7.10 works OK. Looking at the patch (no changelog file) there are quite a lot of changes to rcu and cpu task scheduler code and net/core+netfilter too, but nothing obvious so far

h3lll1 commented 4 years ago

Looks like i cannot even test 4.4.207 (latest), patching gives me errors. Happy new year!

KaloNK commented 4 years ago

use the patch for 4.4.32 and change: qh = rcu_dereference(queue_handler); to qh = rcu_dereference(net->nf.queue_handler); in both places

KaloNK commented 4 years ago

Changing the locks in net/core/skbuff.c to "spin_lock_irqsave(&skb_cb_store_lock, flags);" fixes the problem

EDIT: 72h test passed. Attached is the patch with the changes made skbuff.diff.txt

haegar commented 4 years ago

I was having quite similar problems, also in 4.19.XX (last tries with 4.19.93) and thanks @KaloNK, your lock investigations seems toi point in the correct direction.

But I tried a slightly different solution: get rid of this spinlock completely, and (since skb_copy_stored_cb is unused) also get rid of the useless refcnt atomic_t handling. As it was the locking does not make sense, the atomic_dec_and_test() plus kmem_cache_free() does not need to be locked, the only thing where a lock could make sense is preventing a double-free and protecting the cb_next/linked list modifications, which the code did not do, and without using skb_copy_stored_cb the refcnt was only ever initialized to one and then decremented and freed.

(skb_copy_stored_cb was also terminally broken, as it only increased the refcnt for the newest linked list entry, and not for the whole chain as would have been needed to safely share the list)

And afterwards I found basically the same thing was already done by someone else: the kernel patch linux-4.0-imq.diff already dropped this lock, ChangeLog.txt contains "2015-06-24 Feng Gao removed usless skb_cb_store_lock of 4.0 patch. It will be merged into other patches if it is ok." - but this was never merged into the 4.1 patch, and I assume all future versions were based on that.

Attached my change for 4.19, as a patch to be applied to the 4.19.x kernel tree after applying linux-4.16-imq.diff linux-imq-drop-skb_cb_store_lock.diff.txt

So far this survived around 24 hours in production at one of the worst affected sites, where before the uptime of a single server (them running in failover) rarely was longer than 1-2 hours, plus around 10 hours in my load-test environment where I had problems to exactly reproduce these crashes.

KaloNK commented 4 years ago

I haven't noticed that skb_copy_stored_cb is in fact unused and your solution seems better. Will make the changes and run the tests again, as the crashes are reliably reproduced here

EDIT: seems stable so far - attaching the full patch for 4.19.94 and 5.4.10 so they can be used as a base for the future. linux-4.19.94-imq.diff.txt linux-5.4.10-imq.diff.txt

Will confirm on Monday if the machine survives the 72h test

EDIT2: Test PASSED with ~2GB of traffic and shapers restart every 10 to 20 sec

# uptime 
 16:12:29 up 3 days, 4 min,  2 users,  load average: 0.02, 0.10, 0.11
imq commented 4 years ago

Thanks @KaloNK for the patch and investigations. I will test them also and publish them

h3lll1 commented 4 years ago

This is the solution! Thank you @haegar

k0ste commented 4 years ago

@KaloNK PR with this fix?

imq commented 4 years ago

I did merged it now.

haegar commented 4 years ago

@imq the commited linux-4.19-imq.diff is missing the removal of the atomic_t refcnt from struct skb_cb_table, but includes the wanted removal of the atomic handling, making this field unused now.

linux-5.4-imq.patch correctly removes it.

junnanx commented 3 years ago

hi everyone

I also met the ipt_do_table similar problem which rcu_sched detected CPU stall in ipt_do_table function at linux-4.19. The detail call stack as following. I want to know this issues can be solved after 2021-04-16 commit, right?

2021-04-16 | netfilter: x_tables: fix compat match/target pad out-of-bound write netfilter: x_tables: fix compat match/target pad out-of-bound write commit b29c457a6511435960115c0f548c4360d5f4801d upstream.

xt_compat_match/target_from_user doesn't check that zeroing the area to start of next rule won't write past end of allocated ruleset blob.

Remove this code and zero the entire blob beforehand.

Reported-by: syzbot+cfc0247ac173f597aaaa@syzkaller.appspotmail.com Reported-by: Andy Nguyen theflow@google.com Fixes: 9fa492cdc160c ("[NETFILTER]: x_tables: simplify compat API") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

[ 65.692391] <1>-(1)[14070:dbus-daemon]rcu: [name:tree&]INFO: rcu_sched self-detected stall on CPU [ 65.695564] <1>-(1)[14070:dbus-daemon]rcu: [name:tree&] 1-....: (1 GPs behind) idle=3ca/1/0x4000000000000002 softirq=6277/6287 fqs=1049 [ 65.697110] <1>-(1)[14070:dbus-daemon]rcu: [name:tree&] (t=2100 jiffies g=3745 q=71718) [ 65.698137] <1>-(1)[14070:dbus-daemon][name:core&]Task dump for CPU 1: [ 65.698965] <1>-(1)[14070:dbus-daemon]dbus-daemon R running task 0 14070 12179 0x00400002 [ 65.700144] <1>-(1)[14070:dbus-daemon]Call trace: [ 65.700746] <1>-(1)[14070:dbus-daemon] dump_backtrace+0x0/0x150 [ 65.701498] <1>-(1)[14070:dbus-daemon] show_stack+0x24/0x30 [ 65.702207] <1>-(1)[14070:dbus-daemon] sched_show_task+0xf0/0x124 [ 65.702982] <1>-(1)[14070:dbus-daemon] dump_cpu_task+0x7c/0x9c [ 65.703723] <1>-(1)[14070:dbus-daemon] rcu_dump_cpu_stacks+0xf0/0xfc [ 65.704529] <1>-(1)[14070:dbus-daemon] rcu_check_callbacks+0x39c/0x880 [ 65.705358] <1>-(1)[14070:dbus-daemon] update_process_times+0x34/0x5c [ 65.706176] <1>-(1)[14070:dbus-daemon] tick_sched_handle.isra.4+0x54/0x60 [ 65.707036] <1>-(1)[14070:dbus-daemon] tick_sched_timer+0x50/0xc0 [ 65.707809] <1>-(1)[14070:dbus-daemon] hrtimer_run_queues+0x1d4/0x2b0 [ 65.708649] <1>-(1)[14070:dbus-daemon] hrtimer_interrupt+0x118/0x278 [ 65.709456] <1>-(1)[14070:dbus-daemon] arch_timer_handler_phys+0x38/0x50 [ 65.710307] <1>-(1)[14070:dbus-daemon] handle_percpu_devid_irq+0xf0/0x204 [ 65.711169] <1>-(1)[14070:dbus-daemon] generic_handle_irq+0x2c/0x44 [ 65.711963] <1>-(1)[14070:dbus-daemon] handle_domain_irq+0xb8/0xbc [ 65.712770] <1>-(1)[14070:dbus-daemon] gic_handle_irq+0xd0/0x178 [ 65.713532] <1>-(1)[14070:dbus-daemon] el1_irq+0xe8/0x180 [ 65.714223] <1>-(1)[14070:dbus-daemon] ipt_do_table+0x2dc/0x698 [ip_tables] [ 65.715106] <1>-(1)[14070:dbus-daemon] 0xffffff8001008030 [ 65.715792] <1>-(1)[14070:dbus-daemon] nf_hook_slow+0x58/0xc8 [ 65.716523] <1>-(1)[14070:dbus-daemon] ip_local_out+0xb4/0xc8 [ 65.717275] <1>-(1)[14070:dbus-daemon] ip_local_out+0x34/0x68 [ 65.718004] <1>-(1)[14070:dbus-daemon] ip_queue_xmit+0x29c/0x31c [ 65.718788] <1>-(1)[14070:dbus-daemon] ip_queue_xmit+0x10/0x18 [ 65.719530] <1>-(1)[14070:dbus-daemon] tcp_transmit_skb+0x750/0x91c [ 65.720347] <1>-(1)[14070:dbus-daemon] tcp_transmit_skb+0x40/0x50 [ 65.721121] <1>-(1)[14070:dbus-daemon] tcp_write_xmit+0x704/0xc7c [ 65.721894] <1>-(1)[14070:dbus-daemon] tcp_push_pending_frames+0x5c/0xf8 [ 65.722765] <1>-(1)[14070:dbus-daemon] tcp_push+0xac/0x130 [ 65.723462] <1>-(1)[14070:dbus-daemon] tcp_sendmsg_locked+0x4cc/0xa74 [ 65.724278] <1>-(1)[14070:dbus-daemon] tcp_sendmsg+0x40/0x60 [ 65.724997] <1>-(1)[14070:dbus-daemon] inet_sendmsg+0xb8/0xe8 [ 65.725728] <1>-(1)[14070:dbus-daemon] socksendmsg+0x4c/0x68 [ 65.726457] <1>-(1)[14070:dbus-daemon] sys_sendmsg+0x278/0x2cc [ 65.727230] <1>-(1)[14070:dbus-daemon] sys_sendmsg+0x60/0x90 [ 65.727971] <1>-(1)[14070:dbus-daemon] __arm64_compat_sys_sendmsg+0x2c/0x38 [ 65.728854] <1>-(1)[14070:dbus-daemon] el0_svc_common.constprop.0+0x94/0x114 [ 65.729746] <1>-(1)[14070:dbus-daemon] el0_svc_compat_handler+0x2c/0x38 [ 65.730584] <1>-(1)[14070:dbus-daemon] el0_svc_compat+0x8/0x34

Best Regards Xu Junnan