multipath-tcp / mptcp

⚠️⚠️⚠️ Deprecated 🚫 Out-of-tree Linux Kernel implementation of MultiPath TCP. 👉 Use https://github.com/multipath-tcp/mptcp_net-next repo instead ⚠️⚠️⚠️
https://github.com/multipath-tcp/mptcp_net-next
Other
888 stars 336 forks source link

mptcp_sock_def_error_report+0xfa/0x110 #315

Closed knxhm closed 5 years ago

knxhm commented 5 years ago

Hi I found messages like below in my Kernel log on 2 different machines which are running similar traffic with lots of parallel MPTCP connections. Dont know whats the implications and what is about to happen when the load further increases.

[Sun Feb 10 17:45:50 2019] ------------[ cut here ]------------
[Sun Feb 10 17:45:50 2019] WARNING: CPU: 1 PID: 0 at net/mptcp/mptcp_ctrl.c:632 mptcp_sock_def_error_report+0xfa/0x110
[Sun Feb 10 17:45:50 2019] Modules linked in: ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE nf_nat_masquerade_ipv4 binfmt_misc xt_multiport ip6table_filter ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_comment xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 tun xt_conntrack nf_conntrack libcrc32c iptable_filter sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper evdev cryptd snd_pcm snd_timer snd vmw_balloon soundcore serio_raw pcspkr sg vmw_vsock_vmci_transport vsock vmwgfx ttm shpchp drm_kms_helper drm vmw_vmci ac button mptcp_rr mptcp_balia ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic ata_piix crc32c_intel psmouse vmw_pvscsi libata vmxnet3 scsi_mod i2c_piix4 floppy
[Sun Feb 10 17:45:50 2019] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.14.79+ #1
[Sun Feb 10 17:45:50 2019] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/03/2018
[Sun Feb 10 17:45:50 2019] task: ffff880236e52100 task.stack: ffffc90000cc8000
[Sun Feb 10 17:45:50 2019] RIP: 0010:mptcp_sock_def_error_report+0xfa/0x110
[Sun Feb 10 17:45:50 2019] RSP: 0018:ffff88023fc83df8 EFLAGS: 00010292
[Sun Feb 10 17:45:50 2019] RAX: 000000000000003c RBX: ffff8801a93c6b80 RCX: 000000000000083f
[Sun Feb 10 17:45:50 2019] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[Sun Feb 10 17:45:50 2019] RBP: ffff880190ddf100 R08: 0000000000000000 R09: 00000000000011d9
[Sun Feb 10 17:45:50 2019] R10: 000000009481c8f5 R11: 0000000000000001 R12: ffff880176308700
[Sun Feb 10 17:45:50 2019] R13: ffffffff820d8c80 R14: ffff8801a93c6b80 R15: ffff88023fc9a830
[Sun Feb 10 17:45:50 2019] FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[Sun Feb 10 17:45:50 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Feb 10 17:45:50 2019] CR2: 00007f816259a358 CR3: 000000000200a004 CR4: 00000000001606e0
[Sun Feb 10 17:45:50 2019] Call Trace:
[Sun Feb 10 17:45:50 2019]  <IRQ>
[Sun Feb 10 17:45:50 2019]  tcp_write_err+0x37/0x160
[Sun Feb 10 17:45:50 2019]  tcp_write_timeout+0x74/0x420
[Sun Feb 10 17:45:50 2019]  tcp_retransmit_timer+0x2aa/0x4d0
[Sun Feb 10 17:45:50 2019]  ? tcp_write_timer_handler+0x240/0x240
[Sun Feb 10 17:45:50 2019]  mptcp_sub_retransmit_timer+0xe/0xc0
[Sun Feb 10 17:45:50 2019]  tcp_write_timer_handler+0xe3/0x240
[Sun Feb 10 17:45:50 2019]  tcp_write_timer+0xbc/0xc0
[Sun Feb 10 17:45:50 2019]  call_timer_fn+0x30/0x130
[Sun Feb 10 17:45:50 2019]  run_timer_softirq+0x1d6/0x420
[Sun Feb 10 17:45:50 2019]  ? tick_sched_handle+0x23/0x60
[Sun Feb 10 17:45:50 2019]  ? tick_sched_timer+0x34/0x70
[Sun Feb 10 17:45:50 2019]  __do_softirq+0xf8/0x29d
[Sun Feb 10 17:45:50 2019]  irq_exit+0xb6/0xc0
[Sun Feb 10 17:45:50 2019]  smp_apic_timer_interrupt+0x78/0x140
[Sun Feb 10 17:45:50 2019]  apic_timer_interrupt+0x7d/0x90
[Sun Feb 10 17:45:50 2019]  </IRQ>
[Sun Feb 10 17:45:50 2019] RIP: 0010:native_safe_halt+0x2/0x10
[Sun Feb 10 17:45:50 2019] RSP: 0018:ffffc90000ccbed8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[Sun Feb 10 17:45:50 2019] RAX: ffffffff81633360 RBX: ffff880236e52100 RCX: 0000000000000000
[Sun Feb 10 17:45:50 2019] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[Sun Feb 10 17:45:50 2019] RBP: 0000000000000001 R08: 000558f42d693883 R09: ffffc90001037a68
[Sun Feb 10 17:45:50 2019] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880236e52100
[Sun Feb 10 17:45:50 2019] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[Sun Feb 10 17:45:50 2019]  ? ldsem_down_write+0x210/0x210
[Sun Feb 10 17:45:50 2019]  default_idle+0x18/0xf0
[Sun Feb 10 17:45:50 2019]  do_idle+0x163/0x1d0
[Sun Feb 10 17:45:50 2019]  cpu_startup_entry+0x5f/0x70
[Sun Feb 10 17:45:50 2019]  start_secondary+0x19e/0x1e0
[Sun Feb 10 17:45:50 2019]  secondary_startup_64+0xa5/0xb0
[Sun Feb 10 17:45:50 2019] Code: 8b 41 0f b6 74 24 20 4c 8b 45 60 48 c7 c7 c8 3e dc 81 89 f2 89 f1 40 c0 ee 06 c0 ea 07 83 e1 01 83 e6 01 0f b6 d2 e8 11 02 ae ff <0f> 0b e9 5a ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00
[Sun Feb 10 17:45:50 2019] ---[ end trace 0b55cb9292ee0017 ]---

Pls have a look on this.

Regards Franz

matttbe commented 5 years ago

Hi Franz,

Thank you for this bug report.

By chance, are you able to reproduce this warning with the latest mptcp_v0.94 based on the kernel v4.14.94 with extra fixes?

knxhm commented 5 years ago

Thanks Matt, I will do that but it may take some (longer) time, it is a production system and the messages appear in irregular intervals.

kpeletidis commented 5 years ago

I saw this error on production systems too. Usually a reboot follows.

I have a 4.14.79 kernel with 6ece8f4d (Correctly initialize IPv6 fields) as the latest commit. I'll try to retest with that one when possible.

Kostas

On Tue, 12 Feb 2019, 10:15 knxhm <notifications@github.com wrote:

Thanks Matt, I will do that but it may take some (longer) time, it is a production system and the messages appear in irregular intervals.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/multipath-tcp/mptcp/issues/315#issuecomment-462700453, or mute the thread https://github.com/notifications/unsubscribe-auth/ADgmo7vI6-Hc9cDp2aQ811fiMj4biRRDks5vMpRSgaJpZM4a06i9 .

cpaasch commented 5 years ago

I doubt that latest v0.94 has a fix for this scenario. We might need more debug-info. What seems to be happening is that after a fallback we still have a lingering subflow whose timer is expiring. Might be worth it to not only print the meta-socket's state but also more info about subflows. That would help us narrow the issue down.

knxhm commented 5 years ago

Christoph, would you want to prepare a debug patch on top of the latest 4.14.94+ to get more debug-info ?

kpeletidis commented 5 years ago

I have also seen this issue with a very recent v0.94 kernel (Linux-4.14.105 plus MPTCP commits up to and including 3f01458).

@cpaasch are there any particular fields of struct mptcp_tcp_sock that you would like to see traced in the warning? Will it be too slow to walk the linked list of mptcp_tcp_sock elements and trace a few fields from each?

cpaasch commented 5 years ago

You can apply the below patch:

I found two more cases of mptcp_sock_def_error_report with syzkaller that I am currently trying to root-cause. Might be the same issue.

From 7f8c657d5d003dce31d162635f91238757541550 Mon Sep 17 00:00:00 2001
From: Christoph Paasch <cpaasch@apple.com>
Date: Wed, 13 Mar 2019 09:37:01 -0700
Subject: [PATCH] DEBUG mptcp_sock_def_error_report

---
 include/net/mptcp.h      | 6 +++---
 net/ipv4/tcp.c           | 9 ++++++++-
 net/ipv4/tcp_ipv4.c      | 2 ++
 net/ipv6/tcp_ipv6.c      | 2 ++
 net/mptcp/mptcp_ctrl.c   | 7 ++++++-
 net/mptcp/mptcp_input.c  | 5 +++--
 net/mptcp/mptcp_output.c | 1 +
 7 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index cf56185687c0..70fea582911a 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -1255,11 +1255,11 @@ static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
    if (mpcb->infinite_mapping_snd)
        return false;

-   pr_err("%s %#x will fallback - pi %d, src %pI4:%u dst %pI4:%u rcv_nxt %u from %pS\n",
-          __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
+   pr_err("%s %p %#x will fallback - pi %d, src %pI4:%u dst %pI4:%u rcv_nxt %u from %pS master %u\n",
+          __func__, mpcb, mpcb->mptcp_loc_token, tp->mptcp->path_index,
           &inet_sk(sk)->inet_saddr, ntohs(inet_sk(sk)->inet_sport),
           &inet_sk(sk)->inet_daddr, ntohs(inet_sk(sk)->inet_dport),
-          tp->rcv_nxt, __builtin_return_address(0));
+          tp->rcv_nxt, __builtin_return_address(0), is_master_tp(tp));
    if (!is_master_tp(tp)) {
        MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_FBACKSUB);
        return true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9b01e689fe34..9828f82e8dba 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2398,8 +2398,11 @@ int tcp_disconnect(struct sock *sk, int flags)
    int err = 0;
    int old_state = sk->sk_state;

-   if (old_state != TCP_CLOSE)
+   if (old_state != TCP_CLOSE) {
+   if (is_meta_sk(sk) && tp->mpcb->infinite_mapping_rcv)
+       pr_err("%s closing %p\n", __func__, tp->mpcb);
        tcp_set_state(sk, TCP_CLOSE);
+   }

    /* ABORT function of RFC793 */
    if (old_state == TCP_LISTEN) {
@@ -3637,11 +3640,15 @@ EXPORT_SYMBOL(tcp_md5_hash_key);
 void tcp_done(struct sock *sk)
 {
    struct request_sock *req = tcp_sk(sk)->fastopen_rsk;
+   struct tcp_sock *tp = tcp_sk(sk);

    if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
        TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);

    WARN_ON(sk->sk_state == TCP_CLOSE);
+
+   if (is_meta_sk(sk) && tp->mpcb->infinite_mapping_rcv)
+       pr_err("%s closing %p from %pS\n", __func__, tp->mpcb, __builtin_return_address(0));
    tcp_set_state(sk, TCP_CLOSE);

    tcp_clear_xmit_timers(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 52054cf187ac..f39436e66b4d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -266,6 +266,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
     * This unhashes the socket and releases the local port,
     * if necessary.
     */
+   if (is_meta_sk(sk) && tp->mpcb->infinite_mapping_rcv)
+       pr_err("%s closing %p\n", __func__, tp->mpcb);
    tcp_set_state(sk, TCP_CLOSE);
    ip_rt_put(rt);
    sk->sk_route_caps = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index af375a34f07b..13904e0f20ec 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -313,6 +313,8 @@ int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
    return 0;

 late_failure:
+   if (is_meta_sk(sk) && tp->mpcb->infinite_mapping_rcv)
+       pr_err("%s closing %p\n", __func__, tp->mpcb);
    tcp_set_state(sk, TCP_CLOSE);
 failure:
    inet->inet_dport = 0;
diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
index a16a4d6e68a1..d5f3485c7e36 100644
--- a/net/mptcp/mptcp_ctrl.c
+++ b/net/mptcp/mptcp_ctrl.c
@@ -641,7 +641,8 @@ static void mptcp_sock_def_error_report(struct sock *sk)
            meta_sk->sk_error_report(meta_sk);

        WARN(meta_sk->sk_state == TCP_CLOSE,
-            "Meta already closed i_rcv %u i_snd %u send_i %u flags %#lx\n",
+            "Meta %p already closed i_rcv %u i_snd %u send_i %u flags %#lx\n",
+            mpcb,
             mpcb->infinite_mapping_rcv, mpcb->infinite_mapping_snd,
             mpcb->send_infinite_mapping, meta_sk->sk_flags);

@@ -1755,6 +1756,8 @@ void mptcp_close(struct sock *meta_sk, long timeout)
    if (data_was_unread) {
        /* Unread data was tossed, zap the connection. */
        NET_INC_STATS(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
+   if (meta_tp->mpcb->infinite_mapping_rcv)
+       pr_err("%s closing %p\n", __func__, meta_tp->mpcb);
        tcp_set_state(meta_sk, TCP_CLOSE);
        tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
                            meta_sk->sk_allocation);
@@ -1834,6 +1837,8 @@ void mptcp_close(struct sock *meta_sk, long timeout)

    if (meta_sk->sk_state == TCP_FIN_WAIT2) {
        if (meta_tp->linger2 < 0) {
+   if (meta_tp->mpcb->infinite_mapping_rcv)
+       pr_err("%s closing linger %p\n", __func__, meta_tp->mpcb);
            tcp_set_state(meta_sk, TCP_CLOSE);
            meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
            __NET_INC_STATS(sock_net(meta_sk),
diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
index aa21ccf6f814..c2301648aaac 100644
--- a/net/mptcp/mptcp_input.c
+++ b/net/mptcp/mptcp_input.c
@@ -541,8 +541,8 @@ static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
     */
    if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
        !tp->mptcp->mapping_present && !mpcb->infinite_mapping_rcv) {
-       pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
-              __func__, mpcb->mptcp_loc_token,
+       pr_err("%s %p %#x will fallback - pi %d from %pS, seq %u\n",
+              __func__, mpcb, mpcb->mptcp_loc_token,
               tp->mptcp->path_index, __builtin_return_address(0),
               TCP_SKB_CB(skb)->seq);

@@ -690,6 +690,7 @@ static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
    /* Mapping not yet set on this subflow - we set it here! */

    if (!data_len) {
+   pr_err("%s %p falling back\n", __func__, mpcb);
        mpcb->infinite_mapping_rcv = 1;
        mpcb->send_infinite_mapping = 1;
        tp->mptcp->fully_established = 1;
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
index 6f368e935950..621ac3eaa399 100644
--- a/net/mptcp/mptcp_output.c
+++ b/net/mptcp/mptcp_output.c
@@ -471,6 +471,7 @@ static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
        !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
        tp->mptcp->fully_established = 1;
        tp->mpcb->infinite_mapping_snd = 1;
+       pr_err("%s %p sending infinite\n", __func__, tp->mpcb);
        tp->mptcp->infinite_cutoff_seq = tp->write_seq;
        tcb->mptcp_flags |= MPTCPHDR_INF;
    }
-- 
2.16.2
kpeletidis commented 5 years ago

Kernel stability has improved a lot with the latest kernel, thanks. I finally saw one more instance of this warning with a kernel with Christoph's patch.

Mar 19 08:08:50 homeagent4 kernel: [127714.432601] TCP: tcp_ack resetting flow
Mar 19 08:08:57 homeagent4 kernel: [127721.924829] mptcp_prevalidate_skb ffff888191cf3100 0xc0cfbede will fallback - pi 1 from tcp_data_queue+0x35b/0x4b0, seq 3461279010
Mar 19 08:08:57 homeagent4 kernel: [127722.066026] TCP: tcp_done closing ffff888191cf3100 from tcp_ack+0xb63/0x12c0
Mar 19 08:08:58 homeagent4 kernel: [127722.433338] mptcp_prevalidate_skb ffff88818276ea00 0xbd97d288 will fallback - pi 1 from tcp_data_queue+0x35b/0x4b0, seq 4252167839
Mar 19 08:08:58 homeagent4 kernel: [127722.637573] TCP: tcp_done closing ffff88818276ea00 from tcp_ack+0xb63/0x12c0
Mar 19 08:08:58 homeagent4 kernel: [127722.689861] Meta ffff888181101180 already closed i_rcv 1 i_snd 1 send_i 0 flags 0x2000303
Mar 19 08:08:58 homeagent4 kernel: [127722.689884] ------------[ cut here ]------------
Mar 19 08:08:58 homeagent4 kernel: [127722.689890] WARNING: CPU: 0 PID: 0 at net/mptcp/mptcp_ctrl.c:647 mptcp_sock_def_error_report+0xff/0x110
Mar 19 08:08:58 homeagent4 kernel: [127722.689891] Modules linked in: cls_fw ipip cls_u32 sch_htb xt_policy authenc echainiv xfrm6_mode_tunnel xfrm4_mode_tunnel fou ip6_udp_tunnel udp_tunnel ip_tunnel xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo dummy ip6table_filter ip6_tables iptable_filter xt_statistic xt_connmark xt_conntrack xt_TPROXY nf_defrag_ipv6 xt_addrtype iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ppdev aesni_intel aes_x86_64 crypto_simd glue_helper cryptd snd_pcsp snd_pcm serio_raw parport_pc snd_timer snd soundcore parport mptcp_fullmesh ip_tables autofs4 nvme nvme_core ena i2c_piix4
Mar 19 08:08:58 homeagent4 kernel: [127722.689930] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.105-mptcp-kp2-3f01458b #12
Mar 19 08:08:58 homeagent4 kernel: [127722.689931] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
Mar 19 08:08:58 homeagent4 kernel: [127722.689932] task: ffffffff822134c0 task.stack: ffffffff82200000
Mar 19 08:08:58 homeagent4 kernel: [127722.689934] RIP: 0010:mptcp_sock_def_error_report+0xff/0x110
Mar 19 08:08:58 homeagent4 kernel: [127722.689936] RSP: 0018:ffff88822d003df0 EFLAGS: 00010286
Mar 19 08:08:58 homeagent4 kernel: [127722.689937] RAX: 000000000000004d RBX: ffff8881d9090000 RCX: 0000000000000000
Mar 19 08:08:58 homeagent4 kernel: [127722.689938] RDX: ffff88822d01e358 RSI: ffff88822d016618 RDI: ffff88822d016618
Mar 19 08:08:58 homeagent4 kernel: [127722.689939] RBP: ffff8881f71f1080 R08: 0000000000000001 R09: 0000000000000b01
Mar 19 08:08:58 homeagent4 kernel: [127722.689940] R10: 0000000000000000 R11: 0000000000000b01 R12: ffff888181101180
Mar 19 08:08:58 homeagent4 kernel: [127722.689941] R13: 0000000000000000 R14: ffff8881d9090000 R15: 0000000000000002
Mar 19 08:08:58 homeagent4 kernel: [127722.689942] FS:  0000000000000000(0000) GS:ffff88822d000000(0000) knlGS:0000000000000000
Mar 19 08:08:58 homeagent4 kernel: [127722.689943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 19 08:08:58 homeagent4 kernel: [127722.689944] CR2: 00007fdab821f640 CR3: 000000000220a006 CR4: 00000000007606f0
Mar 19 08:08:58 homeagent4 kernel: [127722.689948] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 19 08:08:58 homeagent4 kernel: [127722.689949] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 19 08:08:58 homeagent4 kernel: [127722.689949] PKRU: 55555554
Mar 19 08:08:58 homeagent4 kernel: [127722.689950] Call Trace:
Mar 19 08:08:58 homeagent4 kernel: [127722.689965]  <IRQ>
Mar 19 08:08:58 homeagent4 kernel: [127722.689969]  tcp_write_err+0x2b/0x50
Mar 19 08:08:58 homeagent4 kernel: [127722.689971]  tcp_write_timeout+0x7a/0x430
Mar 19 08:08:58 homeagent4 kernel: [127722.689973]  tcp_retransmit_timer+0x2aa/0x4e0
Mar 19 08:08:58 homeagent4 kernel: [127722.689974]  ? tcp_write_timer_handler+0x240/0x240
Mar 19 08:08:58 homeagent4 kernel: [127722.689976]  mptcp_sub_retransmit_timer+0xe/0xc0
Mar 19 08:08:58 homeagent4 kernel: [127722.689978]  tcp_write_timer_handler+0xed/0x240
Mar 19 08:08:58 homeagent4 kernel: [127722.689980]  tcp_write_timer+0xaf/0xc0
Mar 19 08:08:58 homeagent4 kernel: [127722.689983]  call_timer_fn+0x30/0x130
Mar 19 08:08:58 homeagent4 kernel: [127722.689985]  run_timer_softirq+0x1c7/0x400
Mar 19 08:08:58 homeagent4 kernel: [127722.689989]  ? timerqueue_add+0x54/0x80
Mar 19 08:08:58 homeagent4 kernel: [127722.689990]  ? enqueue_hrtimer+0x38/0x80
Mar 19 08:08:58 homeagent4 kernel: [127722.689994]  __do_softirq+0x10d/0x2b4
Mar 19 08:08:58 homeagent4 kernel: [127722.689999]  irq_exit+0xc2/0xd0
Mar 19 08:08:58 homeagent4 kernel: [127722.690001]  smp_apic_timer_interrupt+0x78/0x140
Mar 19 08:08:58 homeagent4 kernel: [127722.690003]  apic_timer_interrupt+0x7d/0x90
Mar 19 08:08:58 homeagent4 kernel: [127722.690004]  </IRQ>
Mar 19 08:08:58 homeagent4 kernel: [127722.690006] RIP: 0010:native_safe_halt+0x2/0x10
Mar 19 08:08:58 homeagent4 kernel: [127722.690007] RSP: 0018:ffffffff82203eb8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Mar 19 08:08:58 homeagent4 kernel: [127722.690008] RAX: ffffffff818415a0 RBX: ffffffff822134c0 RCX: 0000000000000000
Mar 19 08:08:58 homeagent4 kernel: [127722.690009] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Mar 19 08:08:58 homeagent4 kernel: [127722.690009] RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000003
Mar 19 08:08:58 homeagent4 kernel: [127722.690010] R10: ffffffff82203e28 R11: 0000000000000000 R12: ffffffff822134c0
Mar 19 08:08:58 homeagent4 kernel: [127722.690011] R13: ffffffff822134c0 R14: 0000000000000000 R15: 0000000000000000
Mar 19 08:08:58 homeagent4 kernel: [127722.690013]  ? __cpuidle_text_start+0x8/0x8
Mar 19 08:08:58 homeagent4 kernel: [127722.690015]  default_idle+0x1a/0xf0
Mar 19 08:08:58 homeagent4 kernel: [127722.690017]  do_idle+0x166/0x1d0
Mar 19 08:08:58 homeagent4 kernel: [127722.690020]  cpu_startup_entry+0x19/0x20
Mar 19 08:08:58 homeagent4 kernel: [127722.690024]  start_kernel+0x466/0x46e
Mar 19 08:08:58 homeagent4 kernel: [127722.690027]  secondary_startup_64+0xa5/0xb0
Mar 19 08:08:58 homeagent4 kernel: [127722.690028] Code: 24 20 4c 8b 4d 60 4c 89 e6 48 c7 c7 a0 7a 12 82 44 89 c1 44 89 c2 41 83 e0 01 c0 e9 07 c0 ea 06 0f b6 c9 83 e2 01 e8 8c 21 8d ff <0f> 0b e9 55 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
Mar 19 08:08:58 homeagent4 kernel: [127722.690057] ---[ end trace 35aabb138b1d57ea ]---

I hope this helps.

cpaasch commented 5 years ago
Mar 19 08:08:58 homeagent4 kernel: [127722.637573] TCP: tcp_done closing ffff88818276ea00 from tcp_ack+0xb63/0x12c0

Interesting! You are coming from tcp_ack. I just fixed an mptcp_sock_def_error_report when coming from mptcp_time_wait. So, there seems to be one more around... I have to see how your box can get in your situation.

Kernel stability has improved a lot with the latest kernel, thanks.

Yes, it has! I have another set of fixes coming soon that should make it even better.

kpeletidis commented 5 years ago

Here is a little context of two more recent warnings (same machine, same kernel).

First one:

Mar 19 12:09:56 homeagent4 kernel: [142180.657861] TCP: tcp_done closing ffff8881baf85c00 from tcp_ack+0xb63/0x12c0
Mar 19 12:10:04 homeagent4 kernel: [142188.054619] TCP: mptcp_fallback_infinite ffff8881b7cf0380 0x910ce999 will fallback - pi 1, src 216.58.206.106:443 dst 192.168.243.17:35185 rcv_nxt 2257741621 from tcp_rcv_state_process+0x24e/0x86d master 1
Mar 19 12:10:04 homeagent4 kernel: [142188.064951] TCP: tcp_done closing ffff8881b7cf0380 from tcp_ack+0xb63/0x12c0
Mar 19 12:10:04 homeagent4 kernel: [142188.070015] TCP: mptcp_fallback_infinite ffff8881b8fdb480 0xa47cd70f will fallback - pi 1, src 216.58.206.106:443 dst 192.168.243.17:52789 rcv_nxt 1784841723 from tcp_rcv_state_process+0x24e/0x86d master 1
Mar 19 12:10:04 homeagent4 kernel: [142188.080158] TCP: tcp_done closing ffff8881b8fdb480 from tcp_ack+0xb63/0x12c0
Mar 19 12:10:20 homeagent4 kernel: [142204.399970] TCP: tcp_done closing ffff8881da031880 from mptcp_close+0x54c/0x5a0
Mar 19 12:10:21 homeagent4 kernel: [142205.948879] TCP: tcp_done closing ffff8881b10f1880 from tcp_write_err+0x3b/0x50
Mar 19 12:12:29 homeagent4 kernel: [142333.771613] TCP: mptcp_fallback_infinite ffff8881ae844a80 0x5c29812b will fallback - pi 3, src 157.240.1.52:443 dst 192.168.246.187:53387 rcv_nxt 2084667492 from tcp_rcv_state_process+0x24e/0x86d master 0
Mar 19 12:12:29 homeagent4 kernel: [142333.782005] TCP: tcp_ack resetting flow
Mar 19 12:12:47 homeagent4 kernel: [142351.354838] Meta ffff8881b10f1880 already closed i_rcv 1 i_snd 1 send_i 0 flags 0x2004303
Mar 19 12:12:47 homeagent4 kernel: [142351.354859] ------------[ cut here ]------------
Mar 19 12:12:47 homeagent4 kernel: [142351.354873] WARNING: CPU: 1 PID: 0 at net/mptcp/mptcp_ctrl.c:647 mptcp_sock_def_error_report+0xff/0x110

And this one:

Mar 19 17:09:40 homeagent4 kernel: [160163.882852] TCP: tcp_ack resetting flow
Mar 19 17:10:07 homeagent4 kernel: [160191.022463] TCP: mptcp_fallback_infinite ffff8881f6715500 0x3952dc1f will fallback - pi 4, src 216.58.206.78:443 dst 192.168.242.223:49045 rcv_nxt 680152616 from tcp_rcv_state_process+0x24e/0x86d master 0
Mar 19 17:10:07 homeagent4 kernel: [160191.032993] TCP: tcp_ack resetting flow
Mar 19 17:10:13 homeagent4 kernel: [160196.928981] TCP: mptcp_fallback_infinite ffff888221da4a80 0x7843dc49 will fallback - pi 6, src 52.213.101.26:443 dst 192.168.243.82:36633 rcv_nxt 1677205542 from tcp_rcv_state_process+0x24e/0x86d master 0
Mar 19 17:10:13 homeagent4 kernel: [160196.939249] TCP: tcp_ack resetting flow
Mar 19 17:10:31 homeagent4 kernel: [160215.560758] TCP: tcp_done closing ffff8881daf19180 from tcp_keepalive_timer+0x209/0x310
Mar 19 17:10:43 homeagent4 kernel: [160226.865877] TCP: mptcp_fallback_infinite ffff8881baa64700 0xa22247f8 will fallback - pi 1, src 78.46.49.23:443 dst 192.168.241.186:36193 rcv_nxt 1419597485 from tcp_rcv_state_process+0x24e/0x86d master 1
Mar 19 17:10:43 homeagent4 kernel: [160226.876156] TCP: tcp_done closing ffff8881baa64700 from mptcp_data_ack+0x9d8/0x9e0
Mar 19 17:10:43 homeagent4 kernel: [160226.883025] TCP: mptcp_fallback_infinite ffff888222d5dc00 0xdaeaa963 will fallback - pi 1, src 31.13.90.6:443 dst 192.168.241.186:59767 rcv_nxt 3743024967 from tcp_rcv_state_process+0x24e/0x86d master 1
Mar 19 17:10:43 homeagent4 kernel: [160227.086559] TCP: tcp_done closing ffff888222d5dc00 from mptcp_sock_def_error_report+0x6b/0x110
Mar 19 17:11:39 homeagent4 kernel: [160283.400217] TCP: tcp_done closing ffff8881d6a60a80 from tcp_keepalive_timer+0x209/0x310
Mar 19 17:11:49 homeagent4 kernel: [160293.639960] Meta ffff8881bcce8380 already closed i_rcv 1 i_snd 1 send_i 0 flags 0x2004303
Mar 19 17:11:49 homeagent4 kernel: [160293.639985] ------------[ cut here ]------------
Mar 19 17:11:49 homeagent4 kernel: [160293.639993] WARNING: CPU: 1 PID: 0 at net/mptcp/mptcp_ctrl.c:647 mptcp_sock_def_error_report+0xff/0x110

Let me know if you need more and I'll keep adding them as I find them.

knxhm commented 5 years ago

I get the same after upgrade to kernel 4.14.110+

Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.393075] Meta already closed i_rcv 1 i_snd 1 send_i 0 flags 0x2004301
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.395577] ------------[ cut here ]------------
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.397512] WARNING: CPU: 3 PID: 0 at net/mptcp/mptcp_ctrl.c:644 mptcp_sock_def_error_report+0x101/0x110
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.401635] Modules linked in: binfmt_misc xt_multiport tun xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_comment xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter snd_pcm snd_timer sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd aesni_intel aes_x86_64 vmw_balloon crypto_simd glue_helper soundcore cryptd evdev serio_raw sg pcspkr shpchp vmwgfx ttm drm_kms_helper drm pcc_cpufreq ac vmw_vsock_vmci_transport vsock button vmw_vmci mptcp_balia ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sr_mod cdrom ata_generic sd_mod ata_piix crc32c_intel psmouse libata vmw_pvscsi vmxnet3 i2c_piix4 scsi_mod floppy
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.420369] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.110+ #2
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.422665] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/03/2018
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.426818] task: ffff888236e5e180 task.stack: ffffc90000cd8000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.428952] RIP: 0010:mptcp_sock_def_error_report+0x101/0x110
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.431055] RSP: 0018:ffff88823fd83e58 EFLAGS: 00010292
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.432995] RAX: 000000000000003c RBX: ffff8882250dc6c0 RCX: 0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.435302] RDX: ffff88823fd9e3b8 RSI: ffff88823fd966d8 RDI: ffff88823fd966d8
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.437576] RBP: ffff888224c5ec80 R08: 000000000000060c R09: 0000000000aaaaaa
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.440054] R10: ffff88823fd83e68 R11: 00000000ffffffff R12: ffff8882102040c0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.442490] R13: ffffffff8159d100 R14: ffff8882250dc6c0 R15: 0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.444792] FS:  0000000000000000(0000) GS:ffff88823fd80000(0000) knlGS:0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.447295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.449221] CR2: 00007f7572c03000 CR3: 000000000200a002 CR4: 00000000001606e0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.451524] Call Trace:
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.452818]  <IRQ>
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.453990]  tcp_write_err+0x2b/0x50
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.455418]  tcp_write_timer_handler+0x16d/0x230
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.457035]  tcp_write_timer+0x98/0xa0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.458497]  call_timer_fn+0x30/0x130
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.459909]  run_timer_softirq+0x3cc/0x420
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.461398]  ? enqueue_hrtimer+0x37/0x80
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.462857]  ? recalibrate_cpu_khz+0x10/0x10
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.464356]  __do_softirq+0xe3/0x2c7
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.465677]  irq_exit+0xbd/0xd0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.466873]  smp_apic_timer_interrupt+0x78/0x130
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.468440]  apic_timer_interrupt+0x7d/0x90
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.469875]  </IRQ>
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.470862] RIP: 0010:native_safe_halt+0x2/0x10
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.472342] RSP: 0018:ffffc90000cdbee0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.474459] RAX: ffffffff8162c5f0 RBX: ffffffff8211cc60 RCX: 0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.476437] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.478389] RBP: 0000000000000003 R08: 0000000000000002 R09: 0000000000000001
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.480338] R10: ffffc90000cdbe88 R11: 0000000000000001 R12: 0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.482264] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.484165]  ? __cpuidle_text_start+0x8/0x8
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.485713]  default_idle+0x1a/0x100
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.486898]  do_idle+0x15f/0x1d0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.488004]  cpu_startup_entry+0x5f/0x70
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.489236]  start_secondary+0x19e/0x1e0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.490454]  secondary_startup_64+0xa5/0xb0
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.491711] Code: a4 41 0f b6 74 24 20 4c 8b 45 40 48 c7 c7 08 82 dc 81 89 f2 89 f1 40 c0 ee 06 c0 ea 07 83 e1 01 83 e6 01 0f b6 d2 e8 8a 41 ae ff <0f> 0b e9 6b ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
Apr 25 08:48:26 PTSMAR2 kernel: [ 7656.496754] ---[ end trace 5eacbda62d93e66d ]---
cpaasch commented 5 years ago

The issue at https://github.com/multipath-tcp/mptcp/issues/352 seems to be the same. I have updated the debug-patch on that issue. I need to find out who is calling tcp_write_err. I am unable at the moment to repro... :-/

If you still hit the issue, can you apply the patch from #352 ?

cpaasch commented 5 years ago

Closing, I'm pretty sure it's a dupe of https://github.com/multipath-tcp/mptcp/issues/352.

matttbe commented 5 years ago

A patch is now in different branches.

@knxhm @kpeletidis do you think it could be possible for you to compile a branch and check if you still have the issue please?

kpeletidis commented 4 years ago

Sorry @matttbe I don't have access to that environment anymore.