sfluor opened 5 years ago
Is this a real-world issue or not?
We could get the size from the kretprobe, but we need the sock args as well, so that's a kprobe + kretprobe. These are frequently called functions, so this is going to cost CPU. If this is a contrived case, I'd rather the docs point it out than have us all pay a tax for something that doesn't matter in the real world. By docs I mean the man page.
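For reference, the pairing would look roughly like this bpftrace sketch (an illustration of the idea only, not the actual tcptop change; the map names are made up):

# bpftrace -e '
k:tcp_sendmsg { @req[tid] = arg2; }
kr:tcp_sendmsg /@req[tid]/ {
    $ret = (int64)retval;                      // cast in case retval is treated as unsigned
    if ($ret > 0) { @actual = sum($ret); }     // bytes the kernel actually accepted
    @requested = sum(@req[tid]);               // bytes the caller asked for
    delete(@req[tid]);
}'

That's two probes firing per call, which is the extra cost being weighed here.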
So you triggered it by doing a 100 Mbyte sendmsg()? Is that what makes it to tcp_sendmsg()?
Here's a Netflix production example showing sizes and return code:
# bpftrace -e 'k:tcp_sendmsg { @bytes = hist(arg2); } kr:tcp_sendmsg { @retval = hist(retval > 0 ? 0 : - retval); }'
Attaching 2 probes...
^C
@bytes:
[4, 8) 9 | |
[8, 16) 47 | |
[16, 32) 1 | |
[32, 64) 4188 |@@@@@@ |
[64, 128) 857 |@ |
[128, 256) 21 | |
[256, 512) 60 | |
[512, 1K) 545 | |
[1K, 2K) 962 |@ |
[2K, 4K) 3025 |@@@@ |
[4K, 8K) 4009 |@@@@@ |
[8K, 16K) 35132 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 650 | |
[32K, 64K) 137 | |
[64K, 128K) 5 | |
@retval:
[0] 46651 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
zero errors, and a maximum size in the 64k - 128k bucket.
I can trigger this, but from what I gather the expected output doesn't make much sense. It's showing the total written, not the total transferred, which I'd hope would reach the requested size if the socket were left open long enough for recovery. It's quite possible for the receiver to be rescheduled out long enough for the timeout to pop or the socket buffer to fill, which is expected, no? If anything it's an argument for packet pacing.
What are you expecting? TCP throughput to match the on-wire transmit per second? How are you going to deal with TSO and TX queues? The best we can do is close enough at the net driver level, but that (say, post-GSO) means tracing every packet. I'd much rather tcptop trace at the TCP level, as the event rate and overhead are much lower.
If you need this from the NIC level, that's a different tool, and one that costs more overhead. What problem is it solving? tcptop(8) is to solve the same problem as (CPU) top(1): oh, I didn't know that PID was sending so much.
Note that @sfluor works at DataDog, so I assume he is cooking up a 24x7 monitoring agent, where understanding overhead and picking the most efficient method is critical. We don't want 24x7 agents burning CPU if they're not really solving a real problem.
Hi, sorry for the late answer:
Is this a real-world issue or not?
We've seen this with consul traffic on our hosts. I got those dumps from tcptop.py, and from tcptop.py slightly modified to use tcp_transmit_skb:
$ cat dump_with_tcp_transmit_skb | grep :8301
9168 consul 172.22.121.168:37672 172.22.138.93:8301 1413 131
0 0 172.22.121.168:37672 172.22.138.93:8301 0 1255
7144 consul 172.22.121.168:37672 172.22.138.93:8301 0 26
24705 consul 172.22.121.168:37672 172.22.138.93:8301 0 0
24286 consul 172.22.121.168:37672 172.22.138.93:8301 0 0
7144 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 1413 26
0 0 ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 970
7126 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 417
7161 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 0
0 0 172.22.121.168:41314 172.22.85.52:8301 0 1071
7266 consul 172.22.121.168:41314 172.22.85.52:8301 0 316
7138 consul 172.22.121.168:41314 172.22.85.52:8301 0 26
24286 consul 172.22.121.168:41314 172.22.85.52:8301 0 0
7266 consul 172.22.121.168:41314 172.22.85.52:8301 1413 0
$ cat dump_with_tcp_sendmsg | grep :8301
9168 consul 172.22.121.168:37672 172.22.138.93:8301 0 5646
7144 consul 172.22.121.168:37672 172.22.138.93:8301 0 2528
9168 consul 172.22.121.168:37672 172.22.138.93:8301 1413 0
7126 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 4245
7144 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 1413 2530
7266 consul 172.22.121.168:41314 172.22.85.52:8301 1413 4244
7138 consul 172.22.121.168:41314 172.22.85.52:8301 0 2530
7161 consul 172.22.121.168:43820 172.22.132.206:8301 0 5999
7276 consul 172.22.121.168:43820 172.22.132.206:8301 1407 2530
You can see that in one case the result for one exchange with tcp_transmit_skb is (recv: 1.4Mb, sent: 1.4Mb), whereas with tcp_sendmsg it's (recv: 1.4Mb, sent: ~7Mb).
So you triggered it by doing a 100 Mbyte sendmsg()? Is that what makes it to tcp_sendmsg()?
tcp_sendmsg receives 100Mb as the size parameter; however, the return value (how many bytes were actually sent) can be different from that.
I also tried to run bpftrace on my local VM with a similar command to yours and got:
vagrant@ubuntu1904:~$ sudo bpftrace -e 'k:tcp_sendmsg { @bytes = hist(arg2); } kr:tcp_sendmsg { @return = hist(retval); }'
Attaching 2 probes...
^C
@bytes:
[32, 64) 15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 0 | |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 0 | |
[2M, 4M) 0 | |
[4M, 8M) 0 | |
[8M, 16M) 0 | |
[16M, 32M) 0 | |
[32M, 64M) 0 | |
[64M, 128M) 2 |@@@@@@ |
@return:
[32, 64) 14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 0 | |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 0 | |
[2M, 4M) 0 | |
[4M, 8M) 1 |@@@ |
[8M, 16M) 0 | |
[16M, 32M) 1 |@@@ |
and
vagrant@ubuntu1904:~$ nc -l 3000 > /dev/null & python3 script.py 100000
[1] 9779
Going to send 102400000 bytes
Sent 5860689 bytes (96539311 lost)
It seems that no error is returned; however, the bytes for arg2 (size) and the return value of tcp_sendmsg do not match.
Could this mean that tcp_sendmsg only sent part of the bytes it was asked to send? (That would line up with the value returned by sock.sendmsg in script.py.)
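One way to check this would be to histogram the per-call shortfall, e.g. with a bpftrace sketch like the following (map and variable names are just illustrative):

# bpftrace -e '
k:tcp_sendmsg { @req[tid] = arg2; }
kr:tcp_sendmsg /@req[tid]/ {
    $want = @req[tid];
    $ret  = (int64)retval;       // cast in case retval is treated as unsigned
    // a successful call that accepted fewer bytes than requested is a partial send
    if ($ret > 0 && $ret < $want) {
        @shortfall = hist($want - $ret);
    }
    delete(@req[tid]);
}'

A non-empty @shortfall histogram would point at short writes rather than errors.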
@brendangregg Not sure if that was directed at me or @sfluor? I don't disagree with you; the test was a bit ambiguous.
tcptop(8) is to solve the same problem as (CPU) top(1)
That, however, could be a bit misleading. It would be more like sockettop (if such a thing existed). Though top(1) does not provide insight into metrics like scheduling delays, nor any indication as to what the usage actually is; sched_yield(), for example.
I can understand why someone would be confused by the results. I was able to reproduce the loss specifically using default socket buffers.
We don't want 24x7 agents burning CPU if it's not really solving a real problem.
Fights between departments that all want their own monitoring on hosts: that hits home.
I can't trace tcp_transmit_skb(), only __tcp_transmit_skb(). But it should be the same thing.
Just looking at the stacks, I see at least one problem:
# bpftrace -e 'k:tcp_sendmsg,k:__tcp_transmit_skb { @[comm, kstack] = count(); }'
Attaching 2 probes...
[...]
@[irq/155-iwlwifi,
__tcp_transmit_skb+1
tcp_send_ack+28
__tcp_ack_snd_check+92
tcp_rcv_established+1249
tcp_v4_do_rcv+144
tcp_v4_rcv+2423
ip_local_deliver_finish+98
ip_local_deliver+111
ip_rcv_finish+297
ip_rcv+655
__netif_receive_skb_core+1074
__netif_receive_skb+24
netif_receive_skb_internal+69
napi_gro_receive+197
ieee80211_deliver_skb+200
ieee80211_rx_handlers+5376
ieee80211_prepare_and_rx_handle+1865
ieee80211_rx_napi+914
iwl_mvm_rx_rx_mpdu+1205
iwl_mvm_rx+77
iwl_pcie_rx_handle+571
iwl_pcie_irq_handler+1577
irq_thread_fn+38
irq_thread+325
kthread+289
ret_from_fork+53
]: 22
@[Chrome_IOThread,
__tcp_transmit_skb+1
__tcp_push_pending_frames+53
tcp_push+219
tcp_sendmsg_locked+730
tcp_sendmsg+44
inet_sendmsg+46
sock_sendmsg+62
SYSC_sendto+319
sys_sendto+14
do_syscall_64+115
entry_SYSCALL_64_after_hwframe+61
]: 35
@[Chrome_IOThread,
tcp_sendmsg+1
sock_sendmsg+62
SYSC_sendto+319
sys_sendto+14
do_syscall_64+115
entry_SYSCALL_64_after_hwframe+61
]: 45
__tcp_transmit_skb() can happen in IRQ context: so the on-CPU PID and comm are unreliable. tcp_sendmsg() is close to the syscall, and should still be the right context.
If you have a heavy CPU workload, you might find the on-CPU PID for __tcp_transmit_skb() seems correct, as the right PID is on CPU by chance.
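A quick way to see the misattribution is to aggregate both probes by the on-CPU pid and comm, e.g.:

# bpftrace -e 'k:tcp_sendmsg { @sendmsg[pid, comm] = count(); } k:__tcp_transmit_skb { @xmit[pid, comm] = count(); }'

At tcp_sendmsg() the pid/comm are the sending process; at __tcp_transmit_skb() you may also see irq threads, swapper, and whatever else happened to be on-CPU.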
There's a number of ways to fix this, but ugh. What we actually want is TCP transmit tracepoints, so we have a stable and reliable way to do this. Plus raw tracepoints are a lot faster than kretprobes, so monitoring is less of a concern (which is why Facebook did the raw tracepoint stuff).
I have some new networking stuff in #1506, but it's not out yet (it will be soon!); it might provide other ways to solve this.
At the very least, the man page should say that tcptop is showing the requested transmit bytes, which may overcount in cases of failures. Maybe we'll have to add more probes later (until tracepoints), but I just don't want to go straight to adding extra overhead for every transmit packet without thinking about the options first.
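As a cheap first check, something like this one-liner (a sketch; the cast guards against retval being treated as unsigned) would show whether failed tcp_sendmsg() calls are even occurring on a given host, and with which error codes:

# bpftrace -e 'kr:tcp_sendmsg { $ret = (int64)retval; if ($ret < 0) { @errs[comm, - $ret] = count(); } }'

If @errs stays empty, any overcounting is coming from partial sends rather than outright failures.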
Hi, I noticed that for the tcptop.py tool, binding to the tcp_sendmsg probe to count the bytes can sometimes result in miscounting when tcp_sendmsg fails (this can occur when it returns the EAGAIN error code, for instance; it seems that this can happen here). In this case the tcp_sendmsg probe is triggered and the size parameter passed is used to increment the count of bytes sent, without checking for an error.

Some solutions would be to:

- use a kretprobe on tcp_sendmsg to check the returned value (# of bytes sent OR an error), but my understanding is that this can lead to missing tcp_sendmsg returns, since there is a limit on the number of kretprobes running at the same time, set by the max_active parameter.
- use tcp_transmit_skb, which has the # of bytes to be sent inside the sk_buff parameter. This would also help capture bytes sent via TCP retransmits. However, it seems that tcp_transmit_skb does not have reliable access to the PID, so this might require removing the pid from the tuple? (A sketch of this approach follows below.)

Do you think there is a better way to fix this issue?
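For illustration, a minimal bpftrace sketch of that second option (hypothetical file name xmit_bytes.bt; it probes __tcp_transmit_skb since tcp_transmit_skb itself may not be traceable, and the #include is only needed on kernels without BTF):

// xmit_bytes.bt: per-skb transmit sizes, including retransmits
#include <linux/skbuff.h>

k:__tcp_transmit_skb
{
    // arg1 is the sk_buff being transmitted; retransmitted skbs also pass through here
    @xmit_bytes = hist(((struct sk_buff *)arg1)->len);
    // no per-PID key: this can fire in IRQ context, where pid/comm may be unreliable
}

Run with: bpftrace xmit_bytes.bt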
Reproducing the issue

The following was done on a Vagrant VM (hashicorp-vagrant/ubuntu-16.04).

I was able to reproduce it by setting the so_sndtimeo parameter to a really low value and trying to send a "big" payload. Running the following command (which will open a local TCP connection and try to send 100Mb of 1s to it) should trigger this issue:

nc -l 3000 & python3 script.py 100000

Where script.py is:

Output of nc -l 3000 & python3 script.py 100000:

Output of sudo python tcptop.py -C:

Expected: