sfluor opened 5 years ago
Is this a real-world issue or not?
We could get the size from the kretprobe, but we need the sock args as well, so that's a kprobe + kretprobe. These are frequently called functions, so this is going to cost CPU. If this is a contrived case, I'd rather the docs point it out than have us all pay a tax for something that doesn't matter in the real world. By docs I mean the man page.
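For reference, the pairing would look roughly like this bpftrace sketch (an illustration of the idea only, not the actual tcptop change; the map names are made up):

# bpftrace -e '
k:tcp_sendmsg { @req[tid] = arg2; }
kr:tcp_sendmsg /@req[tid]/ {
    $ret = (int64)retval;                      // cast in case retval is treated as unsigned
    if ($ret > 0) { @actual = sum($ret); }     // bytes the kernel actually accepted
    @requested = sum(@req[tid]);               // bytes the caller asked for
    delete(@req[tid]);
}'

That's two probes firing per call, which is the extra cost being weighed here.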
So you triggered it by doing a 100 Mbyte sendmsg()? Is that what makes it to tcp_sendmsg()?
Here's a Netflix production example showing sizes and return code:
# bpftrace -e 'k:tcp_sendmsg { @bytes = hist(arg2); } kr:tcp_sendmsg { @retval = hist(retval > 0 ? 0 : - retval); }'
Attaching 2 probes...
^C
@bytes:
[4, 8) 9 | |
[8, 16) 47 | |
[16, 32) 1 | |
[32, 64) 4188 |@@@@@@ |
[64, 128) 857 |@ |
[128, 256) 21 | |
[256, 512) 60 | |
[512, 1K) 545 | |
[1K, 2K) 962 |@ |
[2K, 4K) 3025 |@@@@ |
[4K, 8K) 4009 |@@@@@ |
[8K, 16K) 35132 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 650 | |
[32K, 64K) 137 | |
[64K, 128K) 5 | |
@retval:
[0] 46651 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
zero errors, and a maximum size in the 64k - 128k bucket.
I can trigger this, but from what I gather the expected output doesn't make much sense. It's showing the total written, not the total transferred, which I'd hope would reach the requested size if the socket were left open long enough for recovery. It's quite possible for the receiver to be rescheduled out long enough for the timeout to pop or the socket buffer to fill, which is expected, no? If anything it's an argument for packet pacing.
What are you expecting? TCP throughput to match the on-wire transmit per second? How are you going to deal with TSO and TX queues? The best we can do is close enough at the net driver level, but that (say, post-GSO) means tracing every packet. I'd much rather tcptop trace at the TCP level, as the event rate and overhead are much lower.
If you need this from the NIC level, that's a different tool, and one that costs more overhead. What problem is it solving? tcptop(8) is to solve the same problem as (CPU) top(1): oh, I didn't know that PID was sending so much.
Note that @sfluor works at DataDog, so I assume he is cooking up a 24x7 monitoring agent, where understanding overhead and picking the most efficient method is critical. We don't want 24x7 agents burning CPU if they're not really solving a real problem.
Hi, sorry for the late answer:
Is this a real-world issue or not?
We've seen this with consul traffic on our hosts. I got those dumps from tcptop.py, and from tcptop.py slightly modified to use tcp_transmit_skb:
$ cat dump_with_tcp_transmit_skb | grep :8301
9168 consul 172.22.121.168:37672 172.22.138.93:8301 1413 131
0 0 172.22.121.168:37672 172.22.138.93:8301 0 1255
7144 consul 172.22.121.168:37672 172.22.138.93:8301 0 26
24705 consul 172.22.121.168:37672 172.22.138.93:8301 0 0
24286 consul 172.22.121.168:37672 172.22.138.93:8301 0 0
7144 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 1413 26
0 0 ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 970
7126 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 417
7161 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 0
0 0 172.22.121.168:41314 172.22.85.52:8301 0 1071
7266 consul 172.22.121.168:41314 172.22.85.52:8301 0 316
7138 consul 172.22.121.168:41314 172.22.85.52:8301 0 26
24286 consul 172.22.121.168:41314 172.22.85.52:8301 0 0
7266 consul 172.22.121.168:41314 172.22.85.52:8301 1413 0
$ cat dump_with_tcp_sendmsg | grep :8301
9168 consul 172.22.121.168:37672 172.22.138.93:8301 0 5646
7144 consul 172.22.121.168:37672 172.22.138.93:8301 0 2528
9168 consul 172.22.121.168:37672 172.22.138.93:8301 1413 0
7126 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 0 4245
7144 consul ::ffff:172.22.121.168:8301 ::ffff:172.22.24.93:37252 1413 2530
7266 consul 172.22.121.168:41314 172.22.85.52:8301 1413 4244
7138 consul 172.22.121.168:41314 172.22.85.52:8301 0 2530
7161 consul 172.22.121.168:43820 172.22.132.206:8301 0 5999
7276 consul 172.22.121.168:43820 172.22.132.206:8301 1407 2530
You can see that in one case the result for one exchange with tcp_transmit_skb is (recv: 1.4Mb, sent: 1.4Mb), whereas with tcp_sendmsg it's (recv: 1.4Mb, sent: ~7Mb).
So you triggered it by doing a 100 Mbyte sendmsg()? Is that what makes it to tcp_sendmsg()?
tcp_sendmsg receives 100Mb as the size parameter; however, the return value (how many bytes were actually sent) can be different from that.
I also tried to run bpftrace on my local VM with a similar command to yours and got:
vagrant@ubuntu1904:~$ sudo bpftrace -e 'k:tcp_sendmsg { @bytes = hist(arg2); } kr:tcp_sendmsg { @return = hist(retval); }'
Attaching 2 probes...
^C
@bytes:
[32, 64) 15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 0 | |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 0 | |
[2M, 4M) 0 | |
[4M, 8M) 0 | |
[8M, 16M) 0 | |
[16M, 32M) 0 | |
[32M, 64M) 0 | |
[64M, 128M) 2 |@@@@@@ |
@return:
[32, 64) 14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 0 | |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 0 | |
[2M, 4M) 0 | |
[4M, 8M) 1 |@@@ |
[8M, 16M) 0 | |
[16M, 32M) 1 |@@@ |
and
vagrant@ubuntu1904:~$ nc -l 3000 > /dev/null & python3 script.py 100000
[1] 9779
Going to send 102400000 bytes
Sent 5860689 bytes (96539311 lost)
It seems that no error is returned; however, the bytes for arg2 (size) and the return value of tcp_sendmsg do not match.
Could this mean that tcp_sendmsg only sent part of the bytes it was asked to send? (That would line up with the value returned by sock.sendmsg in script.py.)
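One way to check this would be to histogram the per-call shortfall, e.g. with a bpftrace sketch like the following (map and variable names are just illustrative):

# bpftrace -e '
k:tcp_sendmsg { @req[tid] = arg2; }
kr:tcp_sendmsg /@req[tid]/ {
    $want = @req[tid];
    $ret  = (int64)retval;       // cast in case retval is treated as unsigned
    // a successful call that accepted fewer bytes than requested is a partial send
    if ($ret > 0 && $ret < $want) {
        @shortfall = hist($want - $ret);
    }
    delete(@req[tid]);
}'

A non-empty @shortfall histogram would point at short writes rather than errors.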
@brendangregg Not sure if that was directed at me or @sfluor? I don't disagree with you; the test was a bit ambiguous.
tcptop(8) is to solve the same problem as (CPU) top(1)
That, however, could be a bit misleading. It would be more like sockettop (if such a thing existed). Though top(1) does not provide insight into metrics like scheduling delays, nor any indication as to what the usage actually is; sched_yield(), for example.
I can understand why someone would be confused by the results. I was able to reproduce the loss specifically using default socket buffers.
We don't want 24x7 agents burning CPU if it's not really solving a real problem.
Fights between departments that all want their own monitoring on hosts: that hits home.
I can't trace tcp_transmit_skb(), only __tcp_transmit_skb(). But it should be the same thing.
Just looking at the stacks, I see at least one problem:
# bpftrace -e 'k:tcp_sendmsg,k:__tcp_transmit_skb { @[comm, kstack] = count(); }'
Attaching 2 probes...
[...]
@[irq/155-iwlwifi,
__tcp_transmit_skb+1
tcp_send_ack+28
__tcp_ack_snd_check+92
tcp_rcv_established+1249
tcp_v4_do_rcv+144
tcp_v4_rcv+2423
ip_local_deliver_finish+98
ip_local_deliver+111
ip_rcv_finish+297
ip_rcv+655
__netif_receive_skb_core+1074
__netif_receive_skb+24
netif_receive_skb_internal+69
napi_gro_receive+197
ieee80211_deliver_skb+200
ieee80211_rx_handlers+5376
ieee80211_prepare_and_rx_handle+1865
ieee80211_rx_napi+914
iwl_mvm_rx_rx_mpdu+1205
iwl_mvm_rx+77
iwl_pcie_rx_handle+571
iwl_pcie_irq_handler+1577
irq_thread_fn+38
irq_thread+325
kthread+289
ret_from_fork+53
]: 22
@[Chrome_IOThread,
__tcp_transmit_skb+1
__tcp_push_pending_frames+53
tcp_push+219
tcp_sendmsg_locked+730
tcp_sendmsg+44
inet_sendmsg+46
sock_sendmsg+62
SYSC_sendto+319
sys_sendto+14
do_syscall_64+115
entry_SYSCALL_64_after_hwframe+61
]: 35
@[Chrome_IOThread,
tcp_sendmsg+1
sock_sendmsg+62
SYSC_sendto+319
sys_sendto+14
do_syscall_64+115
entry_SYSCALL_64_after_hwframe+61
]: 45
__tcp_transmit_skb() can happen in IRQ context: so the on-CPU PID and comm are unreliable. tcp_sendmsg() is close to the syscall, and should still be the right context.
If you have a heavy CPU workload, you might find the on-CPU PID for __tcp_transmit_skb() seems correct, as the right PID is on CPU by chance.
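A quick way to see the misattribution is to aggregate both probes by the on-CPU pid and comm, e.g.:

# bpftrace -e 'k:tcp_sendmsg { @sendmsg[pid, comm] = count(); } k:__tcp_transmit_skb { @xmit[pid, comm] = count(); }'

At tcp_sendmsg() the pid/comm are the sending process; at __tcp_transmit_skb() you may also see irq threads, swapper, and whatever else happened to be on-CPU.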
There's a number of ways to fix this, but ugh. What we actually want is TCP transmit tracepoints, so we have a stable and reliable way to do this. Plus raw tracepoints are a lot faster than kretprobes, so monitoring is less of a concern (which is why Facebook did the raw tracepoint stuff).
I have some new networking stuff in #1506, but it's not out yet (it will be soon!); it might provide other ways to solve this.
At the very least, the man page should say that tcptop is showing the requested transmit bytes, which may overcount in cases of failures. Maybe we'll have to add more probes later (until tracepoints), but I just don't want to go straight to adding extra overhead for every transmit packet without thinking about the options first.
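As a cheap first check, something like this one-liner (a sketch; the cast guards against retval being treated as unsigned) would show whether failed tcp_sendmsg() calls are even occurring on a given host, and with which error codes:

# bpftrace -e 'kr:tcp_sendmsg { $ret = (int64)retval; if ($ret < 0) { @errs[comm, - $ret] = count(); } }'

If @errs stays empty, any overcounting is coming from partial sends rather than outright failures.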
Hi, I noticed that for the tcptop.py tool, binding to the tcp_sendmsg probe to count the bytes can sometimes result in miscounting when tcp_sendmsg fails (this can occur when it returns the EAGAIN error code, for instance; it seems that this can happen here). In this case the tcp_sendmsg probe is triggered and the size parameter passed is used to increment the count of bytes sent, without checking for an error.

Some solutions would be to:

- use a kretprobe on tcp_sendmsg to check the returned value (# of bytes sent OR an error), but my understanding is that this can lead to missing tcp_sendmsg returns, since there is a limit on the number of kretprobes running at the same time, set by the max_active parameter.
- use tcp_transmit_skb, which has the # of bytes to be sent inside the sk_buff parameter. This would also help capture bytes sent via TCP retransmits. However, it seems that tcp_transmit_skb does not have reliable access to the PID, so this might require removing the pid from the tuple? (A sketch of this approach follows below.)

Do you think there is a better way to fix this issue?
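For illustration, a minimal bpftrace sketch of that second option (hypothetical file name xmit_bytes.bt; it probes __tcp_transmit_skb since tcp_transmit_skb itself may not be traceable, and the #include is only needed on kernels without BTF):

// xmit_bytes.bt: per-skb transmit sizes, including retransmits
#include <linux/skbuff.h>

k:__tcp_transmit_skb
{
    // arg1 is the sk_buff being transmitted; retransmitted skbs also pass through here
    @xmit_bytes = hist(((struct sk_buff *)arg1)->len);
    // no per-PID key: this can fire in IRQ context, where pid/comm may be unreliable
}

Run with: bpftrace xmit_bytes.bt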
Reproducing the issue

The following was done on a Vagrant VM (hashicorp-vagrant/ubuntu-16.04).

I was able to reproduce it by setting the so_sndtimeo parameter to a really low value and trying to send a "big" payload. Running the following command (which will open a local TCP connection and try to send 100Mb of 1s to it) should trigger this issue:

nc -l 3000 & python3 script.py 100000

Where script.py is:

Output of nc -l 3000 & python3 script.py 100000:

Output of sudo python tcptop.py -C:

Expected: