imsnif / bandwhich

Terminal bandwidth utilization tool
MIT License

Failing to report unbound UDP traffic #81

Closed (avimar closed 4 years ago)

avimar commented 4 years ago

I'm not seeing much traffic (basically just ssh and sshd) despite knowing there's a ton of traffic.

iftop shows me over 20 streams open, but they are all UDP -- FreeSWITCH VoIP streams.

They are listed in lsof -i

Is this a bug or by design? I saw no mention in the docs or issues of tcp vs udp.

Thanks!

imsnif commented 4 years ago

Oh dear - it should definitely support udp! Could you tell us a bit more about your use case, platform and perhaps an easy way to reproduce this?

avimar commented 4 years ago

I downloaded the binary

Ran on:

lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.11 (stretch)
Release:        9.11
Codename:       stretch

I'm using FreeSWITCH running lots of traffic.

Not sure of an easy way to test... Google says netcat?

zhangxp1998 commented 4 years ago

I managed to reproduce this:

  1. Use mosh to connect to any remote host
  2. Run htop or top (or any program that will constantly update the screen) on the remote host
  3. done

[Screenshots: Xnip2020-01-06_10-28-16, Xnip2020-01-06_10-29-24]

iStats menu shows that mosh-client is using some bandwidth, but bandwhich reports nothing.

imsnif commented 4 years ago

Hmm, with netcat I see UDP traffic. I'm doing:

netcat -u -l -p 9999 # start a udp server

And then:

netcat -u localhost 9999 < /dev/random # send some traffic to the server

And I see traffic in bandwhich and the process is correctly identified as netcat.

I wonder what's different in these use cases? (Will keep investigating if nothing comes up, in any case).

imsnif commented 4 years ago

One idea: @zhangxp1998 - does this happen to you when you use your patched libpnet branch?

zhangxp1998 commented 4 years ago

@imsnif Yes this still happens

zhangxp1998 commented 4 years ago

OK, found why: we do capture UDP traffic, but lsof on Mac does not tell you which remote IP/port a UDP socket is connected to (technically UDP has no notion of a "connection"). So bandwhich just discards all UDP utilization info.

zhangxp1998 commented 4 years ago

Further investigation: this regex doesn't even pick up UDP lines: https://github.com/imsnif/bandwhich/blob/5fdf236668c4dd3b087e0d6e95db5a8c42b7d438/src/os/lsof_utils.rs#L19

[Screenshot: Xnip2020-01-06_10-46-08]

Since we only use lsof + regex parsing on Mac, this bug should not be present on Linux. On Linux it's probably another bug?

imsnif commented 4 years ago

AFAIK a UDP datagram does have a source and a destination port though, which is what we need. The lsof bug is not good and we should definitely fix it - but @avimar is on linux, so I suspect this particular problem is elsewhere.

zhangxp1998 commented 4 years ago

> AFAIK a UDP datagram does have a source and a destination port though, which is what we need. The lsof bug is not good and we should definitely fix it - but @avimar is on linux, so I suspect this particular problem is elsewhere.

The datagram does have a source/destination port, but the output from lsof does not. When bandwhich cannot determine which process a packet belongs to, it simply discards that packet.

imsnif commented 4 years ago

The output of my lsof for the netcat test is this:

netcat  295769 aram    3u  IPv4 4212351      0t0  UDP 10.0.0.17:36265->10.0.0.17:9999 

Is it different on a mac?

zhangxp1998 commented 4 years ago

On Mac it looks like

mosh-clie 41688     user    4u  IPv4 0xeb179a665505435d      0t0    UDP *:50651

50651 is the local port. On Mac no remote port is displayed, which is the problem...

EDIT: Sometimes on Linux, /proc/net/udp doesn't show remote port either:

  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
17658: 00000000:9506 00000000:0000 07 00000000:00000000 00:00000000 00000000   180        0 71820 2 0000000000000000 0

(rem_address being 0)
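
For anyone following along, here is a minimal Python sketch of how a line like the one above can be decoded. It is not bandwhich's parser. The function names are made up for illustration, it handles IPv4 only, and it assumes a little-endian host (the kernel prints the address as a raw 32-bit value, so the bytes appear reversed on x86).

```python
import socket
import struct


def parse_proc_net_addr(field):
    """Decode one 'HEXADDR:HEXPORT' field from /proc/net/udp (IPv4 only)."""
    hex_ip, hex_port = field.split(":")
    # Assumption: little-endian host, so the printed integer has to be
    # packed back as a little-endian u32 to recover the address bytes.
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def parse_proc_net_udp_line(line):
    """Return the local/remote sockets and inode of one /proc/net/udp row."""
    fields = line.split()
    local = parse_proc_net_addr(fields[1])
    remote = parse_proc_net_addr(fields[2])
    return {
        "local": local,
        "remote": remote,
        "inode": int(fields[9]),
        # An all-zero rem_address means the kernel has no peer recorded.
        "has_remote": remote != ("0.0.0.0", 0),
    }


SAMPLE = (
    "17658: 00000000:9506 00000000:0000 07 00000000:00000000 "
    "00:00000000 00000000   180        0 71820 2 0000000000000000 0"
)
entry = parse_proc_net_udp_line(SAMPLE)
print(entry["local"], entry["has_remote"])
```

For the sample row, the local port 0x9506 decodes to 38150 and there is no remote socket to match against.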

imsnif commented 4 years ago

Hmm. For outgoing packets, this can make sense. How about if we toss all of those packets into "Unknown TCP" and "Unknown UDP"? We'll show them like that in the Processes pane, skip them in the Connections pane and show them in the Remote Address pane. What do you think?
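
A rough Python sketch of the "Unknown TCP" / "Unknown UDP" idea, only to illustrate the bookkeeping. The function name, the tuple layout, and the sample data are hypothetical and are not taken from bandwhich.

```python
from collections import defaultdict


def bandwidth_by_process(packets, port_owners):
    """Sum bytes per process; unmatched packets go to an 'Unknown' bucket.

    packets: iterable of (protocol, local_port, size_in_bytes)
    port_owners: dict mapping (protocol, local_port) -> process name
    """
    totals = defaultdict(int)
    for protocol, local_port, size in packets:
        owner = port_owners.get((protocol, local_port))
        if owner is None:
            # Instead of dropping the packet, keep it under a catch-all name.
            owner = "Unknown " + protocol.upper()
        totals[owner] += size
    return dict(totals)


# Hypothetical sample data: port 50651/udp has no known owner.
owners = {("udp", 9999): "netcat", ("tcp", 22): "sshd"}
seen = [("udp", 9999, 100), ("udp", 50651, 40), ("tcp", 22, 60), ("udp", 50651, 10)]
print(bandwidth_by_process(seen, owners))
```

The point is only that unmatched traffic stays visible in the totals instead of silently disappearing.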

zhangxp1998 commented 4 years ago

> Hmm. For outgoing packets, this can make sense. How about if we toss all of those packets into "Unknown TCP" and "Unknown UDP"? We'll show them like that in the Processes pane, skip them in the Connections pane and show them in the Remote Address pane. What do you think?

Hmmm, we should be able to determine which process these packets belong to. For example, we could use the triple <interface, protocol(TCP/UDP), local_port[0-65535]> instead of <remote_port, remote_ip, protocol> as the key in the connections_to_procs hash map.

EDIT: Technically, using <remote_port, remote_ip, protocol> for process identification is incorrect, as there can be two different processes connecting to the same service. For example both Safari and Chrome can connect to www.google.com:80

imsnif commented 4 years ago

I'm not sure I understand. Right now, we do this:

pub struct Socket {
    pub ip: Ipv4Addr,
    pub port: u16,
}

// ...

pub struct Connection {
    pub remote_socket: Socket,
    pub protocol: Protocol,
    pub local_port: u16,
}

Could you help me understand what you're suggesting?

zhangxp1998 commented 4 years ago

Sure. Right now we have a HashMap like this: https://github.com/imsnif/bandwhich/blob/5fdf236668c4dd3b087e0d6e95db5a8c42b7d438/src/display/ui_state.rs#L55

Whenever we receive a packet, we look up this connections_to_procs HashMap to determine which process this packet belongs to, so we can calculate how much bandwidth a process is using.

Currently, the key to this hash map is connection, which is essentially a tuple:

<remote_ip, remote_port, protocol, local_port>

The problem with this approach: the connections_to_procs HashMap is constructed from the output of lsof (or /proc/net/(tcp/udp)). When constructing this hash map, we might not have information about remote_ip or remote_port (as is the case for UDP). I suggest that we use the tuple

<interface(en0, eth0, etc), local_port, protocol>

as key in connections_to_procs.

Why? When reading /proc/net/(udp/tcp) or lsof output, we always know which local port a process is using, which protocol it is, and which interface it is on.
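
To make the proposal concrete, here is a small Python sketch of a lookup table keyed by <interface, local_port, protocol>. It is an illustration under assumptions, not the actual Rust change: LocalSocket, build_socket_table and process_for_packet are invented names, and the sample rows are hypothetical.

```python
from collections import namedtuple

# Hypothetical key type: everything here is known on the local side,
# so it can be filled in even when no remote ip/port is reported.
LocalSocket = namedtuple("LocalSocket", ["interface", "local_port", "protocol"])


def build_socket_table(open_sockets):
    """open_sockets: iterable of (interface, local_port, protocol, process)."""
    table = {}
    for interface, local_port, protocol, process in open_sockets:
        table[LocalSocket(interface, local_port, protocol)] = process
    return table


def process_for_packet(table, interface, local_port, protocol):
    """Return the owning process name, or None if the socket is unknown."""
    return table.get(LocalSocket(interface, local_port, protocol))


# Hypothetical rows, as they might be derived from lsof or /proc/net/*.
table = build_socket_table([
    ("en0", 50651, "udp", "mosh-client"),
    ("en0", 22, "tcp", "sshd"),
])
print(process_for_packet(table, "en0", 50651, "udp"))
```

Note that 22/tcp and a hypothetical 22/udp would be different keys, as would the same port on en0 and en1.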

Alcaro commented 4 years ago

The kernel only shows a connected destination if the program called connect(). If it didn't, it can instead specify a destination for each packet, using sendto().

Here's a C program that talks UDP with 1.1.1.1 and 8.8.8.8, calling connect() for only one of the sockets. It looks like this in lsof on Linux (I don't know about macOS and other Unix-likes):

a.out 18303 alcaro 3u IPv4 558532 0t0 UDP *:54852
a.out 18303 alcaro 4u IPv4 558533 0t0 UDP stacked:47363->dns.google:domain

(It's also possible to talk to both servers on the same socket, but it acts weirdly if you didn't call connect(). I suspect the kernel discards incoming packets from wrong source if you connect().)

I can't offer any solutions, but perhaps this can help you understand the problem better.
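
The connect() versus sendto() difference can also be seen from Python, without the C program. This is a minimal sketch on loopback; udp_peer is a helper name made up here. It asks the kernel for each socket's peer, which is presumably the same information lsof and /proc draw on.

```python
import socket


def udp_peer(sock):
    """Return the peer address the kernel has recorded, or None if there is none."""
    try:
        return sock.getpeername()
    except OSError:
        # An unconnected UDP socket has no peer, so the call fails.
        return None


# A local receiver, so nothing leaves the machine.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
dest = receiver.getsockname()

# Socket 1: connect() first, then send() without naming a destination.
connected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
connected.connect(dest)
connected.send(b"hello")

# Socket 2: never connect(), name the destination on every sendto().
unconnected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
unconnected.sendto(b"hello", dest)

connected_peer = udp_peer(connected)
unconnected_peer = udp_peer(unconnected)
print(connected_peer, unconnected_peer)

for s in (receiver, connected, unconnected):
    s.close()
```

Both sockets send the same datagram to the same place, but only the first one has a remote address the kernel can report.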

imsnif commented 4 years ago

@zhangxp1998 - Right - you said "remote port" instead of "local port" before and that's what threw me off. :) Thanks for clearing this up. We indeed match the process by its local port, which afaik is a safe approach since processes should not bind to the same local port with the same protocol.

So, if I understand both you and @Alcaro - the piece we're missing here is the local port. We have it in the packet, but sometimes we don't have it in lsof or /proc. Right?

Alcaro commented 4 years ago

> I suggest that we use tuple <interface(en0, eth0, etc), local_port, protocol>

> which afaik is a safe approach since processes should not bind to the same local port with the same protocol

I wouldn't recommend that - servers tend to have many connections with iface=eth0 local=80 proto=tcp. Use the remote port/address as well if available; if not, skip it and hope for the best. It'll still give weird results if a UDP server doesn't connect(), but it's right for all TCP, all connect()ors, and all clients. Not perfect, but pretty close.

Even for clients, I also suspect you can get same local port on different connections if you have multiple network interfaces (wifi+ethernet), dual-stack (aka ipv4+ipv6), and of course port 12345/tcp vs 12345/udp, but I didn't test that.

(fork() can also lead to having the same fd open in two processes, attributing Chrome's traffic to another Chrome. But it'll get the process name and target address right, which should be good enough for users.)

Or, even better, tell pcap to report PID for each packet. That'll not only fix disobedient UDP servers, but also raw sockets, like ping and traceroute. If pcap can't do that, BPF probably can.

ebroto commented 4 years ago

Just to add my 2c, what about using the file descriptor to uniquely identify a "connection" (including UDP)? If I'm not mistaken that should be unique.

zhangxp1998 commented 4 years ago

> I wouldn't recommend that - servers tend to have many connections with iface=eth0 local=80 proto=tcp

In that case, all these connections will have local port 80, but possibly with different remote IPs/ports. Using <local port, interface, protocol> to identify the process is correct, because there can't be two processes listening on TCP port 80 (if you have two HTTP servers on the same machine, they must be listening on different ports/interfaces). There can be multiple connections going into port 80. We are not trying to identify which connection a packet belongs to, we are simply trying to identify which process sent the packet.

> Even for clients, I also suspect you can get same local port on different connections if you have multiple network interfaces (wifi+ethernet), dual-stack (aka ipv4+ipv6), and of course port 12345/tcp vs 12345/udp, but I didn't test that.

Yes you can. That's why I suggested <interface(en0, eth0, etc), local_port, protocol> instead of <local_port, protocol>

Alcaro commented 4 years ago

> what about using the file descriptor to uniquely identify a "connection" (including UDP)?

An interesting idea, but fds do, to my knowledge, not have system-unique identifiers accessible to userspace. (They have kernelspace addresses, but kernel doesn't like revealing that. Wouldn't make much sense for a 32bit x86 process on a 64bit kernel, anyways.)

Process ID would work, but judging by this thread, pcap can't return that.

> We are not trying to identify which connection a packet belongs to, we are simply trying to identify which process sent the packet.

I somehow thought we needed to match the packet to a specific line in lsof, but you're right, all relevant matches will be owned by the same program, and that's all we need. Remote address is unnecessary, I got confused, sorry about that.

There can be two processes listening on 80/tcp with fork() or SO_REUSEADDR, but worst case is identifying apache2's traffic as another apache2, no real harm. Grouping every apache2 together is probably the desired behavior, anyways.

> <interface(en0, eth0, etc), local_port, protocol>

May want to split protocol to transport layer protocol (tcp/udp) and network layer (ipv4/ipv6). It wouldn't surprise me if some systems refuse to assign the same local port between ipv4/v6, but it would surprise me if all systems do.

I believe this is what you meant, but it's easy to misread.

zhangxp1998 commented 4 years ago

> May want to split protocol to transport layer protocol (tcp/udp) and network layer (ipv4/ipv6). It wouldn't surprise me if some systems refuse to assign the same local port between ipv4/v6, but it would surprise me if all systems do.

Hmmm, the network layer has no concept of ports... IIRC, when you bind to TCP port 80, you own TCP port 80, regardless of whether IPv4 or IPv6 is used. OK, let me try and see what happens.

[Screenshot: Xnip2020-01-06_14-10-40]

On my machine, binding to the same port twice using different network layer protocols fails. It would surprise me if some system allows such behavior. Different OSI layers should act independently: when I listen on TCP port 80, I want all TCP traffic to port 80 to go to me, regardless of the network layer protocol used by the sender.

If we are concerned, we could try to use IPv4/v6 when matching, since lsof reveals that information. Since this is a bug, let me draft a fix first; we can refine it later.

Alcaro commented 4 years ago

> On my machine, binding to the same port twice using different network layer protocol fails.

I can reproduce your results if I try to bind to any overlapping subset, such as v4 127.0.0.1 (localhost) and v6 :: (any address).

However, non-overlapping combinations work fine, for example v4 and v6 localhost (127.0.0.1, ::1)

>>> import socket; s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(('127.0.0.1', 12345)); s.listen()
>>> s2 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM); s2.bind(('::1', 12345)); s2.listen()
>>> 

Conclusion: Correct matching must include local address.

Next question: Is correctness necessary? Probably not for servers, but I believe the randomized client-side addresses can overlap between v4 and v6. I don't know for sure if they do, but I'd rather include at least a v4/v6 flag, just to be safe. (With a fallback to IPv6 lookup if v4 fails, so UDP servers listening to :: get correctly matched. So many special cases...)
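
One way to read the fallback idea, as a Python sketch: try the exact local address first, then the wildcard of the same family, then (for IPv4 packets) the IPv6 wildcard. The table layout, the find_process name, and the sample process names are all assumptions for illustration. Whether a socket bound to :: really receives IPv4 traffic depends on the system's dual-stack settings.

```python
import ipaddress


def find_process(table, protocol, local_ip, local_port):
    """Look up the owner of (protocol, local_ip, local_port) with wildcard fallbacks.

    table: dict mapping (protocol, bound_address, local_port) -> process name
    """
    addr = ipaddress.ip_address(local_ip)
    # 1. exact address, 2. wildcard of the same address family
    candidates = [str(addr), "0.0.0.0" if addr.version == 4 else "::"]
    if addr.version == 4:
        # 3. assumption: a dual-stack socket bound to :: may own IPv4 traffic
        candidates.append("::")
    for candidate in candidates:
        process = table.get((protocol, candidate, local_port))
        if process is not None:
            return process
    return None


# Hypothetical table mirroring the two-listener experiment above.
table = {
    ("tcp", "127.0.0.1", 12345): "server-v4",
    ("tcp", "::1", 12345): "server-v6",
    ("udp", "::", 5060): "voip-server",
}
print(find_process(table, "tcp", "127.0.0.1", 12345))
print(find_process(table, "udp", "10.0.0.17", 5060))
```

With the local address in the key, the two listeners on port 12345 stay distinct, and the UDP server bound to :: is still found for an IPv4 packet.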

ebroto commented 4 years ago

Just a counter-example: [screenshot]

ebroto commented 4 years ago

@Alcaro for the file descriptor topic, we have it in the output of lsof, at least in Linux: [screenshot]. Here it would be 4 for the TCP connection.

EDIT: forget about it, seems to be per process

zhangxp1998 commented 4 years ago

Fixed by PR #82

avimar commented 4 years ago

Merged in release 0.8.0.

Seems to work fine!