Open ferrieux opened 2 years ago
Is there any reason to do special handling with VLANs ?
Yes.
The reason is "that annoying piece of software called the Linux kernel does special handling with VLANs, so that the VLAN tag gets extracted into skbuff metadata and, instead of reinserting the tags back into the packet before handing it to the socket filter and SOCK_PACKET
receive code, so that the filter is applied to the packet as it appeared on the wire, it's applied to the filter with the VLAN tag extracted, so the filter code has to do weird special stuff".
Why not just copy the ethertype into the first two bytes of the SLL2 header and stick all the rest in the payload ?
Because the packet should appear as a VLAN packet, with the appropriate value in the Ethertype, i.e. 0x8100.
Ok, I understand why the kernel sets the tag aside ; what I'm wondering about is why not hide that complexity from the SLLv[12] observers, by shuffling a few bytes around so that the frame is sent to them as it appeared on the wire ?
Ok, I understand why the kernel sets the tag aside ; what I'm wondering about is why not hide that complexity from the SLLv[12] observers, by shuffling a few bytes around so that the frame is sent to them as it appeared on the wire ?
To which observers are you referring?
We can't do that for live capture filters except by submitting a kernel patch.
We can do that once we receive a packet, before delivering it to the callback, and that's what our code is attempting to do. There's probably a bug in the code that does that for SLL2.
Yes, I was referring to the callback side :)
Note that for the single-VLAN case, the bug amounts to a mere permutation, no information is lost (I verified this by fixing it in post-processing of SLL2 output). But for the multiple-VLAN/QinQ case, things get murky, and the second VLAN is lost (and was in SLL1 too). Is there a deep reason, or is this fixable too ?
Note that for the single-VLAN case, the bug amounts to a mere permutation, no information is lost (I verified this by fixing it in post-processing of SLL2 output).
The SLL2 code may still have some bits of SLL code in it, putting things in the wrong places.
But for the multiple-VLAN/QinQ case, things get murky, and the second VLAN is lost (and was in SLL1 too). Is there a deep reason, or is this fixable too ?
The "any" device uses PF_PACKET
/SOCK_DGRAM
sockets; those strip off whatever is set up in the skbuff as a "MAC header", which probably ends up removing everything before the network-layer header, including VLAN tags.
If it used PF_PACKET
/SOCK_RAW
, some supplied packets might have Ethernet headers, some might have 802.11 headers if capturing in monitor mode, and some might have other link-layer headers. That'd make filtering more complicated, and pcap files would require another new link-layer header type, as the ARPHRD_
value would have to be supplied in a pseudo-header so that programs processing the packets would know the link-layer type of each packet.
With pcapng, the capture could include multiple Interface Description Blocks, with each packet having an interface ID referring to the interface on which they arrived, which would remove the "new link-layer type" problem. It would still require more complicated filtering when doing live captures, with BPF programs having to, for example, fetch the link-layer type before doing any link-layer checks such as MAC address checks, and either having to calculate the offset of the network-layer header from that or fetch it if supplied as metadata detachable with BPF (which I think it is). For pcapng files, there could be separate user-mode filter programs, one per interface (with an optimization being to allow interfaces sharing the same link-layer type to share the program).
The current filter compiler probably can't be easily change to generate those "multiple link-layer" programs; something that compiles the filter into an abstract form that's independent of the link-layer type, and that can then either turn that into a "multiple-link-layer program" or into a program for a specific link-layer type, might make it easier.
Thanks for giving the perspective for "any".
However, when capturing a single device like "eth1", the whole frame is passed untouched, and indeed any BPF filter needing to "dig deep" must do so explicitly, without relying on some magical L2-stripping. For example, if we create eth1.24.68 with nested tags 24 and 68, an udp filter on the parent interface eth1 must look like "vlan 24 and vlan 68 and udp", while "udp" won't work.
Similarly, we routinely filter across fixed-depth MPLS stacks, or even through other non-pcapfilter-supported encapsulations like GTP, by using hand-computed offsets and "ether[OFFSET]&MASK=VALUE". The generated BPF bytecode of JIT assembler works just as fast, it is only slightly uglier as a syntax :)
My point is that it is perfectly acceptable for a single interface, so the "tricky BPF" argument doesn't hold there; is it different for "any" ?
However, when capturing a single device like "eth1", the whole frame is passed untouched
It's passed untouched from libpcap to the code using libpcap.
If it's a VLAN-tagged frame, however, it is not passed untouched from the kernel to libpcap; the VLAN tagged are stripped, even when capturing on a PF_PACKET
/SOCK_RAW
socket, and 1) the in-kernel filtering code has to be different in order to test whether the frame is a VLAN frame or to test its VLAN tag, and in order to look at fields past the VLAN tag using the correct order. libpcap then has to reconstruct the "untouched" frame given the VLAN tag metadata.
However, when capturing a single device like "eth1", the whole frame is passed untouched, and indeed any BPF filter needing to "dig deep" must do so explicitly, without relying on some magical L2-stripping.
The "any" device is, currently, best thought of as a supplier of network-layer traffic, not link-layer traffic. You don't get the full link-layer header, so any filters at a lower-level aren't meaningful - there's nothing to "dig deep" through.
Similarly, we routinely filter across fixed-depth MPLS stacks, or even through other non-pcapfilter-supported encapsulations
$ sudo tcpdump -i ens33 -d mpls and udp
(000) ldh [12]
(001) jeq #0x8847 jt 2 jf 19
(002) ldb [16]
(003) and #0x1
(004) jeq #0x1 jt 5 jf 19
(005) ldb [18]
(006) and #0xf0
(007) jeq #0x60 jt 8 jf 13
(008) ldb [24]
(009) jeq #0x11 jt 18 jf 10
(010) jeq #0x2c jt 11 jf 19
(011) ldb [58]
(012) jeq #0x11 jt 18 jf 19
(013) ldb [18]
(014) and #0xf0
(015) jeq #0x40 jt 16 jf 19
(016) ldb [27]
(017) jeq #0x11 jt 18 jf 19
(018) ret #262144
(019) ret #0
like GTP
Support for which should probably be added to libpcap for filtering (there might even be a PR for it somewhere).
My point is that it is perfectly acceptable for a single interface, so the "tricky BPF" argument doesn't hold there; is it different for "any" ?
Because a single interface has a single link-layer type, whereas the "any" device provides packets from all network interfaces, and there's no guarantee that there is a single-link-layer type for all of the interfaces - and, even if it's true at the time the capture starts, it's not guaranteed to remain true. If, for example, a PPP session is initiated, a new PPP interface may pop up, and it's not guaranteed to have a link-layer type of "Ethernet" (Linux not being Windows :-)).
That's why the "any" device is currently a supplier of "network-layer" traffic. Changing it to use PF_PACKET
/SOCK_RAW
and provide link-layer traffic, would involve all the items mentioned above, including the tricky BPF code to handle, in one filter program, multiple different link layers, with the link layer presumably being fetched with a special load from a packet offset of SKF_AD_OFF + SKF_AD_PROTOCOL
.
And, in that case, udp
wouldn't capture VLAN-encapsulated UDP packets on the raw-mode "any" device - you'd need vlan and udp
. For backwards compatibility, we might have to implement that as an "anyraw" device - or have it run in cooked mode when using the current pcap-oriented capture and reading APIs and raw mode when using some future pcapng-oriented capture and reading APIs.
I can't see anything in our current version of libpcap that would cause VLAN tag insertion for SLL2 at all; in activate_pf_packet()
, we set handlep->vlan_offset
to -1 for all link-layer types other than DLT_EN10MB
and DLT_LINUX_SLL
, and if it's -1, we don't insert VLAN tags.
Perhaps the version of libpcap you're using has an "improvement" to insert tags for SLL2, written by somebody who didn't understand the code well enough to realize that, unfortunately, the code was originally written for Ethernet (where the type field is at the end of the link-layer header, so you just move the last 2 bytes of the link-layer header down 4 bytes and drop in the VLAN header), and modified for SLL2 (where the type field is again at the end of the link-layer header), and therefore that, to handle SLL2 (where the type field is at the beginning of the link-layer header), you have to change the code to do the insertion differently.
What version of what Linux distribution are you using? Perhaps they inserted a bug with an "improvement" such as that.
Thanks, indeed seeing "any" as a provider of network-layer traffic is the key. Somehow that was not obvious from the name. Then, you're right, an "anyraw" would cover all the needs of multi-interface, encapsulation-agnostic sniffers. Note that such a tool doesn't need to pretend there is a common link encap , as this could simply be a per-packet field:
struct anyraw_header {
uint16_t direction:1;
uint16_t itfnum:15;
uint16_t dlt; /* most often DLT_EN10MB ; possibly DLT_RAW for L3 devices */
char frame[];
};
What version of what Linux distribution are you using? Perhaps they inserted a bug with an "improvement" such as that.
We are observing this on current Debian stable (11.3):
ii tcpdump 4.99.0-2 amd64 command-line network traffic analyzer
ii libpcap0.8:amd64 1.10.0-2 amd64 system interface for user-level packet capture
We are observing this on current Debian stable (11.3):
That's... odd, as I downloaded the libpcap source package's set of Debian files and couldn't find any patch that would cause it to insert VLAN tags for SLL2. I'll check again, and ignore the names of the patches, and also download the "unmodified source tarball" file.
I'll check again, and ignore the names of the patches, and also download the "unmodified source tarball" file.
Either I've really missed something, or there's nothing in the Debian source package to cause it to try to insert VLAN tags for SLL2.
Could you try it with the current 1.10.1 release (and tcpdump 4.99.1 release, built with that release) from www.tcpdump.org?
some debian specific kernel patch?
Just done so, same results. Reproducing steps:
$ ip link add link wlan0 wlan0.24 type vlan id 24
$ ifconfig wlan0.24 1.0.24.1/24
$ ping -n 1.0.24.3 > /dev/null 2>&1 &
$ tcpdump.4.99.1 -nn -i any -Q out not tcp and not udp
tcpdump.4.99.1: data link type LINUX_SLL2
tcpdump.4.99.1: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:14:02.771986 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
16:14:02.771993 wlan0 Out ethertype IPv4, IP0 (invalid) #<==== THE ANNOYING ONES
16:14:03.795923 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
16:14:03.795931 wlan0 Out ethertype IPv4, IP0 (invalid) #<==== THE ANNOYING ONES
16:14:04.819719 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
16:14:04.819724 wlan0 Out ethertype IPv4, IP0 (invalid) #<==== THE ANNOYING ONES
^C
Note tcpdump.4.99.1 is statically linked with just-compiled libpcap.a:
$ ldd `which tcpdump.4.99.1`
linux-vdso.so.1 (0x00007ffec0711000)
libdbus-1.so.3 => /lib/x86_64-linux-gnu/libdbus-1.so.3 (0x00007eff06245000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007eff06080000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007eff0605e000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x00007eff05fb6000)
/lib64/ld-linux-x86-64.so.2 (0x00007eff0658a000)
libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x00007eff05fab000)
Kernel details:
ii linux-image-5.10.0-10-amd64 5.10.84-1 amd64 Linux 5.10 for 64-bit PCs (signed)
$ uname -a
Linux hpalex 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08) x86_64 GNU/Linux
Unfortunately, I couldn't set up a VLAN on the bridge adapter that appears to implement communication between my VMware Fusion virtual machines and my Mac, so I couldn't test it myself.
Please apply the attached patch to pcap-linux.c in the libpcap 1.10.1 source, recompile and rebuild tcpdump with the new library, and see what it prints to the standard error.
DLT_LINUX_SLL: set vlan_offset to 14
tcpdump.4.99.1: data link type LINUX_SLL2
tcpdump.4.99.1: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
Re your setup: please note that you don't need a working end-to-end VLAN; just like I did with ping
eliciting only ARP requests without a single response, you only need to set up the VLAN within the Linux VM. That's enough for tcpdump -i any -Q out
please note that you don't need a working end-to-end VLAN; just like I did with ping eliciting only ARP requests without a single response, you only need to set up the VLAN within the Linux VM.
Thanks! That worked - with tcpdump built with a libpcap prior to 4bfca3682e5aeabe05b4406daf00c9abcc36c571, it got bogus packets on the VLAN interface, but with a tcpdump built with a libpcap after 4bfca3682e5aeabe05b4406daf00c9abcc36c571, it got normal packets.
That fix should be in whatever release is the next libpcap release.
OK, in the git log I see you've just committed this, so I cloned and compiled HEAD of both libpcap and tcpdump.
However the test still fails the same way.
Now looking at the commit, I wonder about your set_vlan_offset
function: it seems to take care of SLLv1 but not SLLv2.
Is this intended ?
static void
set_vlan_offset(pcap_t *handle)
{
...
switch (handle->linktype) {
case DLT_EN10MB:
...
case DLT_LINUX_SLL:
....
default:
...
}
OK, in the git log I see you've just committed this, so I cloned and compiled HEAD of both libpcap and tcpdump. However the test still fails the same way.
OK, try this patch to pcap-linux.c, which adds the same debugging printouts that the previous patch did, and which helped me find the problem in the first place:
Now looking at the commit, I wonder about your set_vlan_offset function: it seems to take care of SLLv1 but not SLLv2. Is this intended ?
Yes.
To handle SLL2 would require significant changes to the tag inserting code in pcap_handle_packet_mmap()
; for SLL2, instead of just moving backwards, by 4 bytes, everything but the last 2 bytes of the link-layer header, and inserting the 4-byte tag in the opened-up space - which works for Ethernet and SLL, as the Ethertype is in the last 2 bytes of the link-layer header - it would have to:
so that you have:
+---------------------------+
| VLAN TPID |
| (2 Octets) |
+---------------------------+
| Reserved (MBZ) |
| (2 Octets) |
+---------------------------+
| Interface index |
| (4 Octets) |
+---------------------------+
| ARPHRD_ type |
| (2 Octets) |
+---------------------------+
| Packet type |
| (1 Octet) |
+---------------------------+
| Link-layer address length |
| (1 Octets) |
+---------------------------+
| Link-layer address |
| (8 Octets) |
+---------------------------+
| VLAN TCI |
| (2 Octets) |
+---------------------------+
| Protocol |
| (2 Octets) |
+---------------------------+
| Payload |
. .
. .
. .
Then make sure that all SLL2 dissectors support that. (It looks as if tcpdump's SLL2 dissector will. I haven't checked Wireshark, but if it assumes that, after the 0x8100, you have a TCI followed by the payload's Ethertype, it'll work.)
However, given that if you have QinQ, the second tag will, I expect, not be provided by the kernel, a case could be made that SLL2 and SLL shouldn't provide tags. The SLL handling of VLAN tags came from this bug. My first fix was 4cc089d1423abe0c7a99e80485eed31f30309792, which caused tags not to be inserted for SLL; I then committed 3f53c50e7d78efe0a74289fb406d749f740577c0, which "[Used] Jakub Zawadzki's changes for that, but put the VLAN tag offset into the pcap structure and [calculated] it at activation time." In retrospect, if additional tags aren't provided on cooked-mode sockets, perhaps they shouldn't be inserted on cooked-mode captures, even for SLL.
Stderr from the patch:
DLT_LINUX_SLL: set vlan_offset to 14
LINUX_SLL2: set vlan_offset to -1
tcpdump.HEAD.patch: data link type LINUX_SLL2
tcpdump.HEAD.patch: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
One side note: in case you want to try QinQ, the same "VM-side only" trick as before does the job:
ip link add link wlan0.24 wlan0.24.68 type vlan id 68
ifconfig wlan0.24.68 1.0.68.1/24
ping -n 1.0.68.3 > /dev/null 2>&1 &
In this case, each packet should come out thrice: once for wlan0
, once for wlan0.24
, and once for wlan0.24.68
.
The latter is already Ok since both tags have already been decapsulated. The other two need fixing in order to get at least valid packets.
Stderr from the patch:
...
LINUX_SLL2: set vlan_offset to -1
OK, so that means that, after tcpdump changes the link-layer type to DLT_LINUX_SLL2
, it sets the vlan_offset value to -1, meaning "don't insert tags.
tcpdump.HEAD.patch: data link type LINUX_SLL2 tcpdump.HEAD.patch: verbose output suppressed, use -v[v]... for full protocol decode listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
So it's not reporting any messages that indicate that it's inserting VLAN tags. Are you sure it's still changing the packet contents? If so, could you attach a capturing using that code?
Here you go:
contents:
10:20:10.749051 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
10:20:10.749058 wlan0 Out ethertype IPv4, IP0 (invalid)
10:20:11.773037 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
10:20:11.773042 wlan0 Out ethertype IPv4, IP0 (invalid)
10:20:12.796985 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
10:20:12.796993 wlan0 Out ethertype IPv4, IP0 (invalid)
Looking into the bytes, I see that the patch did change the layout, so there is no longer any tag insertion, but still the packets on the parent interface (even-numbered packets above) are invalid, because the "protocol" field is set to 0x8100, while the rest of the header and frame are identical (except for interface number of course) to those on the subinterface (odd-numbered packets above):
#layout for valid packet #1 to subinterface wlan0.24
0806 0000 00000005 0001040600bb6036ad620000 000108000604000100bb6036ad620100180100000000000001001803
^^^^ ^^^^^^^^
#layout for invalid packet #2 to parent interface wlan0
8100 0000 00000003 0001040600bb6036ad620000 000108000604000100bb6036ad620100180100000000000001001803
So what's missing is just the actual reinsertion of the tag (0018) +protocol (0806) : to become valid, that packet should be:
#intended layout for packet #2 to parent interface wlan0
8100 0000 00000003 0001040600bb6036ad620000 00180806 000108000604000100bb6036ad620100180100000000000001001803
In today's episode of "I hate the Linux networking stack", I set up a Debian 11.3.0 VM, with a 5.10.0-13-amd64 kernel, built top-of-the-main branch libpcap and tcpdump, set up two layers of VLAN interface (boring packet statistics removed):
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.135.200 netmask 255.255.255.0 broadcast 172.16.135.255
inet6 fe80::20c:29ff:fec2:84cf prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:c2:84:cf txqueuelen 1000 (Ethernet)
ens33.10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.0.1 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::20c:29ff:fec2:84cf prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:c2:84:cf txqueuelen 1000 (Ethernet)
ens33.10.68: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 1.0.68.1 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::20c:29ff:fec2:84cf prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:c2:84:cf txqueuelen 1000 (Ethernet)
started up 4 tcpdumps:
$ sudo ./tcpdump -i any -w /tmp/deb113-any-sll2.pcap
$ sudo ./tcpdump -i any -y LINUX_SLL -w /tmp/deb113-any-sll.pcap
$ sudo ./tcpdump -i ens33 -w /tmp/deb113-ens33.pcap
$ sudo ./tcpdump -i ens33.10.68 -w /tmp/deb113-ens33.10.68.pcap
and then ran ping 10.68.1.2
for a few failed pings.
According to Wireshark's reading of all four of those files, at around 2022-04-11 22:19.00.273149:
In the SLL2 "any" capture, I see:
In the SLL "any" capture, I see The. Exact. Same. Thing. One ICMPv6 Router Solicitation, and what looks like mangled versions of the same thing - the Protocol field is obviously in a different location in the SLL header, and the SLL header has no interface ID, so the only difference is that the two mangled packets have 0x8100 rather than 0x86dd in the Protocol field.
In the ens33.10.68 capture - which is raw, not cooked, so I get the "full" link-layer header, albeit without any VLAN tags as they were removed in the reception process - I get the ICMPv4 Router Solicitation, at 2022-04-11 22:19:00.273150.
In the ens33 capture - again, raw, not cooked, and the VLAN tags should all be there, as that's not a VLAN interface - I get the ICMPv6 Router Solicitation, with two layers of VLAN tag, one for the ID 10 VLAN and one for the ID 68 VLAN, at 2022-04-11 22:19:00.273157.
My brain wasn't fully engaged, so I didn't do an ens33.10 capture, but I will bet a silk pajama that I would have gotten an ICMPv6 Router Solicitation, with one layer of VLAN tag, for the ID 68 VLAN, at about that time, in such a capture.
So, somehow, the 0x86ddness of that packet is getting lost somewhere between the adapter driver and tcpdump, whether it's in the networking stack below the PF_PACKET
code, the PF_PACKET
code, or libpcap. Once more unto the breach....
So, somehow, the 0x86ddness of that packet is getting lost somewhere between the adapter driver and tcpdump, whether it's in the networking stack below the
PF_PACKET
code, thePF_PACKET
code, or libpcap.
A quick experiment found that, for at least one packet received on the "any" device, the protocol supplied in the recvfrom address and the protocol in the VLAN auxiliary data are both 0x8100. Either there's an 0x86dd, or some other real Ethertype value corresponding to IPv4 or ARP or IPv6 or... hidden somewhere that a PF_PACKET
/SOCK_DGRAM
socket reader can get it, or that information is gone and, well, VLANs and cooked captures do not work together.
Two comments about VLAN+cooked not working together:
(1) in all tcpdump+libpcap prior to the Debian11 (sorry I tend to use packaged versions), "-i any" worked with a single tag, producing the (SLLv1 equivalent of the) layout I wrote above as "intended layout". So it looks like the info is there.
(2) digging a bit in the current code of libpcap, I notice the use of the PACKET_AUXDATA sockopt is #ifdef'ed out, because the configure script somehow fails to set HAVE_PACKET_AUXDATA. I am really ignorant in autoconf, so this is just an impression, but my gut feeling is that there is not even a test for it, in the .ac
input. Would this account for the missing info ?
digging a bit in the current code of libpcap, I notice the use of the PACKET_AUXDATA sockopt is #ifdef'ed out, because the configure script somehow fails to set HAVE_PACKET_AUXDATA. I am really ignorant in autoconf, so this is just an impression, but my gut feeling is that there is not even a test for it, in the .ac input. Would this account for the missing info ?
Prior to f4a26d7d25ac19e547c2f5b6ca4fa60355fb3654, pcap-linux.c did
/*
* On at least some Linux distributions (for example, Red Hat 5.2),
* there's no <netpacket/packet.h> file, but PF_PACKET is defined if
* you include <sys/socket.h>, but <linux/if_packet.h> doesn't define
* any of the PF_PACKET stuff such as "struct sockaddr_ll" or any of
* the PACKET_xxx stuff.
*
* So we check whether PACKET_HOST is defined, and assume that we have
* PF_PACKET sockets only if it is defined.
*/
# ifdef PACKET_HOST
# define HAVE_PF_PACKET_SOCKETS
# ifdef PACKET_AUXDATA
# define HAVE_PACKET_AUXDATA
# endif /* PACKET_AUXDATA */
# endif /* PACKET_HOST */
That change removed support for SOCK_PACKET
sockets, so it removed the test for whether we have PF_PACKET
sockets, as the code now doesn't work if you don't have them.
The
# ifdef PACKET_AUXDATA
# define HAVE_PACKET_AUXDATA
# endif /* PACKET_AUXDATA */
stuff was also removed, but it shouldn't have been removed.
The PACKET_AUXDATA
stuff was added in e59abf81d3de639e389393989c7823ac8a68f663; I'm not sure why their change added HAVE_PACKET_AUXDATA
rather than just testing whether PACKET_AUXDATA
is defined.
We no longer support kernels prior to 2.6.27; the 2.6.27 kernel defined PACKET_AUXDATA
, so just removing the tests for it is the correct fix.
digging a bit in the current code of libpcap, I notice the use of the PACKET_AUXDATA sockopt is #ifdef'ed out, because the configure script somehow fails to set HAVE_PACKET_AUXDATA. I am really ignorant in autoconf, so this is just an impression, but my gut feeling is that there is not even a test for it, in the .ac input. Would this account for the missing info ?
Prior to f4a26d7, pcap-linux.c did
/* * On at least some Linux distributions (for example, Red Hat 5.2), * there's no <netpacket/packet.h> file, but PF_PACKET is defined if * you include <sys/socket.h>, but <linux/if_packet.h> doesn't define * any of the PF_PACKET stuff such as "struct sockaddr_ll" or any of * the PACKET_xxx stuff. * * So we check whether PACKET_HOST is defined, and assume that we have * PF_PACKET sockets only if it is defined. */ # ifdef PACKET_HOST # define HAVE_PF_PACKET_SOCKETS # ifdef PACKET_AUXDATA # define HAVE_PACKET_AUXDATA # endif /* PACKET_AUXDATA */ # endif /* PACKET_HOST */
That change removed support for
SOCK_PACKET
sockets, so it removed the test for whether we havePF_PACKET
sockets, as the code now doesn't work if you don't have them.The
# ifdef PACKET_AUXDATA # define HAVE_PACKET_AUXDATA # endif /* PACKET_AUXDATA */
stuff was also removed, but it shouldn't have been removed.
The
PACKET_AUXDATA
stuff was added in e59abf8; I'm not sure why their change addedHAVE_PACKET_AUXDATA
rather than just testing whetherPACKET_AUXDATA
is defined.We no longer support kernels prior to 2.6.27; the 2.6.27 kernel defined
PACKET_AUXDATA
, so just removing the tests for it is the correct fix.
However, it appears that the memory-mapped receive code path in the kernel doesn't test whether PACKET_AUXDATA
was set, so this probably won't change anything - and, arguably, the PACKET_AUXDATA
can just be removed. I guess it was there for the benefit of code using regular socket receives to read the data, so that code ignorant of the auxiliary data doesn't have to worry about freeing it, or something such as that; code using memory-mapped access that isn't interested in the auxiliary data doesn't have to do anything special, and has no need to control whether it's provided or not.
OK, thanks for the clarification, indeed I was reasoning with the naive model of recv()-based captures :) Now back to our initial problem, what is your intention:
OK, so I've attached a version of pcap-linux.c that attempts to do tag insertion for LINUX_SLL2.
It doesn't do anything about QinQ, but that's because the combination of QinQ and PF_PACKET
/SOCK_DGRAM
appears to be inherently broken. (An image comes to mind, in which the cover of this book is modified, replacing the bunny with Tux and replacing the title with a phrase ending with "again". But I digress....)
(".txt" appended to "pcap-linux.c" to work around GitHub not realizing that C source code files are text.)
(Note: the statement made by the modified book title is a statement about Tux, acting as a proxy for the Linux networking stack, not about any participants in this discussion. Apologies if any participants thought the statement in question was referring to them.
No apologies to the Linux networking stack; the PF_PACKET
code, and the other parts of the Linux networking stack that feed it, have been a persistent source of problems for libpcap, resulting in the addition of a number of ugly workarounds. Yeah, maybe the BPF mechanism is a separate mechanism rather than an integral part of the networking stack, but something to which a driver hands an as-yet-unmodified copy of the link-layer packet, before the stack gets to mangle it in ways that libpcap, tcpdump, and Wireshark have had to work around, seems Really Refreshingly Nice to me.)
Same as before:
$ tcpdump.HEAD.patch2 -nn -i any -Q out not tcp and not udp
tcpdump.HEAD.patch2: data link type LINUX_SLL2
tcpdump.HEAD.patch2: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
21:05:58.809685 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
21:05:58.809687 wlan0 Out ethertype IPv4, IP0 (invalid)
21:05:59.833505 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
21:05:59.833513 wlan0 Out ethertype IPv4, IP0 (invalid)
21:06:00.857250 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
21:06:00.857257 wlan0 Out ethertype IPv4, IP0 (invalid)
Layout of the invalid packets:
Missing VLAN+Type here-----------------------v
8100 0000 00000003 0001040600bb6036ad620000 **** 000108000604000100bb6036ad620100180100000000000001001803
Adding a few printf's here and there, I believe this is because tp_vlan_tci_valid
ends up being always 0 on entry to pcap_handle_packet_mmap
, which falsifies the test
if (tp_vlan_tci_valid && handlep->etype_offset != -1 &&
tp_snaplen >= (unsigned int)handlep->etype_offset + 2)
{
uint16_t etype, vlan_tpid, vlan_tci;
/*
* Add VLAN tag information.
Assuming you tried and succeeded, there must be something special with my wlan0 interface, right ?
For both the TPACKET_V2 and TPACKET_V3 calls, tp_vlan_tci_valid
is 1 if either:
tp_vlan_tci
field of the VLAN tag information provided by the kernel is non-zero;TP_STATUS_VLAN_VALID
bit is set in the tp_status
field of the tpacket header.If neither of them are set, maybe the kernel is just not providing the VLAN information - but that shouldn't differ between SLL and SLL2, as the kernel has no idea whether libpcap is providing SLL or SLL2 headers to its caller.
So is tp_vlan_tci_valid
1 when capturing with SLL and 0 when capturing with SLL2?
Or is it 0 in both cases? If so, is this working with SLL?
It is 0 in both cases, and I do get invalid packets with SLL too on my machine with wlan0 (a laptop with a wifi interface). Note that I also have invalid packets with the Debian-packaged tcpdump. The only place where "-y LINUX_SLL" works with VLANs is a server with only Ethernet interfaces. Will dig a bit.
It is 0 in both cases, and I do get invalid packets with SLL too on my machine with wlan0 (a laptop with a wifi interface). Note that I also have invalid packets with the Debian-packaged tcpdump. The only place where "-y LINUX_SLL" works with VLANs is a server with only Ethernet interfaces.
The sk_buff
structure in the kernel has a vlan_present
flag, indicating that there are VLAN TPID and TCI values in the sk_buff
.
The routine that sets that flag is called __vlan_hwaccel_put_tag()
, and the comment for it says "hardware accelerated VLAN inserting", so that might only be used for adapters that do VLAN processing.
There are several drivers in drivers/net/ethernet that call __vlan_hwaccel_put_tag()
; there are no drivers in drivers/net/wireless that do. Perhaps WiFi adapters, unlike Ethernet adapters, don't tend to do VLAN processing, leaving that up to the host software.
skb_vlan_tag_present()
returns the value of that flag; the PF_PACKET
code sets TP_STATUS_VLAN_VALID
in the tp_status
field of the tpacket header only if that flag is set.
So there is a VLAN tag to insert into the packet data only if the driver sets the tag with __vlan_hwaccel_put_tag()
, and no WiFi drivers appear to do that.
The sll_protocol
field of the struct sockaddr_ll
provided in the tpacket header by the TPACKET code is set to the value of the protocol
field of the sk_buff
. That field is copied to the protocol fields of the SLL and SLL2 pseudo-headers.
For a PF_PACKET
/SOCK_DGRAM
header, the only sk_buff
data provided to the user is the data past the "MAC header".
If the networking stack 1) sets the protocol
field of the sk_buff
for non-hardware-processed VLAN packets to 0x8100 rather than to the underling protocol after you get through all the VLAN tags and 2) includes the VLAN tags as part of the "MAC header" for the sk_buff
, then hilarity ensues - you get a packet that claims to be a VLAN packet but that doesn't have a VLAN tag at the beginning. This Would Not Be Good.
Makes sense. To confirm the theory, I did two experiments:
eth0
, for which ethtool -k
reports vlan offloading is active. In this case, your patch works correctly, and is able to reconstruct a full MAC+VLAN+IP frame for the parent interface.ethtool -K
. In this case, I'm back to the behavior of wlan0
, namely invalid frames.Now (thanks to you) I understand the kernel's insistence to "set the VLAN aside" is directly connected with hardware offloading. What's not entirely clear as yet, is whether your "hilarious" scenario is real or not; that would be a Freaking Kernel Bug, right ?
Ahem, I am afraid you are right, through indirect proof.
To investigate this, I went back to basics: recvmsg()
on a AF_PACKET+SOCK_DGRAM+PACKET_AUXDATA
socket.
(My assumption here, please correct me if I'm wrong, is that the old recvmsg()
API, while it's no longer used by libpcap, is functionally equivalent to the faster, more modern mmap()
API. Hence I expect it to give the exact same information for a given packet, showing the totality of what the kernel exposes.)
As earlier, I set up a VLAN (35=0x23) on eth0
, which is amenable to VLAN offloading or not by ethtool
, and pinged through the VLAN subinterface eth0.35
, looking for the outgoing ARP.
I extracted the auxdata (.msg_control
) and address (.msg_name
aka sll struct) and dumped them below, alongside the packet body.
Each packet is seen twice: once on VLAN subinterface 0c
, and once on parent interface 02
.
[.msg_control=.........................TCI.TPID] < .msg_name=kernel_sll_struct (.flags..) PACKET
WITH VLAN OFFLOADING:
itfnum
|
[107:8:010000001c0000001c0000000000000000000000] < 110008060c0000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
no TCI nor TPID--^ ^---ethtype=0x0806=ARP
[107:8:510000001c0000001c0000000000000023000081] < 11000806020000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
TCI=0x23,TPID=0x8100--^ ^---ethtype=0x0806=ARP
WITHOUT VLAN OFFLOADING:
itfnum
|
[107:8:010000001c0000001c0000000000000000000000] < 110008060c0000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
no TCI nor TPID--^ ^---ethtype=0x0806=ARP
[107:8:010000001c0000001c0000000000000000000000] < 11008100020000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
no TCI nor TPID--^ ^---ethtype=0x8100=WTF???
As you can see, in all cases the subinterface (first packet of each pair) is handled correctly, with the proper original ethertype of the frame surfacing in the sll struct. For the parent interface (second packet of each pair) however, things are different:
As you said, it's a kernel bug, and it is hilarious.
Note that this happens only for outgoing packets. Below is the output for the very same ARP packet, but now incoming. In all cases (with or without hardware VLAN offloading), I get the proper information: (note the packets are swapped, the parent interface talks first this time)
itfnum
|
[107:8:510000002a0000002a0000000000000023000081] < 110008060200000001000106b827eb8a54160000 (00000000) 0001080006040001b827eb8a541601002302000000000000010023030000000000000000000000000000
TCI=0x23,TPID=0x8100--^ ^---ethtype=0x0806=ARP
[107:8:010000002a0000002a0000000000000000000000] < 110008060c00000001000106b827eb8a54160000 (00000000) 0001080006040001b827eb8a541601002302000000000000010023030000000000000000000000000000
no TCI nor TPID--^ ^---ethtype=0x0806=ARP
And interestingly, the ethertype of outgoing packets gets mangled in the SOCK_RAW
case too:
itfnum
|
[107:8:010000002a0000002a00000000000e0000000000] < 110008060c0000000100040684a93e3ff5780000 (00000000) ffffffffffff84a93e3ff5780806000108000604000184a93e3ff5780100230100000000000001002303
no TCI nor TPID--^ ^---ethtype=0x806=ARP
[107:8:010000002e0000002e0000000000120000000000] < 11008100020000000100040684a93e3ff5780000 (00000000) ffffffffffff84a93e3ff578810000230806000108000604000184a93e3ff5780100230100000000000001002303
no TCI nor TPID--^ ^---ethtype=0x8100=WTF???
Of course, this is hardly ever a problem, as in this case (AF_PACKET+SOCK_RAW), the whole L2 frame is available for the user to re-parse.
Adding evidence.
I dug further by using the ftrace
kernel utility, to look at the early life of the skb
.
I walked back from packet_recvmsg
(dequeueing from the raw socket) to packet_rcv
(enqueueing) to dev_queue_xmit_nit
(tapping).
Below you can see the usual pair of outgoing packets on a VLAN'ed non-offloading interface wlan0
.
The important line says proto= :
skb->proto
skb->proto
A stack trace is printed before each such event:
arp_solicit
: this is the ARP request being sent on the subinterface wlan0.24
wlan0
...$ ftrace -stack dev_queue_xmit_nit = 'skb=%bp' 'proto=+0xb0(%bp):x16' k:dev_queue_xmit_nit+311
...
FIRST PACKET : SUBINTERFACE
<idle>-0 [003] .Ns. 188379.276689: dev_queue_xmit_nit+0x0/0x2a0 <-dev_hard_start_xmit+0x64/0x1e0
<idle>-0 [003] .Ns. 188379.276718: <stack trace>
=> 0xffffffffc054006a
=> dev_queue_xmit_nit+0x5/0x2a0
=> dev_hard_start_xmit+0x64/0x1e0
=> __dev_queue_xmit+0x6aa/0x9b0
=> arp_solicit+0xf2/0x300
=> neigh_probe+0x4c/0x60
=> neigh_timer_handler+0x203/0x2e0
=> call_timer_fn+0x29/0xf0
=> __run_timers.part.0+0x1d3/0x240
=> run_timer_softirq+0x26/0x50
=> __do_softirq+0xc5/0x275
=> asm_call_irq_on_stack+0x12/0x20
=> do_softirq_own_stack+0x37/0x40
=> irq_exit_rcu+0x8e/0xc0
=> sysvec_apic_timer_interrupt+0x36/0x80
=> asm_sysvec_apic_timer_interrupt+0x12/0x20
=> cpuidle_enter_state+0xc7/0x350
=> cpuidle_enter+0x29/0x40
=> do_idle+0x1ef/0x2b0
=> cpu_startup_entry+0x19/0x20
=> secondary_startup_64_no_verify+0xb0/0xbb
<idle>-0 [003] .Ns. 188379.276734: kprobe_dev_queue_xmit_nit: (dev_queue_xmit_nit+0x137/0x2a0) skb=0xffff95284b126b00 proto=0x608
SECOND PACKET: PARENT INTERFACE
<idle>-0 [003] .Ns. 188379.276751: dev_queue_xmit_nit+0x0/0x2a0 <-dev_hard_start_xmit+0x64/0x1e0
<idle>-0 [003] .Ns. 188379.276790: <stack trace>
=> 0xffffffffc054006a
=> dev_queue_xmit_nit+0x5/0x2a0
=> dev_hard_start_xmit+0x64/0x1e0
=> __dev_queue_xmit+0x6aa/0x9b0
=> vlan_dev_hard_start_xmit+0x4a/0xf0 [8021q]
=> dev_hard_start_xmit+0xc7/0x1e0
=> __dev_queue_xmit+0x6aa/0x9b0
=> arp_solicit+0xf2/0x300
=> neigh_probe+0x4c/0x60
=> neigh_timer_handler+0x203/0x2e0
=> call_timer_fn+0x29/0xf0
=> __run_timers.part.0+0x1d3/0x240
=> run_timer_softirq+0x26/0x50
=> __do_softirq+0xc5/0x275
=> asm_call_irq_on_stack+0x12/0x20
=> do_softirq_own_stack+0x37/0x40
=> irq_exit_rcu+0x8e/0xc0
=> sysvec_apic_timer_interrupt+0x36/0x80
=> asm_sysvec_apic_timer_interrupt+0x12/0x20
=> cpuidle_enter_state+0xc7/0x350
=> cpuidle_enter+0x29/0x40
=> do_idle+0x1ef/0x2b0
=> cpu_startup_entry+0x19/0x20
=> secondary_startup_64_no_verify+0xb0/0xbb
<idle>-0 [003] .Ns. 188379.276799: kprobe_dev_queue_xmit_nit: (dev_queue_xmit_nit+0x137/0x2a0) skb=0xffff95284b126200 proto=0x81
Bad news: it looks like this bug has been in Linux since ... the Big Bang :( At least, I've verified its presence on 3.10 !
Note: I reported this to the netdev mailing list: https://marc.info/?l=linux-netdev&m=165074467517201&w=4
This is being discussed in Linux netdev again.
https://lore.kernel.org/netdev/20240520070348.26725-1-chengen.du@canonical.com/
Sorry to have missed your earlier report, Alexandre. And I thought about Cc:ing you only just after hitting send. Heads-up this way.
Heh. Thanks for the heads-up Willem. Following up there.
VLAN support in SLL was acceptable, if not perfect: single tags were properly handled, while multiple (nested) tags weren't.
In SLL2, even single tags are completely botched: for the parent (non-tagged) interface, the 802.1Q header lands in the middle of the link-layer src addr, and the payload is shifted by 4 bytes, making its further decoding impossible:
For example: here is an ARP packet going out on VLAN 252 of eth10. The first packet, being on the sub-interface, is okay. The second, being on the parent interface, is broken:
This can be traced down to the hex contents, extracted from "tcpdump -i any" into a pcap file: Below "address" and "padd" give the position of the proper ll address (6-bytes) padded to a 8-byte field:
Here in the second packet we can see the "****" showing the inserted "810000FC" which is the 802.1Q header "VLAN 252", and the way it is somehow "inserted" in the middle of the address (after its first two bytes). This insertion pushes the actual payload 4 bytes to the right.
Is there any reason to do special handling with VLANs ? Why not just copy the ethertype into the first two bytes of the SLL2 header and stick all the rest in the payload ?