the-tcpdump-group / libpcap

the LIBpcap interface to various kernel packet capture mechanism
https://www.tcpdump.org/
Other
2.66k stars 844 forks source link

SLL=>SLL2: regression on VLAN tagging #1105

Open ferrieux opened 2 years ago

ferrieux commented 2 years ago

VLAN support in SLL was acceptable, if not perfect: single tags were properly handled, while multiple (nested) tags weren't.

In SLL2, even single tags are completely botched: for the parent (non-tagged) interface, the 802.1Q header lands in the middle of the link-layer src addr, and the payload is shifted by 4 bytes, making its further decoding impossible:

For example: here is an ARP packet going out on VLAN 252 of eth10. The first packet, being on the sub-interface, is okay. The second, being on the parent interface, is broken:

17:46:45.413191 eth10.252 Out ARP, Request who-has 0.0.252.9 tell 0.0.252.8, length 28
17:46:45.413193 eth10 Out ARP, Unknown Hardware (8) (len 0), Unknown Protocol (0x0000) (len 1), length 32

This can be traced down to the hex contents, extracted from "tcpdump -i any" into a pcap file: Below "address" and "padd" give the position of the proper ll address (6-bytes) padded to a 8-byte field:

# First packet, OK
<-------SLL2-header--------------------> <---payload-----...>
0806000000000005000104060200000300080000 00010800060400010200000300080000fc080000000000000000fc09 
                        <--address->padd

Here in the second packet we can see the "****" showing the inserted "810000FC" which is the 802.1Q header "VLAN 252", and the way it is somehow "inserted" in the middle of the address (after its first two bytes). This insertion pushes the actual payload 4 bytes to the right.

# Second packet, botched
<-------SLL2-header--------------------> <---payload-----...>
0806000000000003000104060200810000fc0003 0008000000010800060400010200000300080000fc080000000000000000fc09       
                        <-AD********DRES S-->PADD

Is there any reason to do special handling with VLANs ? Why not just copy the ethertype into the first two bytes of the SLL2 header and stick all the rest in the payload ?

guyharris commented 2 years ago

Is there any reason to do special handling with VLANs ?

Yes.

The reason is "that annoying piece of software called the Linux kernel does special handling with VLANs, so that the VLAN tag gets extracted into skbuff metadata and, instead of reinserting the tags back into the packet before handing it to the socket filter and SOCK_PACKET receive code, so that the filter is applied to the packet as it appeared on the wire, it's applied to the filter with the VLAN tag extracted, so the filter code has to do weird special stuff".

Why not just copy the ethertype into the first two bytes of the SLL2 header and stick all the rest in the payload ?

Because the packet should appear as a VLAN packet, with the appropriate value in the Ethertype, i.e. 0x8100.

ferrieux commented 2 years ago

Ok, I understand why the kernel sets the tag aside ; what I'm wondering about is why not hide that complexity from the SLLv[12] observers, by shuffling a few bytes around so that the frame is sent to them as it appeared on the wire ?

guyharris commented 2 years ago

Ok, I understand why the kernel sets the tag aside ; what I'm wondering about is why not hide that complexity from the SLLv[12] observers, by shuffling a few bytes around so that the frame is sent to them as it appeared on the wire ?

To which observers are you referring?

We can't do that for live capture filters except by submitting a kernel patch.

We can do that once we receive a packet, before delivering it to the callback, and that's what our code is attempting to do. There's probably a bug in the code that does that for SLL2.

ferrieux commented 2 years ago

Yes, I was referring to the callback side :)

Note that for the single-VLAN case, the bug amounts to a mere permutation, no information is lost (I verified this by fixing it in post-processing of SLL2 output). But for the multiple-VLAN/QinQ case, things get murky, and the second VLAN is lost (and was in SLL1 too). Is there a deep reason, or is this fixable too ?

guyharris commented 2 years ago

Note that for the single-VLAN case, the bug amounts to a mere permutation, no information is lost (I verified this by fixing it in post-processing of SLL2 output).

The SLL2 code may still have some bits of SLL code in it, putting things in the wrong places.

But for the multiple-VLAN/QinQ case, things get murky, and the second VLAN is lost (and was in SLL1 too). Is there a deep reason, or is this fixable too ?

The "any" device uses PF_PACKET/SOCK_DGRAM sockets; those strip off whatever is set up in the skbuff as a "MAC header", which probably ends up removing everything before the network-layer header, including VLAN tags.

If it used PF_PACKET/SOCK_RAW, some supplied packets might have Ethernet headers, some might have 802.11 headers if capturing in monitor mode, and some might have other link-layer headers. That'd make filtering more complicated, and pcap files would require another new link-layer header type, as the ARPHRD_ value would have to be supplied in a pseudo-header so that programs processing the packets would know the link-layer type of each packet.

With pcapng, the capture could include multiple Interface Description Blocks, with each packet having an interface ID referring to the interface on which they arrived, which would remove the "new link-layer type" problem. It would still require more complicated filtering when doing live captures, with BPF programs having to, for example, fetch the link-layer type before doing any link-layer checks such as MAC address checks, and either having to calculate the offset of the network-layer header from that or fetch it if supplied as metadata detachable with BPF (which I think it is). For pcapng files, there could be separate user-mode filter programs, one per interface (with an optimization being to allow interfaces sharing the same link-layer type to share the program).

The current filter compiler probably can't be easily change to generate those "multiple link-layer" programs; something that compiles the filter into an abstract form that's independent of the link-layer type, and that can then either turn that into a "multiple-link-layer program" or into a program for a specific link-layer type, might make it easier.

ferrieux commented 2 years ago

Thanks for giving the perspective for "any".

However, when capturing a single device like "eth1", the whole frame is passed untouched, and indeed any BPF filter needing to "dig deep" must do so explicitly, without relying on some magical L2-stripping. For example, if we create eth1.24.68 with nested tags 24 and 68, an udp filter on the parent interface eth1 must look like "vlan 24 and vlan 68 and udp", while "udp" won't work.

Similarly, we routinely filter across fixed-depth MPLS stacks, or even through other non-pcapfilter-supported encapsulations like GTP, by using hand-computed offsets and "ether[OFFSET]&MASK=VALUE". The generated BPF bytecode of JIT assembler works just as fast, it is only slightly uglier as a syntax :)

My point is that it is perfectly acceptable for a single interface, so the "tricky BPF" argument doesn't hold there; is it different for "any" ?

guyharris commented 2 years ago

However, when capturing a single device like "eth1", the whole frame is passed untouched

It's passed untouched from libpcap to the code using libpcap.

If it's a VLAN-tagged frame, however, it is not passed untouched from the kernel to libpcap; the VLAN tagged are stripped, even when capturing on a PF_PACKET/SOCK_RAW socket, and 1) the in-kernel filtering code has to be different in order to test whether the frame is a VLAN frame or to test its VLAN tag, and in order to look at fields past the VLAN tag using the correct order. libpcap then has to reconstruct the "untouched" frame given the VLAN tag metadata.

However, when capturing a single device like "eth1", the whole frame is passed untouched, and indeed any BPF filter needing to "dig deep" must do so explicitly, without relying on some magical L2-stripping.

The "any" device is, currently, best thought of as a supplier of network-layer traffic, not link-layer traffic. You don't get the full link-layer header, so any filters at a lower-level aren't meaningful - there's nothing to "dig deep" through.

Similarly, we routinely filter across fixed-depth MPLS stacks, or even through other non-pcapfilter-supported encapsulations

$ sudo tcpdump -i ens33 -d mpls and udp
(000) ldh      [12]
(001) jeq      #0x8847          jt 2    jf 19
(002) ldb      [16]
(003) and      #0x1
(004) jeq      #0x1             jt 5    jf 19
(005) ldb      [18]
(006) and      #0xf0
(007) jeq      #0x60            jt 8    jf 13
(008) ldb      [24]
(009) jeq      #0x11            jt 18   jf 10
(010) jeq      #0x2c            jt 11   jf 19
(011) ldb      [58]
(012) jeq      #0x11            jt 18   jf 19
(013) ldb      [18]
(014) and      #0xf0
(015) jeq      #0x40            jt 16   jf 19
(016) ldb      [27]
(017) jeq      #0x11            jt 18   jf 19
(018) ret      #262144
(019) ret      #0

like GTP

Support for which should probably be added to libpcap for filtering (there might even be a PR for it somewhere).

My point is that it is perfectly acceptable for a single interface, so the "tricky BPF" argument doesn't hold there; is it different for "any" ?

Because a single interface has a single link-layer type, whereas the "any" device provides packets from all network interfaces, and there's no guarantee that there is a single-link-layer type for all of the interfaces - and, even if it's true at the time the capture starts, it's not guaranteed to remain true. If, for example, a PPP session is initiated, a new PPP interface may pop up, and it's not guaranteed to have a link-layer type of "Ethernet" (Linux not being Windows :-)).

That's why the "any" device is currently a supplier of "network-layer" traffic. Changing it to use PF_PACKET/SOCK_RAW and provide link-layer traffic, would involve all the items mentioned above, including the tricky BPF code to handle, in one filter program, multiple different link layers, with the link layer presumably being fetched with a special load from a packet offset of SKF_AD_OFF + SKF_AD_PROTOCOL.

And, in that case, udp wouldn't capture VLAN-encapsulated UDP packets on the raw-mode "any" device - you'd need vlan and udp. For backwards compatibility, we might have to implement that as an "anyraw" device - or have it run in cooked mode when using the current pcap-oriented capture and reading APIs and raw mode when using some future pcapng-oriented capture and reading APIs.

guyharris commented 2 years ago

I can't see anything in our current version of libpcap that would cause VLAN tag insertion for SLL2 at all; in activate_pf_packet(), we set handlep->vlan_offset to -1 for all link-layer types other than DLT_EN10MB and DLT_LINUX_SLL, and if it's -1, we don't insert VLAN tags.

Perhaps the version of libpcap you're using has an "improvement" to insert tags for SLL2, written by somebody who didn't understand the code well enough to realize that, unfortunately, the code was originally written for Ethernet (where the type field is at the end of the link-layer header, so you just move the last 2 bytes of the link-layer header down 4 bytes and drop in the VLAN header), and modified for SLL2 (where the type field is again at the end of the link-layer header), and therefore that, to handle SLL2 (where the type field is at the beginning of the link-layer header), you have to change the code to do the insertion differently.

What version of what Linux distribution are you using? Perhaps they inserted a bug with an "improvement" such as that.

ferrieux commented 2 years ago

Thanks, indeed seeing "any" as a provider of network-layer traffic is the key. Somehow that was not obvious from the name. Then, you're right, an "anyraw" would cover all the needs of multi-interface, encapsulation-agnostic sniffers. Note that such a tool doesn't need to pretend there is a common link encap , as this could simply be a per-packet field:

  struct anyraw_header {
         uint16_t direction:1;
         uint16_t itfnum:15;
         uint16_t dlt;   /* most often DLT_EN10MB ; possibly DLT_RAW for L3 devices */
         char frame[];
  };
ferrieux commented 2 years ago

What version of what Linux distribution are you using? Perhaps they inserted a bug with an "improvement" such as that.

We are observing this on current Debian stable (11.3):

ii  tcpdump                           4.99.0-2                                         amd64        command-line network traffic analyzer
ii  libpcap0.8:amd64                  1.10.0-2                                         amd64        system interface for user-level packet capture
guyharris commented 2 years ago

We are observing this on current Debian stable (11.3):

That's... odd, as I downloaded the libpcap source package's set of Debian files and couldn't find any patch that would cause it to insert VLAN tags for SLL2. I'll check again, and ignore the names of the patches, and also download the "unmodified source tarball" file.

guyharris commented 2 years ago

I'll check again, and ignore the names of the patches, and also download the "unmodified source tarball" file.

Either I've really missed something, or there's nothing in the Debian source package to cause it to try to insert VLAN tags for SLL2.

Could you try it with the current 1.10.1 release (and tcpdump 4.99.1 release, built with that release) from www.tcpdump.org?

mcr commented 2 years ago

some debian specific kernel patch?

ferrieux commented 2 years ago

Just done so, same results. Reproducing steps:

$ ip link add link wlan0 wlan0.24 type vlan id 24
$ ifconfig wlan0.24 1.0.24.1/24
$ ping -n 1.0.24.3 > /dev/null 2>&1 &
$ tcpdump.4.99.1 -nn -i any -Q out not tcp and not udp
tcpdump.4.99.1: data link type LINUX_SLL2
tcpdump.4.99.1: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:14:02.771986 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
16:14:02.771993 wlan0 Out ethertype IPv4, IP0 (invalid) #<==== THE ANNOYING ONES
16:14:03.795923 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
16:14:03.795931 wlan0 Out ethertype IPv4, IP0 (invalid) #<==== THE ANNOYING ONES
16:14:04.819719 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
16:14:04.819724 wlan0 Out ethertype IPv4, IP0 (invalid) #<==== THE ANNOYING ONES
^C

Note tcpdump.4.99.1 is statically linked with just-compiled libpcap.a:

$ ldd `which tcpdump.4.99.1`
        linux-vdso.so.1 (0x00007ffec0711000)
        libdbus-1.so.3 => /lib/x86_64-linux-gnu/libdbus-1.so.3 (0x00007eff06245000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007eff06080000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007eff0605e000)
        libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x00007eff05fb6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007eff0658a000)
        libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x00007eff05fab000)

Kernel details:

ii  linux-image-5.10.0-10-amd64              5.10.84-1                                           amd64        Linux 5.10 for 64-bit PCs (signed)

$ uname -a
Linux hpalex 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08) x86_64 GNU/Linux
guyharris commented 2 years ago

Unfortunately, I couldn't set up a VLAN on the bridge adapter that appears to implement communication between my VMware Fusion virtual machines and my Mac, so I couldn't test it myself.

Please apply the attached patch to pcap-linux.c in the libpcap 1.10.1 source, recompile and rebuild tcpdump with the new library, and see what it prints to the standard error.

patch.txt

ferrieux commented 2 years ago
DLT_LINUX_SLL: set vlan_offset to 14
tcpdump.4.99.1: data link type LINUX_SLL2
tcpdump.4.99.1: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

Re your setup: please note that you don't need a working end-to-end VLAN; just like I did with ping eliciting only ARP requests without a single response, you only need to set up the VLAN within the Linux VM. That's enough for tcpdump -i any -Q out

guyharris commented 2 years ago

please note that you don't need a working end-to-end VLAN; just like I did with ping eliciting only ARP requests without a single response, you only need to set up the VLAN within the Linux VM.

Thanks! That worked - with tcpdump built with a libpcap prior to 4bfca3682e5aeabe05b4406daf00c9abcc36c571, it got bogus packets on the VLAN interface, but with a tcpdump built with a libpcap after 4bfca3682e5aeabe05b4406daf00c9abcc36c571, it got normal packets.

That fix should be in whatever release is the next libpcap release.

ferrieux commented 2 years ago

OK, in the git log I see you've just committed this, so I cloned and compiled HEAD of both libpcap and tcpdump. However the test still fails the same way. Now looking at the commit, I wonder about your set_vlan_offset function: it seems to take care of SLLv1 but not SLLv2. Is this intended ?

static void
set_vlan_offset(pcap_t *handle)
{
  ...
  switch (handle->linktype) {
    case DLT_EN10MB:
    ...
    case DLT_LINUX_SLL:
    ....
    default:
  ...
}
guyharris commented 2 years ago

OK, in the git log I see you've just committed this, so I cloned and compiled HEAD of both libpcap and tcpdump. However the test still fails the same way.

OK, try this patch to pcap-linux.c, which adds the same debugging printouts that the previous patch did, and which helped me find the problem in the first place:

patch.txt

Now looking at the commit, I wonder about your set_vlan_offset function: it seems to take care of SLLv1 but not SLLv2. Is this intended ?

Yes.

To handle SLL2 would require significant changes to the tag inserting code in pcap_handle_packet_mmap(); for SLL2, instead of just moving backwards, by 4 bytes, everything but the last 2 bytes of the link-layer header, and inserting the 4-byte tag in the opened-up space - which works for Ethernet and SLL, as the Ethertype is in the last 2 bytes of the link-layer header - it would have to:

so that you have:

+---------------------------+
|          VLAN TPID        |
|         (2 Octets)        |
+---------------------------+
|       Reserved (MBZ)      |
|         (2 Octets)        |
+---------------------------+
|       Interface index     |
|         (4 Octets)        |
+---------------------------+
|        ARPHRD_ type       |
|         (2 Octets)        |
+---------------------------+
|         Packet type       |
|         (1 Octet)         |
+---------------------------+
| Link-layer address length |
|         (1 Octets)        |
+---------------------------+
|    Link-layer address     |
|         (8 Octets)        |
+---------------------------+
|          VLAN TCI         |
|         (2 Octets)        |
+---------------------------+
|          Protocol         |
|         (2 Octets)        |
+---------------------------+
|           Payload         |
.                           .
.                           .
.                           .

Then make sure that all SLL2 dissectors support that. (It looks as if tcpdump's SLL2 dissector will. I haven't checked Wireshark, but if it assumes that, after the 0x8100, you have a TCI followed by the payload's Ethertype, it'll work.)

However, given that if you have QinQ, the second tag will, I expect, not be provided by the kernel, a case could be made that SLL2 and SLL shouldn't provide tags. The SLL handling of VLAN tags came from this bug. My first fix was 4cc089d1423abe0c7a99e80485eed31f30309792, which caused tags not to be inserted for SLL; I then committed 3f53c50e7d78efe0a74289fb406d749f740577c0, which "[Used] Jakub Zawadzki's changes for that, but put the VLAN tag offset into the pcap structure and [calculated] it at activation time." In retrospect, if additional tags aren't provided on cooked-mode sockets, perhaps they shouldn't be inserted on cooked-mode captures, even for SLL.

ferrieux commented 2 years ago

Stderr from the patch:

DLT_LINUX_SLL: set vlan_offset to 14
LINUX_SLL2: set vlan_offset to -1
tcpdump.HEAD.patch: data link type LINUX_SLL2
tcpdump.HEAD.patch: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
ferrieux commented 2 years ago

One side note: in case you want to try QinQ, the same "VM-side only" trick as before does the job:

ip link add link wlan0.24 wlan0.24.68 type vlan id 68
ifconfig wlan0.24.68 1.0.68.1/24
ping -n 1.0.68.3 > /dev/null 2>&1 &

In this case, each packet should come out thrice: once for wlan0, once for wlan0.24, and once for wlan0.24.68. The latter is already Ok since both tags have already been decapsulated. The other two need fixing in order to get at least valid packets.

guyharris commented 2 years ago

Stderr from the patch:

...

LINUX_SLL2: set vlan_offset to -1

OK, so that means that, after tcpdump changes the link-layer type to DLT_LINUX_SLL2, it sets the vlan_offset value to -1, meaning "don't insert tags.

tcpdump.HEAD.patch: data link type LINUX_SLL2
tcpdump.HEAD.patch: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

So it's not reporting any messages that indicate that it's inserting VLAN tags. Are you sure it's still changing the packet contents? If so, could you attach a capturing using that code?

ferrieux commented 2 years ago

Here you go:

sll2.cap.gz

contents:

10:20:10.749051 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
10:20:10.749058 wlan0 Out ethertype IPv4, IP0 (invalid)
10:20:11.773037 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
10:20:11.773042 wlan0 Out ethertype IPv4, IP0 (invalid)
10:20:12.796985 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
10:20:12.796993 wlan0 Out ethertype IPv4, IP0 (invalid)

Looking into the bytes, I see that the patch did change the layout, so there is no longer any tag insertion, but still the packets on the parent interface (even-numbered packets above) are invalid, because the "protocol" field is set to 0x8100, while the rest of the header and frame are identical (except for interface number of course) to those on the subinterface (odd-numbered packets above):

#layout for valid packet #1 to subinterface wlan0.24
0806 0000 00000005 0001040600bb6036ad620000 000108000604000100bb6036ad620100180100000000000001001803

^^^^      ^^^^^^^^

#layout for invalid packet #2 to parent interface wlan0
8100 0000 00000003 0001040600bb6036ad620000 000108000604000100bb6036ad620100180100000000000001001803

So what's missing is just the actual reinsertion of the tag (0018) +protocol (0806) : to become valid, that packet should be:

#intended layout for packet #2 to parent interface wlan0
8100 0000 00000003 0001040600bb6036ad620000 00180806 000108000604000100bb6036ad620100180100000000000001001803
guyharris commented 2 years ago

In today's episode of "I hate the Linux networking stack", I set up a Debian 11.3.0 VM, with a 5.10.0-13-amd64 kernel, built top-of-the-main branch libpcap and tcpdump, set up two layers of VLAN interface (boring packet statistics removed):

ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.135.200  netmask 255.255.255.0  broadcast 172.16.135.255
        inet6 fe80::20c:29ff:fec2:84cf  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:c2:84:cf  txqueuelen 1000  (Ethernet)

ens33.10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::20c:29ff:fec2:84cf  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:c2:84:cf  txqueuelen 1000  (Ethernet)

ens33.10.68: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 1.0.68.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::20c:29ff:fec2:84cf  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:c2:84:cf  txqueuelen 1000  (Ethernet)

started up 4 tcpdumps:

$ sudo ./tcpdump -i any -w /tmp/deb113-any-sll2.pcap
$ sudo ./tcpdump -i any -y LINUX_SLL -w /tmp/deb113-any-sll.pcap
$ sudo ./tcpdump -i ens33 -w /tmp/deb113-ens33.pcap
$ sudo ./tcpdump -i ens33.10.68 -w /tmp/deb113-ens33.10.68.pcap

and then ran ping 10.68.1.2 for a few failed pings.

According to Wireshark's reading of all four of those files, at around 2022-04-11 22:19.00.273149:

In the SLL2 "any" capture, I see:

In the SLL "any" capture, I see The. Exact. Same. Thing. One ICMPv6 Router Solicitation, and what looks like mangled versions of the same thing - the Protocol field is obviously in a different location in the SLL header, and the SLL header has no interface ID, so the only difference is that the two mangled packets have 0x8100 rather than 0x86dd in the Protocol field.

In the ens33.10.68 capture - which is raw, not cooked, so I get the "full" link-layer header, albeit without any VLAN tags as they were removed in the reception process - I get the ICMPv4 Router Solicitation, at 2022-04-11 22:19:00.273150.

In the ens33 capture - again, raw, not cooked, and the VLAN tags should all be there, as that's not a VLAN interface - I get the ICMPv6 Router Solicitation, with two layers of VLAN tag, one for the ID 10 VLAN and one for the ID 68 VLAN, at 2022-04-11 22:19:00.273157.

My brain wasn't fully engaged, so I didn't do an ens33.10 capture, but I will bet a silk pajama that I would have gotten an ICMPv6 Router Solicitation, with one layer of VLAN tag, for the ID 68 VLAN, at about that time, in such a capture.

So, somehow, the 0x86ddness of that packet is getting lost somewhere between the adapter driver and tcpdump, whether it's in the networking stack below the PF_PACKET code, the PF_PACKET code, or libpcap. Once more unto the breach....

guyharris commented 2 years ago

So, somehow, the 0x86ddness of that packet is getting lost somewhere between the adapter driver and tcpdump, whether it's in the networking stack below the PF_PACKET code, the PF_PACKET code, or libpcap.

A quick experiment found that, for at least one packet received on the "any" device, the protocol supplied in the recvfrom address and the protocol in the VLAN auxiliary data are both 0x8100. Either there's an 0x86dd, or some other real Ethertype value corresponding to IPv4 or ARP or IPv6 or... hidden somewhere that a PF_PACKET/SOCK_DGRAM socket reader can get it, or that information is gone and, well, VLANs and cooked captures do not work together.

ferrieux commented 2 years ago

Two comments about VLAN+cooked not working together: (1) in all tcpdump+libpcap prior to the Debian11 (sorry I tend to use packaged versions), "-i any" worked with a single tag, producing the (SLLv1 equivalent of the) layout I wrote above as "intended layout". So it looks like the info is there. (2) digging a bit in the current code of libpcap, I notice the use of the PACKET_AUXDATA sockopt is #ifdef'ed out, because the configure script somehow fails to set HAVE_PACKET_AUXDATA. I am really ignorant in autoconf, so this is just an impression, but my gut feeling is that there is not even a test for it, in the .ac input. Would this account for the missing info ?

guyharris commented 2 years ago

digging a bit in the current code of libpcap, I notice the use of the PACKET_AUXDATA sockopt is #ifdef'ed out, because the configure script somehow fails to set HAVE_PACKET_AUXDATA. I am really ignorant in autoconf, so this is just an impression, but my gut feeling is that there is not even a test for it, in the .ac input. Would this account for the missing info ?

Prior to f4a26d7d25ac19e547c2f5b6ca4fa60355fb3654, pcap-linux.c did

 /*
  * On at least some Linux distributions (for example, Red Hat 5.2),
  * there's no <netpacket/packet.h> file, but PF_PACKET is defined if
  * you include <sys/socket.h>, but <linux/if_packet.h> doesn't define
  * any of the PF_PACKET stuff such as "struct sockaddr_ll" or any of
  * the PACKET_xxx stuff.
  *
  * So we check whether PACKET_HOST is defined, and assume that we have
  * PF_PACKET sockets only if it is defined.
  */
# ifdef PACKET_HOST
#  define HAVE_PF_PACKET_SOCKETS
#  ifdef PACKET_AUXDATA
#   define HAVE_PACKET_AUXDATA
#  endif /* PACKET_AUXDATA */
# endif /* PACKET_HOST */

That change removed support for SOCK_PACKET sockets, so it removed the test for whether we have PF_PACKET sockets, as the code now doesn't work if you don't have them.

The

#  ifdef PACKET_AUXDATA
#   define HAVE_PACKET_AUXDATA
#  endif /* PACKET_AUXDATA */

stuff was also removed, but it shouldn't have been removed.

The PACKET_AUXDATA stuff was added in e59abf81d3de639e389393989c7823ac8a68f663; I'm not sure why their change added HAVE_PACKET_AUXDATA rather than just testing whether PACKET_AUXDATA is defined.

We no longer support kernels prior to 2.6.27; the 2.6.27 kernel defined PACKET_AUXDATA, so just removing the tests for it is the correct fix.

guyharris commented 2 years ago

digging a bit in the current code of libpcap, I notice the use of the PACKET_AUXDATA sockopt is #ifdef'ed out, because the configure script somehow fails to set HAVE_PACKET_AUXDATA. I am really ignorant in autoconf, so this is just an impression, but my gut feeling is that there is not even a test for it, in the .ac input. Would this account for the missing info ?

Prior to f4a26d7, pcap-linux.c did

 /*
  * On at least some Linux distributions (for example, Red Hat 5.2),
  * there's no <netpacket/packet.h> file, but PF_PACKET is defined if
  * you include <sys/socket.h>, but <linux/if_packet.h> doesn't define
  * any of the PF_PACKET stuff such as "struct sockaddr_ll" or any of
  * the PACKET_xxx stuff.
  *
  * So we check whether PACKET_HOST is defined, and assume that we have
  * PF_PACKET sockets only if it is defined.
  */
# ifdef PACKET_HOST
#  define HAVE_PF_PACKET_SOCKETS
#  ifdef PACKET_AUXDATA
#   define HAVE_PACKET_AUXDATA
#  endif /* PACKET_AUXDATA */
# endif /* PACKET_HOST */

That change removed support for SOCK_PACKET sockets, so it removed the test for whether we have PF_PACKET sockets, as the code now doesn't work if you don't have them.

The

#  ifdef PACKET_AUXDATA
#   define HAVE_PACKET_AUXDATA
#  endif /* PACKET_AUXDATA */

stuff was also removed, but it shouldn't have been removed.

The PACKET_AUXDATA stuff was added in e59abf8; I'm not sure why their change added HAVE_PACKET_AUXDATA rather than just testing whether PACKET_AUXDATA is defined.

We no longer support kernels prior to 2.6.27; the 2.6.27 kernel defined PACKET_AUXDATA, so just removing the tests for it is the correct fix.

However, it appears that the memory-mapped receive code path in the kernel doesn't test whether PACKET_AUXDATA was set, so this probably won't change anything - and, arguably, the PACKET_AUXDATA can just be removed. I guess it was there for the benefit of code using regular socket receives to read the data, so that code ignorant of the auxiliary data doesn't have to worry about freeing it, or something such as that; code using memory-mapped access that isn't interested in the auxiliary data doesn't have to do anything special, and has no need to control whether it's provided or not.

ferrieux commented 2 years ago

OK, thanks for the clarification, indeed I was reasoning with the naive model of recv()-based captures :) Now back to our initial problem, what is your intention:

guyharris commented 2 years ago

OK, so I've attached a version of pcap-linux.c that attempts to do tag insertion for LINUX_SLL2.

It doesn't do anything about QinQ, but that's because the combination of QinQ and PF_PACKET/SOCK_DGRAM appears to be inherently broken. (An image comes to mind, in which the cover of this book is modified, replacing the bunny with Tux and replacing the title with a phrase ending with "again". But I digress....)

(".txt" appended to "pcap-linux.c" to work around GitHub not realizing that C source code files are text.)

pcap-linux.c.txt

guyharris commented 2 years ago

(Note: the statement made by the modified book title is a statement about Tux, acting as a proxy for the Linux networking stack, not about any participants in this discussion. Apologies if any participants thought the statement in question was referring to them.

No apologies to the Linux networking stack; the PF_PACKET code, and the other parts of the Linux networking stack that feed it, have been a persistent source of problems for libpcap, resulting in the addition of a number of ugly workarounds. Yeah, maybe the BPF mechanism is a separate mechanism rather than an integral part of the networking stack, but something to which a driver hands an as-yet-unmodified copy of the link-layer packet, before the stack gets to mangle it in ways that libpcap, tcpdump, and Wireshark have had to work around, seems Really Refreshingly Nice to me.)

ferrieux commented 2 years ago

Same as before:

$ tcpdump.HEAD.patch2 -nn -i any  -Q out not tcp and not udp 
tcpdump.HEAD.patch2: data link type LINUX_SLL2
tcpdump.HEAD.patch2: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
21:05:58.809685 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
21:05:58.809687 wlan0 Out ethertype IPv4, IP0 (invalid)
21:05:59.833505 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
21:05:59.833513 wlan0 Out ethertype IPv4, IP0 (invalid)
21:06:00.857250 wlan0.24 Out ARP, Request who-has 1.0.24.3 tell 1.0.24.1, length 28
21:06:00.857257 wlan0 Out ethertype IPv4, IP0 (invalid)

Layout of the invalid packets:

Missing VLAN+Type here-----------------------v 
8100 0000 00000003 0001040600bb6036ad620000 **** 000108000604000100bb6036ad620100180100000000000001001803

Adding a few printf's here and there, I believe this is because tp_vlan_tci_valid ends up being always 0 on entry to pcap_handle_packet_mmap, which falsifies the test

        if (tp_vlan_tci_valid && handlep->etype_offset != -1 &&
            tp_snaplen >= (unsigned int)handlep->etype_offset + 2)
        {
                uint16_t etype, vlan_tpid, vlan_tci;

                /*                                                                                                                                                                                                                                                                                                             
                 * Add VLAN tag information.                                                                                                                                                                                                                                                                                   

Assuming you tried and succeeded, there must be something special with my wlan0 interface, right ?

guyharris commented 2 years ago

For both the TPACKET_V2 and TPACKET_V3 calls, tp_vlan_tci_valid is 1 if either:

  1. the tp_vlan_tci field of the VLAN tag information provided by the kernel is non-zero;
  2. the TP_STATUS_VLAN_VALID bit is set in the tp_status field of the tpacket header.

If neither of them are set, maybe the kernel is just not providing the VLAN information - but that shouldn't differ between SLL and SLL2, as the kernel has no idea whether libpcap is providing SLL or SLL2 headers to its caller.

So is tp_vlan_tci_valid 1 when capturing with SLL and 0 when capturing with SLL2?

Or is it 0 in both cases? If so, is this working with SLL?

ferrieux commented 2 years ago

It is 0 in both cases, and I do get invalid packets with SLL too on my machine with wlan0 (a laptop with a wifi interface). Note that I also have invalid packets with the Debian-packaged tcpdump. The only place where "-y LINUX_SLL" works with VLANs is a server with only Ethernet interfaces. Will dig a bit.

guyharris commented 2 years ago

It is 0 in both cases, and I do get invalid packets with SLL too on my machine with wlan0 (a laptop with a wifi interface). Note that I also have invalid packets with the Debian-packaged tcpdump. The only place where "-y LINUX_SLL" works with VLANs is a server with only Ethernet interfaces.

The sk_buff structure in the kernel has a vlan_present flag, indicating that there are VLAN TPID and TCI values in the sk_buff.

The routine that sets that flag is called __vlan_hwaccel_put_tag(), and the comment for it says "hardware accelerated VLAN inserting", so that might only be used for adapters that do VLAN processing.

There are several drivers in drivers/net/ethernet that call __vlan_hwaccel_put_tag(); there are no drivers in drivers/net/wireless that do. Perhaps WiFi adapters, unlike Ethernet adapters, don't tend to do VLAN processing, leaving that up to the host software.

skb_vlan_tag_present() returns the value of that flag; the PF_PACKET code sets TP_STATUS_VLAN_VALID in the tp_status field of the tpacket header only if that flag is set.

So there is a VLAN tag to insert into the packet data only if the driver sets the tag with __vlan_hwaccel_put_tag(), and no WiFi drivers appear to do that.

The sll_protocol field of the struct sockaddr_ll provided in the tpacket header by the TPACKET code is set to the value of the protocol field of the sk_buff. That field is copied to the protocol fields of the SLL and SLL2 pseudo-headers.

For a PF_PACKET/SOCK_DGRAM header, the only sk_buff data provided to the user is the data past the "MAC header".

If the networking stack 1) sets the protocol field of the sk_buff for non-hardware-processed VLAN packets to 0x8100 rather than to the underling protocol after you get through all the VLAN tags and 2) includes the VLAN tags as part of the "MAC header" for the sk_buff, then hilarity ensues - you get a packet that claims to be a VLAN packet but that doesn't have a VLAN tag at the beginning. This Would Not Be Good.

ferrieux commented 2 years ago

Makes sense. To confirm the theory, I did two experiments:

Now (thanks to you) I understand the kernel's insistence to "set the VLAN aside" is directly connected with hardware offloading. What's not entirely clear as yet, is whether your "hilarious" scenario is real or not; that would be a Freaking Kernel Bug, right ?

ferrieux commented 2 years ago

Ahem, I am afraid you are right, through indirect proof. To investigate this, I went back to basics: recvmsg() on a AF_PACKET+SOCK_DGRAM+PACKET_AUXDATA socket.

(My assumption here, please correct me if I'm wrong, is that the old recvmsg() API, while it's no longer used by libpcap, is functionally equivalent to the faster, more modern mmap() API. Hence I expect it to give the exact same information for a given packet, showing the totality of what the kernel exposes.)

As earlier, I set up a VLAN (35=0x23) on eth0, which is amenable to VLAN offloading or not by ethtool, and pinged through the VLAN subinterface eth0.35, looking for the outgoing ARP.

I extracted the auxdata (.msg_control) and address (.msg_name aka sll struct) and dumped them below, alongside the packet body. Each packet is seen twice: once on VLAN subinterface 0c, and once on parent interface 02.

[.msg_control=.........................TCI.TPID] < .msg_name=kernel_sll_struct              (.flags..) PACKET

WITH VLAN OFFLOADING:

                                                         itfnum
                                                            |
[107:8:010000001c0000001c0000000000000000000000] < 110008060c0000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
                        no TCI nor TPID--^              ^---ethtype=0x0806=ARP
[107:8:510000001c0000001c0000000000000023000081] < 11000806020000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
                   TCI=0x23,TPID=0x8100--^              ^---ethtype=0x0806=ARP

WITHOUT VLAN OFFLOADING:
                                                         itfnum
                                                            |
[107:8:010000001c0000001c0000000000000000000000] < 110008060c0000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
                        no TCI nor TPID--^              ^---ethtype=0x0806=ARP
[107:8:010000001c0000001c0000000000000000000000] < 11008100020000000100040684a93e3ff5780000 (00000000) 000108000604000184a93e3ff5780100230100000000000001002303
                        no TCI nor TPID--^              ^---ethtype=0x8100=WTF???

As you can see, in all cases the subinterface (first packet of each pair) is handled correctly, with the proper original ethertype of the frame surfacing in the sll struct. For the parent interface (second packet of each pair) however, things are different:

As you said, it's a kernel bug, and it is hilarious.

ferrieux commented 2 years ago

Note that this happens only for outgoing packets. Below is the output for the very same ARP packet, but now incoming. In all cases (with or without hardware VLAN offloading), I get the proper information: (note the packets are swapped, the parent interface talks first this time)

                                                         itfnum
                                                            |
[107:8:510000002a0000002a0000000000000023000081] < 110008060200000001000106b827eb8a54160000 (00000000) 0001080006040001b827eb8a541601002302000000000000010023030000000000000000000000000000
                   TCI=0x23,TPID=0x8100--^              ^---ethtype=0x0806=ARP
[107:8:010000002a0000002a0000000000000000000000] < 110008060c00000001000106b827eb8a54160000 (00000000) 0001080006040001b827eb8a541601002302000000000000010023030000000000000000000000000000
                        no TCI nor TPID--^              ^---ethtype=0x0806=ARP
ferrieux commented 2 years ago

And interestingly, the ethertype of outgoing packets gets mangled in the SOCK_RAW case too:

                                                         itfnum
                                                            |
[107:8:010000002a0000002a00000000000e0000000000] < 110008060c0000000100040684a93e3ff5780000 (00000000) ffffffffffff84a93e3ff5780806000108000604000184a93e3ff5780100230100000000000001002303
                        no TCI nor TPID--^              ^---ethtype=0x806=ARP

[107:8:010000002e0000002e0000000000120000000000] < 11008100020000000100040684a93e3ff5780000 (00000000) ffffffffffff84a93e3ff578810000230806000108000604000184a93e3ff5780100230100000000000001002303
                        no TCI nor TPID--^              ^---ethtype=0x8100=WTF???

Of course, this is hardly ever a problem, as in this case (AF_PACKET+SOCK_RAW), the whole L2 frame is available for the user to re-parse.

ferrieux commented 2 years ago

Adding evidence. I dug further by using the ftrace kernel utility, to look at the early life of the skb. I walked back from packet_recvmsg (dequeueing from the raw socket) to packet_rcv (enqueueing) to dev_queue_xmit_nit (tapping).

Below you can see the usual pair of outgoing packets on a VLAN'ed non-offloading interface wlan0. The important line says proto= :

A stack trace is printed before each such event:

$ ftrace -stack dev_queue_xmit_nit = 'skb=%bp'  'proto=+0xb0(%bp):x16'  k:dev_queue_xmit_nit+311
...

FIRST PACKET : SUBINTERFACE

          <idle>-0       [003] .Ns. 188379.276689: dev_queue_xmit_nit+0x0/0x2a0 <-dev_hard_start_xmit+0x64/0x1e0
          <idle>-0       [003] .Ns. 188379.276718: <stack trace>
 => 0xffffffffc054006a
 => dev_queue_xmit_nit+0x5/0x2a0
 => dev_hard_start_xmit+0x64/0x1e0
 => __dev_queue_xmit+0x6aa/0x9b0
 => arp_solicit+0xf2/0x300
 => neigh_probe+0x4c/0x60
 => neigh_timer_handler+0x203/0x2e0
 => call_timer_fn+0x29/0xf0
 => __run_timers.part.0+0x1d3/0x240
 => run_timer_softirq+0x26/0x50
 => __do_softirq+0xc5/0x275
 => asm_call_irq_on_stack+0x12/0x20
 => do_softirq_own_stack+0x37/0x40
 => irq_exit_rcu+0x8e/0xc0
 => sysvec_apic_timer_interrupt+0x36/0x80
 => asm_sysvec_apic_timer_interrupt+0x12/0x20
 => cpuidle_enter_state+0xc7/0x350
 => cpuidle_enter+0x29/0x40
 => do_idle+0x1ef/0x2b0
 => cpu_startup_entry+0x19/0x20
 => secondary_startup_64_no_verify+0xb0/0xbb
          <idle>-0       [003] .Ns. 188379.276734: kprobe_dev_queue_xmit_nit: (dev_queue_xmit_nit+0x137/0x2a0) skb=0xffff95284b126b00 proto=0x608

SECOND PACKET: PARENT INTERFACE

          <idle>-0       [003] .Ns. 188379.276751: dev_queue_xmit_nit+0x0/0x2a0 <-dev_hard_start_xmit+0x64/0x1e0
          <idle>-0       [003] .Ns. 188379.276790: <stack trace>
 => 0xffffffffc054006a
 => dev_queue_xmit_nit+0x5/0x2a0
 => dev_hard_start_xmit+0x64/0x1e0
 => __dev_queue_xmit+0x6aa/0x9b0
 => vlan_dev_hard_start_xmit+0x4a/0xf0 [8021q]
 => dev_hard_start_xmit+0xc7/0x1e0
 => __dev_queue_xmit+0x6aa/0x9b0
 => arp_solicit+0xf2/0x300
 => neigh_probe+0x4c/0x60
 => neigh_timer_handler+0x203/0x2e0
 => call_timer_fn+0x29/0xf0
 => __run_timers.part.0+0x1d3/0x240
 => run_timer_softirq+0x26/0x50
 => __do_softirq+0xc5/0x275
 => asm_call_irq_on_stack+0x12/0x20
 => do_softirq_own_stack+0x37/0x40
 => irq_exit_rcu+0x8e/0xc0
 => sysvec_apic_timer_interrupt+0x36/0x80
 => asm_sysvec_apic_timer_interrupt+0x12/0x20
 => cpuidle_enter_state+0xc7/0x350
 => cpuidle_enter+0x29/0x40
 => do_idle+0x1ef/0x2b0
 => cpu_startup_entry+0x19/0x20
 => secondary_startup_64_no_verify+0xb0/0xbb
          <idle>-0       [003] .Ns. 188379.276799: kprobe_dev_queue_xmit_nit: (dev_queue_xmit_nit+0x137/0x2a0) skb=0xffff95284b126200 proto=0x81
ferrieux commented 2 years ago

Bad news: it looks like this bug has been in Linux since ... the Big Bang :( At least, I've verified its presence on 3.10 !

ferrieux commented 2 years ago

Note: I reported this to the netdev mailing list: https://marc.info/?l=linux-netdev&m=165074467517201&w=4

wdebruij commented 4 months ago

This is being discussed in Linux netdev again.

https://lore.kernel.org/netdev/20240520070348.26725-1-chengen.du@canonical.com/

Sorry to have missed your earlier report, Alexandre. And I thought about Cc:ing you only just after hitting send. Heads-up this way.

ferrieux commented 4 months ago

Heh. Thanks for the heads-up Willem. Following up there.