p4lang / p4app-switchML

Switch ML Application
https://switchml.readthedocs.io/
Apache License 2.0
173 stars 48 forks source link

Port traffic redirection with logic bypass #10

Open kfertakis opened 2 years ago

kfertakis commented 2 years ago

Hi,

I would like to benchmark SwitchML on an emulated topology bigger than that of a single rack. To do so, I plan to reuse the same Tofino switch(APS Networks BF2556X-1T) with SwitchML deployed on it and redirect all traffic coming in from a specific port into another port without any processing done. That port would then be physically connected to yet another port which will consume the traffic regularly via the SwitchML logic.

I was wondering whether this is something that is already possible with the existing P4 model and SwitchML controller program? If not, is there a way I could modify the Forwarder table in forwarder.p4 (populated via the forwarder.py on the control plane) to accommodate this functionality without messing the underlying SwitchML logic? Practically, I need the traffic from that specific port to bypass the SwitchML aggregation logic in the switch and just be redirected out of another port (and vice versa for the return journey). From what I understand, currently, the forwarder table matches on ethernet addresses. Could this be somehow extended to accommodate the case without jeopardizing the underlying functionality of the P4 program?

Thank you,

AmedeoSapio commented 2 years ago

Hi, the current code doesn't support this, but I think it would be a useful feature. A user might want to disable SwitchML processing on some specific port as you suggest, and this should be achieved by setting a parameter in the controller (like with an additional CLI command).

I think that this can be easily done with a change to the parser logic (I think that we cannot do this in the control block, because it would add a dependency that would make the program not fit in the 12 stages). Regular (not-to-be-processed) packets are marked in the parser with packet_type_t.IGNORE;

https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/p4/parsers.p4#L204-L209

and this flag makes the packet skip the SwitchML processing logic and go to the forwarder table that forwards regular traffic using the dst mac address.

You cannot achieve what you ask by simply changing the forwarder table, because you need first to skip the processing logic to reach that table. If you look at the main Ingress control block https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/p4/switchml.p4#L137-L138

you can see that the forwarder table, which is in charge of forwarding regular traffic, is only applied for packet types that are not >= packet_type_t.CONSUME0 (packet_type_t.IGNORE is less than packet_type_t.CONSUME0). The packet_type definition is here https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/p4/types.p4#L142-L147

The parser code is marking as packet_type_t.CONSUME0 (= to be processed) all the RDMA UC packets and UDP packets with dst port in the interval [0xbee0, 0xbeef]. So we need to modify the parser to mark packets from the chosen port as IGNORE, even if they are of these types. We still need to extract the ethernet header though, to use the dst mac address in the forwarder table.

The way that I think is best suited to do this is to use port metadata, which is a feature available in the parser. There is one example on how to use the port metadata feature in the Open-Tofino repo here.

Each ingress hardware port can be associated with a small amount (64 bits) of static data that is prepended to each packet in the ingress pipeline. These metadata can be set by the controller for each port differently. The port metadata header is the first header that the parser sees in a packet, even before ethernet.

The controller could set a "processing type" in the port metadata for each port (something like SKIP, PROCESS, or RECIRCULATE). Then the parse_port_metadata parser can select the next parser state using this value (RECIRCULATE ➝ parse_recirculate, else ➝ parse_ethernet). Then the parse_ethernet stage can use this value, and if it is SKIP, transition to accept_regular, which would make it skip the processing logic and reach the forwarder table.

This solution is also the one to use to avoid hardcoding the ports to use for recirculation, like the parse_port_metadata parser is doing now. https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/p4/parsers.p4#L50-L58 This would allow to use any port for recirculation (selected at runtime by the controller), and would allow to have a folded pipe option also on 2-pipes switches like yours.

Currently, the port metadata is only used to set 2x16 bits values for the drop simulator (used to randomly drop a chosen percentage of the packets, to test the retransmission logic). These values are set by the controller, per port, in drop_simulator.py.

https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/controller/drop_simulator.py#L28

We can use up to 64 bits for the port metadata, so there is plenty of space to add the flag we need.

kfertakis commented 2 years ago

Hi,

Thank you very much for the detailed response and for providing an overview of the proposed solution.

I understand that two modifications are needed in order to support the said case. First, we need to modify the ingress parser in order to skip the processing logic for all packets coming from a specific port. I understand that the method you explained of utilizing the port metadata available at the parse, would be the best way of doing this giving the ability to the controller to dynamically select the port that its traffic would skip the processing logic. However, in order to test out my setup until such a full flex feature is implemented, could I just test it manually by including something like this

state parse_port_metadata {
        // parse port metadata
        ig_md.port_metadata = port_metadata_unpack<port_metadata_t>(pkt);

        transition select(ig_intr_md.ingress_port) {
            45: parse_ethernet; // custom port for bypassing SwitchML logic
            64: parse_recirculate; // pipe 0 CPU Eth port
            68: parse_recirculate; // pipe 0 recirc port
            320: parse_ethernet;   // pipe 2 CPU PCIe port
            0x080 &&& 0x180: parse_recirculate; // all pipe 1 ports
            0x100 &&& 0x180: parse_recirculate; // all pipe 2 ports
            0x180 &&& 0x180: parse_recirculate; // all pipe 3 ports
            default:  parse_ethernet;
        }
    }

in the parse_port_metadata state where for example I would be statically setting the device port 45 for bypassing the regular processing by transitioning to parse_ethernet as the next stage which already includes a default action for accept_regular. Would something like this work?

The second necessary modification would then be to change the forwarder table in order to match the ingress port besides the destination mac address and populate it with a rule that will push traffic from the ingress port 45 I used in my example previously, to another specific port which I'll be using for forwarding the packets out of the switch. This should be possible, right?

Thank you again,

AmedeoSapio commented 2 years ago

Even without the change to the parse_port_metadata parser that you propose, your packet will still transition to parse_ethernet. In fact, the original code makes all the packets with ingress ports in pipe 0 (which are [0-63]) transition to the default choice, which is parse_ethernet. We need to parse the ethernet header, whether we want to process the packet or not, because the forwarder table needs the ethernet dst address. The only case where we don't parse the ethernet header is when we recirculate the packet because we use our custom header to carry metadata (this header is never sent outside the switch).

I think that the change that you need is to modify the transition out of the the parse_ethernet parser, making it something like this:

    state parse_ethernet {
        pkt.extract(hdr.ethernet);
        transition select(hdr.ethernet.ether_type, ig_intr_md.ingress_port) {
            (_,  45) : accept_regular;
            (ETHERTYPE_ARP,  _) : parse_arp;
            (ETHERTYPE_IPV4, _) : parse_ipv4;
            default : accept_regular;
        }
    }

This would transition to accept_regular if the ingress port is 45, whatever the ethertype is, which would make the packet skip the processing. The other 2 options use the ethertype to proceed with the parsing logic, which, depending on the packet format, can reach the case where we set the packet type to CONSUME0 to mark the packet as to-be-processed.

And finally, yes, you need to modify the forwarder table, to match on the ingress port and set the egress port accordingly. This change requires to change also the controller code to add the ingress port to the set of match values here:

https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/controller/forwarder.py#L81-L87

kfertakis commented 2 years ago

Thanks again for the input.

I have indeed modified the parse_ethernet parser as you suggested. However, I'm having a bit of difficulty setting the forwarder table right. I have modified the table as follows:

table forward {
        key = {
            hdr.ethernet.dst_addr : exact;
            ig_intr_md.ingress_port : exact;
        }
        actions = {
            set_egress_port;
            flood;
        }
        size = forwarding_table_size;
    }

in order to be able to match to the ingress port beside the dst address. However, I'm not sure how to properly configure the table. When I'm populating the table through the controller with a simple rule for forwarding from one port to another with the following:

 self.table.entry_add(self.target, [
            self.table.make_key(
                [self.gc.KeyTuple('ig_intr_md.ingress_port', ingress_dev_port)])
        ], [
            self.table.make_data([self.gc.DataTuple('egress_port', egress_dev_port)],
                                 'Ingress.forwarder.set_egress_port')
        ])

I end up with a rule being written in the table (along with the regular entries the controller populates the forwarder table) which looks like this:

Entry 1:
Entry key:
    hdr.ethernet.dst_addr          : 0x000000000000
    ig_intr_md.ingress_port        : 0x2D
Entry data (action : Ingress.forwarder.set_egress_port):
    egress_port                    : 0x91

Doing a simple ping from the device on the port in question doesn't seem to be redirecting the traffic via the port looping (the frame counter for these ports remain at 0 and I have enabled the relevant ports from the controller) which suggest the rule is not triggered.

I was thinking if I could insert a rule with a wildcard (like *) for the dst_addr part of the key when it matches to the ports but I'm not sure if this is possible. I also cannot set the matching method for both keys to lpm. Any ideas would be welcomed. Thank you.

AmedeoSapio commented 2 years ago

Hi, in your code to add the rule you are missing the value of the dst address you want to match on, so the rule is getting a default value of all zeros, that won't match any packet. This is how you can add a rule with both match fields (replace MAC_ADDRESS with the value you need):

        self.table.entry_add(self.target, [
            self.table.make_key([
                self.gc.KeyTuple('hdr.ethernet.dst_addr', MAC_ADDRESS),
                self.gc.KeyTuple('ig_intr_md.ingress_port', ingress_dev_port)])
        ], [
            self.table.make_data([self.gc.DataTuple('egress_port', egress_dev_port)],
                                 'Ingress.forwarder.set_egress_port')
        ])

To add a rule with a wildcard, you can use ternary match, like this table does: https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/p4/arp_icmp_responder.p4#L66-L80

and this is how you add an entry. https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/controller/arp_icmp_responder.py#L61-L96 A ternary match requires both a value and a mask. Also, when there is a ternary match, you can have overlapping rules (a packet can match more than one rule), so there is an additional priority field to specify which rule should be applied first (0 is the highest priority).

There is also an example about ternary match in the open-tofino repo here.

kfertakis commented 2 years ago

Hi,

Thanks again for the comments and support on this.

I have indeed changed the forwarding table to a ternary match on the keys as shown below:

table forward {
        key = {
            hdr.ethernet.dst_addr : ternary;
            ig_intr_md.ingress_port : ternary;
        }
        actions = {
            set_egress_port;
            flood;
        }
        size = forwarding_table_size;
    }

for which I'm populating an entry for enabling the forwarding of all traffic from ingress_dev_port port to egress_dev_port like so:

self.table.entry_add(
            self.target,
            [
                self.table.make_key([
                    self.gc.KeyTuple('$MATCH_PRIORITY', 0),
                    self.gc.KeyTuple(
                        'hdr.ethernet.dst_addr',  # 48 bits
                        0x000000000000,  # dst_addr mac
                        0x000000000000),
                    self.gc.KeyTuple(
                        'ig_intr_md.ingress_port',  # port
                        ingress_dev_port,
                        0xff)
                ])
            ],
            [
               self.table.make_data([self.gc.DataTuple('egress_port', egress_dev_port)],
                                 'Ingress.forwarder.set_egress_port')
            ])

and then the normal destination mac forwarding rules are added like this:

self.table.entry_add(self.target, [
            self.table.make_key([
                self.gc.KeyTuple('$MATCH_PRIORITY', 1),
                self.gc.KeyTuple('hdr.ethernet.dst_addr', mac_address, 0xffffffffffff),
                self.gc.KeyTuple('ig_intr_md.ingress_port', 0x00, 0x00)
            ])
        ], [
            self.table.make_data([self.gc.DataTuple('egress_port', dev_port)],
                                 'Ingress.forwarder.set_egress_port')
        ])

I am trying to validate the configuration( emulating the bigger topology by loopbacking two ports with an external link and forwarding all traffic from one port via the external link) with the simplest example possible by pinging one machine from another via the port looping with the external link. I'm changing the ports.yaml config in order to configure the switch to use the end of the loopback link as the port of the target machine. However, although traffic does seem to be redirected via the configured loopback link, ping is not successfull. Specifically, I'm commenting out the flood rules configured by the control plane here:

        # Add ports to flood multicast group
        rids_and_ports = [
            (self.all_ports_initial_rid + dp, dp) for dp in fib.keys()
        ]
        success, error_msg = self.pre.add_multicast_nodes(
            self.all_ports_mgid, rids_and_ports)
        if not success:
            return (False, error_msg)

        for r, p in rids_and_ports:
            self.multicast_groups[self.all_ports_mgid][r] = p

in order to track things more clearly through the packet counters of BFRT and what I see is that on one direction of the ping the packet goes through the loopback link but then isn't forwarded to the target's port and on the other way, the packet does seem to be redirected through all hops correctly and exit through the target host's port but is not detected via a regular tcpdump. I have validated the above by running the same model and configuration on the Tofino Model emulator provided by Intel's P4 SDE and seems to work fine meaning ICMP echo request packet delivered on both directions.

Is there something obviously wrong with my approach ? Could there be some other part of SwitchML's functionality (e.g., the icmp_arp_responder) affecting this ? Any suggestion on how to debug this would be appreciated, thanks.

AmedeoSapio commented 2 years ago

My first guess is that the first packet that goes out when you do a ping is an arp packet. That packet will have as destination mac address the broadcast address FF:FF:FF:FF:FF:FF that doesn't match any rule and is supposed to be flooded on all ports using the flood default rule and the multicast group you commented. Try adding the arp rule manually first so that an arp resolution is not needed.

The icmp_arp_responder control is used to answer arp and ping requests addressed directly to the switch, using the MAC/IP addresses that you provide to the controller with the parameters --switch-mac and --switch-ip. I don't think this is affecting your test.

The reason why you don't see the packet with tcpdump could be that the packet format is wrong and dropped by the NIC, or maybe a checksum is incorrect. The best way to debug this is to use the snapshot feature in Tofino, that allows to see how a packet is processed by your P4 program (similar to the tofino model log, but more low level). With a snapshot you can see what rule a packet matched, and what headers are valid when the packet is going out (to check that the format is correct).

kfertakis commented 2 years ago

Thanks again for the comments.

I've managed to identify the issue that was causing the unexpected packet processing behavior. In SwitchML's parser:

https://github.com/p4lang/p4app-switchML/blob/7ded628eb31a86e80ad0bf65e72a3e94e0e13d05/dev_root/p4/parsers.p4#L46-L59

all pipe 1,2,3 ports are considered to be recirculation which follows a specific parsing logic, extracting some SwitchML specific headers. I realized that the extra link I was using to emulate the bigger topology was connected to a pair of front panel ports corresponding to pipe 3, thus all traffic coming in from those ports was redirected to parse_recirculate state and thus was not parsing the packets correctly. Once I brought all the links to ports connected to pipe 1, everything seems to work fine. My switch (bf2556x-1t) has only two pipes shown as 1 and 3 so I suspect pipe 1 corresponds to pipe 0.

Does this mean that we are meant to be using only pipe 1 ports for testing and the rest are used by the SwitchML logic inside the switch?

Thank you,

AmedeoSapio commented 2 years ago

Hi, yes, all that you said is correct. In a 2-pipe switch like yours, there are 2 pipes that sometimes are called 1 and 3 (like in the P/PT column of pm show), and sometimes are called 0 and 1 (this is the dataplane number that we use in the P4 program).

In the P4 program a port number is a 9 bit number (the number in the D_P column of pm show) and the 2 most significant bits are the pipe dataplane number. So, the rule for the number to use in the P4 program (like in the parser code that you linked) is that ports [0,127] are in pipe 0, and ports [128,255] are in pipe 1.

Currently, our program is hardcoding that ports in the pipes 1,2,3 are considered in loopback. I mentioned it in the first answer, and until we make the changes I described, we can only use the ports in pipe 0 to connect to servers. Does this work for you?