Closed · SanaaComp closed this issue 4 years ago
S1 and S2 are simple_switch processes? If so, achieving 20 gigabits/sec sounds impossibly high, and I would question how you reached that conclusion.
One possible way that might help reduce the maximum rate is the `set_queue_rate` command in the `simple_switch_CLI` program, which can connect to at most one simple_switch or simple_switch_grpc process at a time (or at least that is the only way I have ever used it). Running `help set_queue_rate` shows the following help message:
```
Set rate of one / all egress queue(s): set_queue_rate <rate_pps> [<egress_port>]
```
I have never tested this technique myself to see whether it has the intended effect.
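For reference, an invocation might look like the following. This is an untested sketch: the 1000 packets/sec value is a placeholder, and port 9090 stands in for whichever Thrift port the target switch listens on.

```shell
# Untested sketch: pipe the CLI command into simple_switch_CLI for the
# switch whose Thrift server listens on port 9090; 1000 pps is a placeholder.
echo "set_queue_rate 1000" | simple_switch_CLI --thrift-port 9090
```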
I am sure there are other ways.
```
mininet> iperf s1 s2
*** Iperf: testing TCP bandwidth between s1 and s2
*** Results: ['21.9 Gbits/sec', '21.9 Gbits/sec']
```
Using `set_queue_rate` did nothing.
I tried the `iperf` command within mininet on my system and also see answers in units of Gbits/sec. I really, sincerely doubt that these measurements are accurate. I do not know how the `iperf` within mininet works, but this throughput is so large, and the CPU utilization on my system so low, that it seems impossible that the reported throughput value could be correct. I would not believe it until someone provides extremely strong evidence that it is correct.
Here is strong evidence that whatever the mininet `iperf` command is doing, it is not sending packets that get processed by the simple_switch processes that execute your P4 code:
In a separate terminal, look in the `logs` directory, which, at least for the `exercises/basic` tutorial exercise I tried with, contains files like `s1.log`, `s2.log`, etc., one for each simple_switch process, with names that correspond to the names of the switches in the mininet topology. The contents of those files show detailed tracing information about how the P4 program processed every packet received by the corresponding switch.
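A convenient way to watch one of those trace files grow in real time is `tail -f` (a sketch; the `logs/s1.log` path follows the exercises/basic layout and may differ in your setup):

```shell
# Follow switch s1's trace log as new packets are processed;
# interrupt with Ctrl-C when done.
tail -f logs/s1.log
```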
When I run `h1 ping h2`, I see new trace lines added to the file `s1.log` (again, my example is for the exercises/basic exercise -- your topology of hosts and switches may differ, but some file should be appended to when a switch processes packets).
When I run `iperf s1 s2`, none of those files are updated. So no packets are being processed by any of the simple_switch processes.
For the few seconds that `iperf s1 s2` is running, I type a command in a separate terminal window to list the iperf processes. The output shows these:

```
$ ps axguwww | grep iperf
andy 6823 0.0 0.1 28428 4052 pts/11 S+ 12:01 0:00 man iperf
root 7049 53.2 0.0 313884 3012 pts/7 Sl+ 12:20 0:02 iperf -p 5001 -s
root 7056 86.7 0.0 166420 1948 pts/6 Sl+ 12:20 0:03 iperf -p 5001 -t 5 -c 127.0.0.1
andy 7061 0.0 0.0 21532 1096 pts/10 S+ 12:20 0:00 grep --color=auto iperf
```
The one with the command line `iperf -p 5001 -t 5 -c 127.0.0.1` has the option `-c 127.0.0.1`. 127.0.0.1 is the IPv4 loopback address, meaning "the IPv4 address assigned to me, whatever system the address is resolved on". I strongly suspect that these `iperf` commands are sending packets to each other not through the mininet emulated network, but directly through the host operating system's loopback interface, which simply copies the packet from one process to another through the OS kernel.
I think you should be using the mininet `iperf` command between hosts, not switches, e.g.:

```
mininet> iperf h1 h2
```
When I try that on the exercises/basic tutorial, I see many packets being traced in the s1.log file for switch s1, which in that exercise's topology is the only switch on the path from host h1 to h2.
When I did that, I saw performance results of about 19 Mbits/sec, which I can much more readily believe than anything in units of Gbits/sec.
Thanks for clarifying.
I re-opened this issue.
I created a simple network with three hosts and two switches. After running `iperf`, the performance results were about 700 Mbits/sec. But the bw in `exercises/basic` between h1 and h2 is about 10 Mbits/sec.
Is it because of the P4?
Short answer: yes.
Long answer:
Yes, but the precise performance you see right now is not the fastest that the simple_switch process can do. It has multiple options it can be compiled with, and multiple choices for options on its command line when it is started, that can affect its performance significantly.
Several ways are described here: https://github.com/p4lang/behavioral-model/blob/master/docs/performance.md
Another way is to run simple_switch with neither the `--log-console` nor `--log-file` command line options; one of those is enabled by default when doing `make run` in one of the tutorials exercises. Enabling those options is what produces detailed trace logs of processing packets, which can make debugging an incorrect P4 program much more straightforward, but also slows down packet processing.
How can I run `exercises/ecn` without logs?
How much Python do you know, or are you willing to learn? It doesn't take very much Python knowledge, plus a willingness to dig in and add debug print statements (or whatever your favorite techniques are for learning how an existing program works), to figure out where in the Python code included in the tutorials repository the simple_switch process is started. You can start by grep'ing all of the files for occurrences of simple_switch, then focus on the occurrences in Python source files. You would need to change that code (or find out that perhaps there is already an existing option in the Python code) to start the simple_switch processes without the `--log-file some-file-name-here` option.
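As a concrete starting point, the grep step might look like this (a sketch; the `~/tutorials` checkout path is an assumption about where the repository lives on your machine):

```shell
# Recursively list every Python line mentioning simple_switch, with file
# name and line number, to find where the switch command line is built.
grep -rn --include='*.py' 'simple_switch' ~/tutorials/
```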
Some more details on how you can go about this.
I know that when running `make run` in one of the tutorials exercises directories, it runs one simple_switch or simple_switch_grpc process for each switch in the network being emulated. So I go to the exercises/ecn directory, run `make run`, and after it starts up, switch to a different terminal window and type a command that shows all processes running on the system, with their full command line options. There are multiple such commands, but one is `ps axguwww`. That shows all processes, so I narrow the output down to only the lines containing simple_switch using `grep`, e.g.:

```
ps axguwww | grep simple_switch
```
Here are 3 of the lines of output I see when I run that on my system:
```
root 32633 0.5 0.9 1419860 40028 pts/6 Sl+ 11:32 0:03 simple_switch_grpc -i 1@s1-eth1 -i 2@s1-eth2 -i 3@s1-eth3 -i 4@s1-eth4 --pcap /home/andy/tutorials/master/exercises/ecn/pcaps --nanolog ipc:///tmp/bm-0-log.ipc --device-id 0 build/ecn.json --log-console --thrift-port 9090 -- --grpc-server-addr 0.0.0.0:50051
root 32650 0.7 0.9 1419860 40296 pts/7 Sl+ 11:32 0:03 simple_switch_grpc -i 1@s2-eth1 -i 2@s2-eth2 -i 3@s2-eth3 -i 4@s2-eth4 --pcap /home/andy/tutorials/master/exercises/ecn/pcaps --nanolog ipc:///tmp/bm-1-log.ipc --device-id 1 build/ecn.json --log-console --thrift-port 9091 -- --grpc-server-addr 0.0.0.0:50052
root 32675 0.1 0.9 1350480 36816 pts/8 Sl+ 11:32 0:00 simple_switch_grpc -i 1@s3-eth1 -i 2@s3-eth2 -i 3@s3-eth3 --pcap /home/andy/tutorials/master/exercises/ecn/pcaps --nanolog ipc:///tmp/bm-2-log.ipc --device-id 2 build/ecn.json --log-console --thrift-port 9092 -- --grpc-server-addr 0.0.0.0:50053
```
3 simple_switch_grpc processes, one for each of the 3 switches in the exercises/ecn network topology.
The `--log-console` option causes simple_switch_grpc to print out detailed trace information while processing each packet. You want to stop using that option.
Let us look at other command line options that might be slowing things down, while we are at it.
`--pcap /home/andy/tutorials/master/exercises/ecn/pcaps` causes simple_switch_grpc to write every packet received or sent on every port to a file. You want to stop using that option, too.
I doubt you need the `--nanolog ipc:///tmp/bm-1-log.ipc` option except for extra kinds of debug tracing. You can try removing it, and if anything goes wrong, put it back.
Command line options you need to keep, and why:
`-i 1@s3-eth1 -i 2@s3-eth2 -i 3@s3-eth3` defines which ports exist on the software switch. You need those to be there, or your simple_switch_grpc switches won't have any ports to send and receive packets on.
`--device-id 1 build/ecn.json`: `--device-id` gives each of the switches a unique id, I think so that the P4Runtime API controller software can send commands to one of them vs. another. In any case, as far as I know it should not be causing packets to be processed more slowly, so keep it. `build/ecn.json` is the compiled version of your P4 program, output by the p4c compiler and read by simple_switch_grpc. You need that.
`--thrift-port 9092 -- --grpc-server-addr 0.0.0.0:50053`: `--thrift-port 9092` tells the process what TCP port to listen on for incoming controller connections using the Thrift API. Keep that. Similarly for `--grpc-server-addr`, except that one listens for incoming controller connections using the P4Runtime API. Keep that, too.
Note that the bandwidth is also limited to what is specified in the topology.json file. If you run a command like `top` while running your `iperf` tests, and see that the CPU usage of the simple_switch_grpc processes never goes above a few percent of one CPU core, as I see with the default contents of the topology.json file, the limiting factor is not simple_switch_grpc speed, but the link rate specified in the topology.json file.
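A quick way to sample just the switch processes' CPU usage during a test (a sketch assuming procps `top`; `-b` is batch mode, `-n 1` takes one snapshot):

```shell
# Take one snapshot of all processes and keep only the simple_switch lines.
top -b -n 1 | grep simple_switch
```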
You can add other arguments to Bmv2 on this line of the Makefile for the `ecn` exercise.
@jafingerhut Thank you for your very detailed answers. I managed to disable the log and pcap. The bw increased, but only by about 1 Mbits/sec. Is this the maximum performance?
> You can add other arguments to Bmv2 on this line of the Makefile for the `ecn` exercise.
Yes I can. But which argument can disable log and pcap? After all, I managed to disable them.
Did you increase the bandwidth limit number in the topology.json file? What value is it set to?
Did you run `top` while doing an iperf test to see how much CPU the simple_switch_grpc processes were using during that time? How high did it go?
Basically, my earlier short answer "yes" was wrong. There are multiple things that can limit the bandwidth here besides simple_switch_grpc -- such as the link rate assigned to the link between two switches on the path from host h1 to h2, which was put there in the ECN exercise specifically to make the bit rate allowed there lower than what simple_switch_grpc can support, so that queues would build up and some packets would be ECN marked.
If you make it so that simple_switch_grpc is the limiting factor on packet rate, then no queues will build up there, and it will never mark packets as having experienced congestion.
Related to the same issue: I have increased the BW in the topology.json file, but the iperf result always gives the same constant rate between hosts (~30 Mb/s). It works only if I lower the rate to 20M or 10M. I used the queue rate command on simple_switch_grpc, but no change. Can you help on this point?
I have run some simple experiments. I am not saying these are the point of the ECN exercise, and others will likely get different results, because the precise performance depends upon your CPU model, clock speed, etc.: simple_switch is doing software forwarding on a general purpose CPU. I have a mid-2014 MacBook Pro, with p4c and simple_switch built from recent versions of the source code for those projects, running an Ubuntu 18.04 Desktop Linux OS inside VirtualBox 6.0.16. I did not try to disable simple_switch logging or pcap file recording of packets in these experiments.
Steps I took, and things I measured:

- Run `iperf h1 h2`. Observe the changing values in the %CPU column in the window running `top` for the two simple_switch processes that are forwarding the packets, and see how high those numbers get. They go back down shortly after `iperf` is finished, since then the packet flow stops. Record a range of those %CPU numbers for that experiment. Also record the two rate numbers reported in the output of the `iperf` command.
- Edit the link rate value in topology.json and do `make run` again. Do 3 more `iperf h1 h2` commands and record the measurements described above.

In the topology.json file, the link rate value is the "0.5" in this definition of the link: `["s1-p3", "s2-p3", "0", 0.5]`. I did measurements for all of the values shown in the table below.
Glossary:

- L: link rate value configured in topology.json
- U (%): range of %CPU observed for the simple_switch processes
- R1, R2: the two rate numbers reported by `iperf`

```
L     U (%)      R1               R2
0.5     6 - 8    480 Kbits/sec    651 Kbits/sec
0.5     6 - 7    480 Kbits/sec    692 Kbits/sec
0.5     5 - 7    480 Kbits/sec    650 Kbits/sec
1.0     8 - 12   958 Kbits/sec    1.18 Mbits/sec
1.0     9 - 13   958 Kbits/sec    1.20 Mbits/sec
1.0     8 - 13   958 Kbits/sec    1.18 Mbits/sec
2.0    20 - 24   1.90 Mbits/sec   2.25 Mbits/sec
2.0    16 - 23   1.91 Mbits/sec   2.18 Mbits/sec
2.0    18 - 23   1.87 Mbits/sec   2.15 Mbits/sec
4.0    29 - 34   3.33 Mbits/sec   3.75 Mbits/sec
4.0    35 - 38   3.78 Mbits/sec   4.28 Mbits/sec
4.0    29 - 34   3.44 Mbits/sec   3.80 Mbits/sec
8.0    58 - 58   7.55 Mbits/sec   8.12 Mbits/sec
8.0    52 - 52   6.08 Mbits/sec   6.61 Mbits/sec
8.0    42 - 42   6.28 Mbits/sec   6.69 Mbits/sec
16.0  117 - 119  15.2 Mbits/sec   15.9 Mbits/sec
16.0  120 - 130  15.0 Mbits/sec   15.6 Mbits/sec
16.0   97 - 100  13.2 Mbits/sec   14.2 Mbits/sec
32.0   89 - 91   11.1 Mbits/sec   11.5 Mbits/sec
32.0  133 - 134  15.9 Mbits/sec   16.6 Mbits/sec
32.0  125 - 125  14.7 Mbits/sec   16.7 Mbits/sec
```
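To double-check which link entries in topology.json carry a rate limit, the 4-element link form can be listed with a short script. This is a sketch assuming the field order is [node1, node2, delay, bandwidth-in-Mbps], as in the `["s1-p3", "s2-p3", "0", 0.5]` entry quoted above:

```shell
# Print every bandwidth-limited link found in ./topology.json.
python3 - <<'EOF'
import json
topo = json.load(open("topology.json"))
for link in topo["links"]:
    if len(link) == 4:
        node1, node2, delay, bw = link
        print(f"{node1} <-> {node2}: bw = {bw} Mbits/sec")
EOF
```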
Things I do not know, and have not found out yet, but would be useful to answer:
Even without answering those questions, it seems clear that the link rate is the limiting factor on my system up to a value of about 16, but beyond that the limiting factor appears to be the packet forwarding performance of simple_switch. The link rate value at which it stops being the limiting factor on your system will likely be different than on mine.
Could I have adjusted other parameters and recorded results, such as the queue rate command configuration you mentioned? Certainly. Could I have measured additional things? Sure, but I was going here for a "quick" experiment that demonstrates at least one way to make the performance change: modifying the link rate. It also demonstrates that the %CPU of the simple_switch process is a useful thing to measure and record, because if it is significantly less than 100%, then simple_switch is not the limiting factor in performance, so something else must be.
I would recommend that if you have questions, you record at least this much data about your experiments: what you changed, and what you did not change. This is pretty standard lab-book recording of scientific experiment results. Without being careful and methodical in recording what you tried, it is not easy to see for yourself what you have done, and when you ask others for help, it is easy to leave out details of what you did.
> Did you increase the bandwidth limit number in the topology.json file? What value is it set to?
>
> Did you run `top` while doing an iperf test to see how much CPU the simple_switch_grpc processes were using during that time? How high did it go?
>
> Basically, my earlier short answer "yes" was wrong. There are multiple things that can limit the bandwidth here besides simple_switch_grpc -- such as the link rate assigned to the link between two switches on the path from host h1 to h2, which was put there in the ECN exercise specifically to make the bit rate allowed there lower than what simple_switch_grpc can support, so that queues would build up and some packets would be ECN marked.
>
> If you make it so that simple_switch_grpc is the limiting factor on packet rate, then no queues will build up there, and it will never mark packets as having experienced congestion.
I set it to 30. And `top` shows the CPU utilization of the switch while running `iperf` is about 42%.
After making tests like yours, I noticed changing the link rate value affects neither the CPU utilization nor the BW. Does this mean it is a problem with my VM?
I wrote my own topology using Python (so I can run a .p4 program without the Makefile in `exercises`).
My topology consists of three hosts and one switch. After running `basic.p4` with my simple network, I used `iperf` to measure the BW. The result is 30 Mbits/sec. Using `iperf` with `exercises/basic` gives me a BW of about 7 Mbits/sec.
I disabled logs and pcap, but still no difference.
One note that may help in understanding the problem: I tried to run the same program on two different PCs with different specs. With the same configuration in topology.json, the PC with the faster processor gives an iperf BW of 30M, and the slower one gives almost 20M.
@SanaaComp "After making tests like yours, I noticed changing the link rate value neither effects the CPU utilization nor the BW. Dose this mean it is a problem with my VM?"
"like yours". I don't see a detailed writeup of exactly what you did. If by "like mine", you mean you followed the same set of steps I did in my experimental results, that is one thing, but do you know whether in your experiments that packets are even going across the link that has the bandwidth limit on them? If you are sending traffic between two hosts, using iperf, that does not even cross the link with the rate limit configured on it, then changing that rate limit will not have any effect.
@SanaaComp When you change the rate limit, do you also quit mininet, type `make stop`, then `make run` again? If you do not, then the changes to the topology.json file will not be read again, so they have no effect.
> @SanaaComp When you change the rate limit, do you also quit mininet, type `make stop`, then `make run` again? If you do not, then the changes to the topology.json file will not be read again, so have no effect.
Thank you!
@jafingerhut In a single-switch topology with 4 hosts I faced the BW limit problem, and any change in the topology.json file takes effect only if the BW is less than 30 Mb/s. I also stop and re-run the make file after each change, with no hope.
@afaheemp4 "with no hope"? While there's life, there is hope! It sounds like you are able to observe some kind of change in behavior if you configure the bandwidth number in the topology.json file less than 30? Did you make a table of experiments you tried that included what you varied, and what you measured, like the one I demonstrated above? If not, then how do you remember what you did?
@jafingerhut I am using a single switch topology attached to 4 hosts, and I am running the iperf command between h1 and h2. I got the results below:

| Configured link rate | iperf measured rate | CPU utilization |
|---|---|---|
| 10 Mb/s | ~9.8 Mb/s | ~59% |
| 25 Mb/s | ~24.5 Mb/s | ~118% |
| 50 Mb/s | ~32.2 Mb/s | ~166% |
| 90 Mb/s | ~31.2 Mb/s | ~169% |
| 100 Mb/s | ~30.8 Mb/s | ~165% |
@afaheemp4 So it looks similar to my results, with the only difference that on your system the simple_switch performance maxed out around 30 Mbits/sec, whereas on mine it maxed out around 16 Mbits/sec. The difference in those numbers could be simply because we have different models of CPU on our computers -- I expect those kinds of differences as normal, between different systems.
The summary of why this is probably occurring is exactly the same as I gave above for my results: "it seems clear that the link rate is a limiting factor on your system up to about a value of 30 or so, but increasing it beyond 30 and it appears that then the limiting factor is the packet forwarding performance of simple_switch". (That is a direct copy and paste from my earlier comment, changing 16 to 30, and "my system" to "your system")
Were you expecting the results to be different than what you got? If so, in what way?
I am expecting that when I configure a 100 Mb/s link, I get 100 Mb/s. I tried using mininet outside of P4 by typing `sudo mn`, and iperf can normally reach 100 Mb/s or more. If there is a limitation from P4, is there a workaround to reach this link speed?
@jafingerhut already referenced this document (https://github.com/p4lang/behavioral-model/blob/master/docs/performance.md) which includes all the necessary information
You will need to recompile bmv2 with `--disable-logging-macros` to get a large performance improvement, but don't expect more than a few hundred Mb/s.
It's not a limitation of P4, it is a limitation of the bmv2 implementation of a P4 software switch. There are other implementations available, e.g. based on DPDK, but they all come with trade-offs and are often at the prototyping stage.
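For reference, the rebuild mentioned above might look like the following. This is an untested sketch: the flag names (`--disable-logging-macros`, `--disable-elogger`) and the optimized `CXXFLAGS` come from behavioral-model's docs/performance.md, and the directory layout is an assumption.

```shell
# Reconfigure and rebuild bmv2 with logging compiled out, for higher
# packet rates (at the cost of losing per-packet debug logs).
cd behavioral-model
./autogen.sh
./configure --disable-logging-macros --disable-elogger 'CXXFLAGS=-g -O2'
make -j"$(nproc)"
sudo make install
```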
@afaheemp4 If you put in a terabit per second limit on the link rate, are you expecting 1 terabit per second throughput? 1 petabit per second?
I realize these are crazy questions, and you are probably not expecting that at all. Every method of packet forwarding has a finite maximum rate. bmv2 simple_switch's maximum rate depends upon many factors, as mentioned in the link that Antonin repeated above. It might be lower than you wish, and there are command line options for compiling bmv2 simple_switch, and/or when you run bmv2 simple_switch, that can change that maximum rate, by making the compiled code more efficient (trading off against more debuggability), or by disabling various kinds of logging (again, which trades off debuggability for less work done by the CPU per packet).
Even at the very best options for bmv2 simple_switch performance, a hand-coded assembly program that does the same kinds of operations will be able to go faster than simple_switch does.
@jafingerhut Thanks for the nice words about my questions. I am using only realistic values to simulate normal network environments. @antoninbas I have disabled the flags mentioned in the link above, but unfortunately no enhancement happened. I only need a 100 Mb link, but I only get 32M.
I suggest that, on your system, with the options you have compiled simple_switch with so far, and the command line options you are running it with, you treat 32 Mbits/sec as that simple_switch program's maximum rate. Thus, if you want to make one of the links the bottleneck of the system, so that it marks packets with ECN marks, you must reduce the link rate to lower than 32 Mbits/second.
That is assuming your goal is to continue the ECN exercise in the way it was intended, i.e. make packet marking occur for some packets when the link is congested.
If your goal is to make simple_switch faster, then it depends upon all of the factors described at the link above, and the maximum rate achievable depends on your operating system, your hardware, and all of those compile-time and command line arguments for simple_switch. We can point you at all of those options, but we cannot make you use them correctly.
I'm trying to play around with the bandwidth between S1 and S2 in the ECN exercise. I changed the bandwidth to different values, but according to `iperf s1 s2` the bandwidth is about 20 Gbits/sec no matter what value I set. How can I change the bw? Note: I'm new to P4.