Closed · SanaaComp closed this issue 4 years ago
S1 and S2 are simple_switch processes? If so, achieving 20 gigabits/sec sounds impossibly high, and I would question how you reached that conclusion.
One possible way that might help reduce the maximum rate is the `set_queue_rate` command in the `simple_switch_CLI` program, which can connect to at most one simple_switch or simple_switch_grpc process at a time (or at least that is the only way I have ever used it). Running `help set_queue_rate` shows the following help message:
```
Set rate of one / all egress queue(s): set_queue_rate <rate_pps> [<egress_port>]
```
I have never tested this technique myself to see whether it has the intended effect.
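For reference, an invocation might look like the following. This is an untested sketch: the 1000 packets/sec value is a placeholder, and port 9090 stands in for whichever Thrift port the target switch listens on.

```shell
# Untested sketch: pipe the CLI command into simple_switch_CLI for the
# switch whose Thrift server listens on port 9090; 1000 pps is a placeholder.
echo "set_queue_rate 1000" | simple_switch_CLI --thrift-port 9090
```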
I am sure there are other ways.
```
mininet> iperf s1 s2
*** Iperf: testing TCP bandwidth between s1 and s2
*** Results: ['21.9 Gbits/sec', '21.9 Gbits/sec']
```
Using `set_queue_rate` did nothing.
I tried the `iperf` command within mininet on my system and also see answers in units of Gbits/sec. I really, sincerely doubt that these measurements are accurate. I do not know how the `iperf` within mininet works, but this throughput is so large, and the CPU utilization on my system so low, that it seems impossible that the reported throughput value could be correct. I would not believe it until someone provides extremely strong evidence that it is correct.
Here is strong evidence that whatever the mininet `iperf` command is doing, it is not sending packets that get processed by the simple_switch processes that execute your P4 code:
In a separate terminal, look in the `logs` directory, which, at least for the `exercises/basic` tutorial exercise I tried with, contains files like `s1.log`, `s2.log`, etc., one for each simple_switch process, with names that correspond to the names of the switches in the mininet topology. The contents of those files show detailed tracing information about how the P4 program processed every packet received by the corresponding switch.
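A convenient way to watch one of those trace files grow in real time is `tail -f` (a sketch; the `logs/s1.log` path follows the exercises/basic layout and may differ in your setup):

```shell
# Follow switch s1's trace log as new packets are processed;
# interrupt with Ctrl-C when done.
tail -f logs/s1.log
```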
When I run `h1 ping h2`, I see new trace lines added to the file `s1.log` (again, my example is for the exercises/basic exercise -- your topology of hosts and switches may differ, but some file should be appended to when a switch processes packets).
When I run `iperf s1 s2`, none of those files are updated. So no packets are being processed by any of the simple_switch processes.
For the few seconds that `iperf s1 s2` is running, I type a command in a separate terminal window to list the iperf processes. The output shows these:

```
$ ps axguwww | grep iperf
andy 6823 0.0 0.1 28428 4052 pts/11 S+ 12:01 0:00 man iperf
root 7049 53.2 0.0 313884 3012 pts/7 Sl+ 12:20 0:02 iperf -p 5001 -s
root 7056 86.7 0.0 166420 1948 pts/6 Sl+ 12:20 0:03 iperf -p 5001 -t 5 -c 127.0.0.1
andy 7061 0.0 0.0 21532 1096 pts/10 S+ 12:20 0:00 grep --color=auto iperf
```
The one with the command line `iperf -p 5001 -t 5 -c 127.0.0.1` has the option `-c 127.0.0.1`. 127.0.0.1 is the IPv4 loopback address, meaning "the IPv4 address assigned to me, whatever system the address is resolved on". I strongly suspect that these `iperf` commands are sending packets to each other not through the mininet emulated network, but directly through the host operating system's loopback interface, which simply copies the packet from one process to another through the OS kernel.
I think you should be using the mininet `iperf` command between hosts, not switches, e.g.:

```
mininet> iperf h1 h2
```
When I try that on the exercises/basic tutorial, I see many packets being traced in the s1.log file for switch s1, which in that exercise's topology is the only switch on the path from host h1 to h2.
When I did that, I saw performance results of about 19 Mbits/sec, which I can much more readily believe than anything in units of Gbits/sec.
Thanks for clarifying.
I re-opened this issue.
I created a simple network with three hosts and two switches. After running `iperf`, the performance results were about 700 Mbits/sec. But the bw in `exercises/basic` between h1 and h2 is about 10 Mbits/sec.
Is it because of the P4?
Short answer: yes.
Long answer:
Yes, but the precise performance you see right now is not the fastest that the simple_switch process can do. It has multiple options it can be compiled with, and multiple choices for options on its command line when it is started, that can affect its performance significantly.
Several ways are described here: https://github.com/p4lang/behavioral-model/blob/master/docs/performance.md
Another way is to run simple_switch with neither the `--log-console` nor `--log-file` command line options; one of those is enabled by default when doing `make run` in one of the tutorials exercises. Enabling those options is what produces detailed trace logs of processing packets, which can make debugging an incorrect P4 program much more straightforward, but also slows down packet processing.
How can I run `exercises/ecn` without logs?
How much Python do you know, or are you willing to learn? It doesn't take very much Python knowledge, plus a willingness to dig in and add debug print statements (or whatever your favorite techniques are for learning how an existing program works), to figure out where in the Python code included in the tutorials repository the simple_switch process is started. You can start by grep'ing all of the files for occurrences of simple_switch, then focus on the occurrences in Python source files. You would need to change that code (or find out that perhaps there is already an existing option in the Python code) to start the simple_switch processes without the `--log-file some-file-name-here` option.
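As a concrete starting point, the grep step might look like this (a sketch; the `~/tutorials` checkout path is an assumption about where the repository lives on your machine):

```shell
# Recursively list every Python line mentioning simple_switch, with file
# name and line number, to find where the switch command line is built.
grep -rn --include='*.py' 'simple_switch' ~/tutorials/
```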
Some more details on how you can go about this.
I know that when running `make run` in one of the tutorials exercises directories, it runs one simple_switch or simple_switch_grpc process for each switch in the network being emulated. So I go to the exercises/ecn directory, run `make run`, and after it starts up, switch to a different terminal window and type a command that shows all processes running on the system, with their full command line options. There are multiple such commands, but one is `ps axguwww`. That shows all processes, so I narrow the output down to only the lines containing simple_switch using `grep`, e.g.:

```
ps axguwww | grep simple_switch
```
Here are 3 of the lines of output I see when I run that on my system:
```
root 32633 0.5 0.9 1419860 40028 pts/6 Sl+ 11:32 0:03 simple_switch_grpc -i 1@s1-eth1 -i 2@s1-eth2 -i 3@s1-eth3 -i 4@s1-eth4 --pcap /home/andy/tutorials/master/exercises/ecn/pcaps --nanolog ipc:///tmp/bm-0-log.ipc --device-id 0 build/ecn.json --log-console --thrift-port 9090 -- --grpc-server-addr 0.0.0.0:50051
root 32650 0.7 0.9 1419860 40296 pts/7 Sl+ 11:32 0:03 simple_switch_grpc -i 1@s2-eth1 -i 2@s2-eth2 -i 3@s2-eth3 -i 4@s2-eth4 --pcap /home/andy/tutorials/master/exercises/ecn/pcaps --nanolog ipc:///tmp/bm-1-log.ipc --device-id 1 build/ecn.json --log-console --thrift-port 9091 -- --grpc-server-addr 0.0.0.0:50052
root 32675 0.1 0.9 1350480 36816 pts/8 Sl+ 11:32 0:00 simple_switch_grpc -i 1@s3-eth1 -i 2@s3-eth2 -i 3@s3-eth3 --pcap /home/andy/tutorials/master/exercises/ecn/pcaps --nanolog ipc:///tmp/bm-2-log.ipc --device-id 2 build/ecn.json --log-console --thrift-port 9092 -- --grpc-server-addr 0.0.0.0:50053
```
3 simple_switch_grpc processes, one for each of the 3 switches in the exercises/ecn network topology.
The `--log-console` option causes simple_switch_grpc to print out detailed trace information while processing each packet. You want to stop using that option.
Let us look at other command line options that might be slowing things down, while we are at it.
`--pcap /home/andy/tutorials/master/exercises/ecn/pcaps` causes simple_switch_grpc to write every packet received or sent on every port to a file. You want to stop using that option, too.
I doubt you need the `--nanolog ipc:///tmp/bm-1-log.ipc` option except for extra kinds of debug tracing. You can try removing it, and if anything goes wrong, put it back.
Command line options you need to keep, and why:
`-i 1@s3-eth1 -i 2@s3-eth2 -i 3@s3-eth3` defines which ports exist on the software switch. You need those to be there, or your simple_switch_grpc switches won't have any ports to send and receive packets on.
`--device-id 1 build/ecn.json`: `--device-id` gives each of the switches a unique id, I think so that the P4Runtime API controller software can send commands to one of them vs. another. In any case, as far as I know it should not be causing packets to be processed more slowly, so keep it. `build/ecn.json` is the compiled version of your P4 program, output by the p4c compiler and read by simple_switch_grpc. You need that.
`--thrift-port 9092 -- --grpc-server-addr 0.0.0.0:50053`: `--thrift-port 9092` tells the process what TCP port to listen on for incoming controller connections using the Thrift API. Keep that. Similarly for `--grpc-server-addr`, except that one listens for incoming controller connections using the P4Runtime API. Keep that, too.
Note that the bandwidth is also limited to what is specified in the topology.json file. If you run a command like `top` while running your `iperf` tests, and see that the CPU usage of the simple_switch_grpc processes never goes above a few percent of one CPU core, as I see with the default contents of the topology.json file, the limiting factor is not simple_switch_grpc speed, but the link rate specified in the topology.json file.
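A quick way to sample just the switch processes' CPU usage during a test (a sketch assuming procps `top`; `-b` is batch mode, `-n 1` takes one snapshot):

```shell
# Take one snapshot of all processes and keep only the simple_switch lines.
top -b -n 1 | grep simple_switch
```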
You can add other arguments to Bmv2 on this line of the Makefile for the `ecn` exercise.
@jafingerhut Thank you for your very detailed answers. I managed to disable the log and pcap. The bw increased, but only by about 1 Mbits/sec. Is this the maximum performance?
> You can add other arguments to Bmv2 on this line of the Makefile for the `ecn` exercise.
Yes I can. But which argument can disable log and pcap? After all, I managed to disable them.
Did you increase the bandwidth limit number in the topology.json file? What value is it set to?
Did you run `top` while doing an iperf test to see how much CPU the simple_switch_grpc processes were using during that time? How high did it go?
Basically, my earlier short answer "yes" was wrong. There are multiple things that can limit the bandwidth here besides simple_switch_grpc -- such as the link rate assigned to the link between two switches on the path from host h1 to h2, which was put there in the ECN exercise specifically to make the bit rate allowed there lower than what simple_switch_grpc can support, so that queues would build up and some packets would be ECN marked.
If you make it so that simple_switch_grpc is the limiting factor on packet rate, then no queues will build up there, and it will never mark packets as having experienced congestion.
Related to the same issue: I have increased the BW in the topology.json file, but the iperf result always gives the same constant rate between hosts (~30 Mb/s). It works only if I lower the rate to 20M or 10M. I used the queue rate command on simple_switch_grpc, but no change. Can you help on this point?
I have run some simple experiments. I am not saying these are the point of the ECN exercise, and others will likely get different results, because the precise performance depends upon your CPU model, clock speed, etc.: simple_switch is doing software forwarding on a general purpose CPU. I have a mid-2014 MacBook Pro, with p4c and simple_switch built from recent versions of the source code for those projects, running an Ubuntu 18.04 Desktop Linux OS inside VirtualBox 6.0.16. I did not try to disable simple_switch logging or pcap file recording of packets in these experiments.
Steps I took, and things I measured:

- Run `iperf h1 h2`. Observe the changing values in the %CPU column in the window running `top` for the two simple_switch processes that are forwarding the packets, and see how high those numbers get. They go back down shortly after `iperf` is finished, since then the packet flow stops. Record a range of those %CPU numbers for that experiment. Also record the two rate numbers reported in the output of the `iperf` command.
- Edit the link rate value in topology.json and do `make run` again. Do 3 more `iperf h1 h2` commands and record the measurements described above.

In the topology.json file, the link rate value is the "0.5" in this definition of the link: `["s1-p3", "s2-p3", "0", 0.5]`. I did measurements for all of the values shown in the table below.
Glossary:

- L: link rate value configured in topology.json
- U (%): range of %CPU observed for the simple_switch processes
- R1, R2: the two rate numbers reported by `iperf`

```
L     U (%)      R1               R2
0.5     6 - 8    480 Kbits/sec    651 Kbits/sec
0.5     6 - 7    480 Kbits/sec    692 Kbits/sec
0.5     5 - 7    480 Kbits/sec    650 Kbits/sec
1.0     8 - 12   958 Kbits/sec    1.18 Mbits/sec
1.0     9 - 13   958 Kbits/sec    1.20 Mbits/sec
1.0     8 - 13   958 Kbits/sec    1.18 Mbits/sec
2.0    20 - 24   1.90 Mbits/sec   2.25 Mbits/sec
2.0    16 - 23   1.91 Mbits/sec   2.18 Mbits/sec
2.0    18 - 23   1.87 Mbits/sec   2.15 Mbits/sec
4.0    29 - 34   3.33 Mbits/sec   3.75 Mbits/sec
4.0    35 - 38   3.78 Mbits/sec   4.28 Mbits/sec
4.0    29 - 34   3.44 Mbits/sec   3.80 Mbits/sec
8.0    58 - 58   7.55 Mbits/sec   8.12 Mbits/sec
8.0    52 - 52   6.08 Mbits/sec   6.61 Mbits/sec
8.0    42 - 42   6.28 Mbits/sec   6.69 Mbits/sec
16.0  117 - 119  15.2 Mbits/sec   15.9 Mbits/sec
16.0  120 - 130  15.0 Mbits/sec   15.6 Mbits/sec
16.0   97 - 100  13.2 Mbits/sec   14.2 Mbits/sec
32.0   89 - 91   11.1 Mbits/sec   11.5 Mbits/sec
32.0  133 - 134  15.9 Mbits/sec   16.6 Mbits/sec
32.0  125 - 125  14.7 Mbits/sec   16.7 Mbits/sec
```
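To double-check which link entries in topology.json carry a rate limit, the 4-element link form can be listed with a short script. This is a sketch assuming the field order is [node1, node2, delay, bandwidth-in-Mbps], as in the `["s1-p3", "s2-p3", "0", 0.5]` entry quoted above:

```shell
# Print every bandwidth-limited link found in ./topology.json.
python3 - <<'EOF'
import json
topo = json.load(open("topology.json"))
for link in topo["links"]:
    if len(link) == 4:
        node1, node2, delay, bw = link
        print(f"{node1} <-> {node2}: bw = {bw} Mbits/sec")
EOF
```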
Things I do not know, and have not found out yet, but would be useful to answer:
Even without answering those questions, it seems clear that the link rate is the limiting factor on my system up to a value of about 16, but beyond that the limiting factor appears to be the packet forwarding performance of simple_switch. The link rate value at which it stops being the limiting factor on your system will likely be different than on mine.
Could I have adjusted other parameters and recorded results, such as the queue rate command configuration you mentioned? Certainly. Could I have measured additional things? Sure, but I was going here for a "quick" experiment that demonstrates at least one way to make the performance change: modifying the link rate. It also demonstrates that the %CPU of the simple_switch process is a useful thing to measure and record, because if it is significantly less than 100%, then simple_switch is not the limiting factor in performance, so something else must be.
I would recommend that if you have questions, you record at least this much data about your experiments: what you changed, and what you did not change. This is pretty standard lab-book recording of scientific experiment results. Without being careful and methodical in recording what you tried, it is not easy to see for yourself what you have done, and when you ask others for help, it is easy to leave out details of what you did.
> Did you increase the bandwidth limit number in the topology.json file? What value is it set to?
>
> Did you run `top` while doing an iperf test to see how much CPU the simple_switch_grpc processes were using during that time? How high did it go?
>
> Basically, my earlier short answer "yes" was wrong. There are multiple things that can limit the bandwidth here besides simple_switch_grpc -- such as the link rate assigned to the link between two switches on the path from host h1 to h2, which was put there in the ECN exercise specifically to make the bit rate allowed there lower than what simple_switch_grpc can support, so that queues would build up and some packets would be ECN marked.
>
> If you make it so that simple_switch_grpc is the limiting factor on packet rate, then no queues will build up there, and it will never mark packets as having experienced congestion.
I set it to 30. And `top` shows the CPU utilization of the switch while running `iperf` is about 42%.
After making tests like yours, I noticed changing the link rate value affects neither the CPU utilization nor the BW. Does this mean it is a problem with my VM?
I wrote my own topology using Python (so I can run a .p4 program without the Makefile in `exercises`).
My topology consists of three hosts and one switch. After running `basic.p4` with my simple network, I used `iperf` to measure the BW. The result is 30 Mbits/sec. Using `iperf` with `exercises/basic` gives me a BW of about 7 Mbits/sec.
I disabled logs and pcap, but still no difference.
One note that may help in understanding the problem: I tried to run the same program on two different PCs with different specs. With the same configuration in topology.json, the PC with the faster processor gives an iperf BW of 30M, and the slower one gives almost 20M.
@SanaaComp "After making tests like yours, I noticed changing the link rate value neither effects the CPU utilization nor the BW. Dose this mean it is a problem with my VM?"
"like yours". I don't see a detailed writeup of exactly what you did. If by "like mine", you mean you followed the same set of steps I did in my experimental results, that is one thing, but do you know whether in your experiments that packets are even going across the link that has the bandwidth limit on them? If you are sending traffic between two hosts, using iperf, that does not even cross the link with the rate limit configured on it, then changing that rate limit will not have any effect.
@SanaaComp When you change the rate limit, do you also quit mininet, type `make stop`, then `make run` again? If you do not, then the changes to the topology.json file will not be read again, so they have no effect.
> @SanaaComp When you change the rate limit, do you also quit mininet, type `make stop`, then `make run` again? If you do not, then the changes to the topology.json file will not be read again, so have no effect.
Thank you!
@jafingerhut In a single-switch topology with 4 hosts I faced the BW limit problem, and any change in the topology.json file takes effect only if the BW is less than 30 Mb/s. I also stop and re-run the make file after each change, with no hope.
@afaheemp4 "with no hope"? While there's life, there is hope! It sounds like you are able to observe some kind of change in behavior if you configure the bandwidth number in the topology.json file less than 30? Did you make a table of experiments you tried that included what you varied, and what you measured, like the one I demonstrated above? If not, then how do you remember what you did?
@jafingerhut I am using a single switch topology attached to 4 hosts, and I am running the iperf command between h1 and h2. I got the results below:

| Configured link rate | iperf measured rate | CPU utilization |
|---|---|---|
| 10 Mb/s | ~9.8 Mb/s | ~59% |
| 25 Mb/s | ~24.5 Mb/s | ~118% |
| 50 Mb/s | ~32.2 Mb/s | ~166% |
| 90 Mb/s | ~31.2 Mb/s | ~169% |
| 100 Mb/s | ~30.8 Mb/s | ~165% |
@afaheemp4 So it looks similar to my results, with the only difference that on your system the simple_switch performance maxed out around 30 Mbits/sec, whereas on mine it maxed out around 16 Mbits/sec. The difference in those numbers could be simply because we have different models of CPU on our computers -- I expect those kinds of differences as normal, between different systems.
The summary of why this is probably occurring is exactly the same as I gave above for my results: "it seems clear that the link rate is a limiting factor on your system up to about a value of 30 or so, but increasing it beyond 30 and it appears that then the limiting factor is the packet forwarding performance of simple_switch". (That is a direct copy and paste from my earlier comment, changing 16 to 30, and "my system" to "your system")
Were you expecting the results to be different than what you got? If so, in what way?
I am expecting that when I configure a 100 Mb/s link, I get 100 Mb/s. I tried using mininet outside of P4 by typing `sudo mn`, and iperf can normally reach 100 Mb/s or more. If there is a limitation from P4, is there a workaround to reach this link speed?
@jafingerhut already referenced this document (https://github.com/p4lang/behavioral-model/blob/master/docs/performance.md) which includes all the necessary information
You will need to recompile bmv2 with `--disable-logging-macros` to get a large performance improvement, but don't expect more than a few hundred Mb/s.
It's not a limitation of P4, it is a limitation of the bmv2 implementation of a P4 software switch. There are other implementations available, e.g. based on DPDK, but they all come with trade-offs and are often at the prototyping stage.
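For reference, the rebuild mentioned above might look like the following. This is an untested sketch: the flag names (`--disable-logging-macros`, `--disable-elogger`) and the optimized `CXXFLAGS` come from behavioral-model's docs/performance.md, and the directory layout is an assumption.

```shell
# Reconfigure and rebuild bmv2 with logging compiled out, for higher
# packet rates (at the cost of losing per-packet debug logs).
cd behavioral-model
./autogen.sh
./configure --disable-logging-macros --disable-elogger 'CXXFLAGS=-g -O2'
make -j"$(nproc)"
sudo make install
```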
@afaheemp4 If you put in a terabit per second limit on the link rate, are you expecting 1 terabit per second throughput? 1 petabit per second?
I realize these are crazy questions, and you are probably not expecting that at all. Every method of packet forwarding has a finite maximum rate. bmv2 simple_switch's maximum rate depends upon many factors, as mentioned in the link that Antonin repeated above. It might be lower than you wish, and there are command line options for compiling bmv2 simple_switch, and/or when you run bmv2 simple_switch, that can change that maximum rate, by making the compiled code more efficient (trading off against more debuggability), or by disabling various kinds of logging (again, which trades off debuggability for less work done by the CPU per packet).
Even at the very best options for bmv2 simple_switch performance, a hand-coded assembly program that does the same kinds of operations will be able to go faster than simple_switch does.
@jafingerhut Thanks for the nice words about my questions. I am using only realistic values to simulate normal network environments. @antoninbas I have disabled the flags mentioned in the link above, but unfortunately no enhancement happened. I only need a 100 Mb link, but I only get 32M.
I suggest that, on your system, with the options you have compiled simple_switch with so far, and the command line options you are running it with, you treat 32 Mbits/sec as that simple_switch program's maximum rate. Thus, if you want to make one of the links the bottleneck of the system, so that it marks packets with ECN marks, you must reduce the link rate to lower than 32 Mbits/second.
That is assuming your goal is to continue the ECN exercise in the way it was intended, i.e. make packet marking occur for some packets when the link is congested.
If your goal is to make simple_switch faster, then it depends upon all of the factors described at the link above, and the maximum rate achievable depends on your operating system, your hardware, and all of those compile-time and command line arguments for simple_switch. We can point you at all of those options, but we cannot make you use them correctly.
I'm trying to play around with the bandwidth between S1 and S2 in the ECN exercise. I changed the bandwidth to different values, but according to `iperf s1 s2` the bandwidth is about 20 Gbits/sec no matter what value I set. How can I change the bw? Note: I'm new to P4.