p4lang / behavioral-model

The reference P4 software switch
Apache License 2.0

Performance issue of simple_switch_grpc #1172

Closed Johnny-dai-git closed 3 months ago

Johnny-dai-git commented 1 year ago

Hi,

I am running some tests on a CloudLab bare-metal machine, and I have encountered a performance problem while running a bmv2 (simple_switch_grpc) instance on it.

I used iperf to test the throughput of the physical interface on the bare-metal machine, and it is around 35 Gb/s. However, when I run a simple P4 forwarding program with a simple_switch_grpc instance, the throughput drops to 24 Mb/s.

Is there any explanation for this? Is there anything I can do to solve this performance problem?

Best Regards,

jafingerhut commented 1 year ago

Have you read through this document? https://github.com/p4lang/behavioral-model/blob/main/docs/performance.md

Johnny-dai-git commented 1 year ago

Hi,

Yes, I did read through the document and tried the example in it. However, my situation is more complicated:

1. I am not using Mininet for my development; I am trying to turn a VM or bare-metal machine into a bmv2 switch.
2. I am using simple_switch_grpc instead of simple_switch for my development and project.
3. I am compiling on Ubuntu 20.04 instead of Ubuntu 18.04.
4. I want the bmv2 instance to have better performance.

In my experiments, I tried the install scripts listed at https://github.com/jafingerhut/p4-guide/blob/master/bin/README-install-troubleshooting.md.

I tried v4, v5, and v6, but none of them compiled successfully with the optimization flags.

Is there anything I can do about this?

Best Regards,

jafingerhut commented 1 year ago

I have just started from a freshly installed Ubuntu 20.04 system.

I cloned this repo: https://github.com/jafingerhut/p4-guide

I followed the instructions at the README link in your comment and used the install-p4dev-v6.sh version of the install script.

Then I ran the following commands, and they successfully cleaned and recompiled the behavioral-model code with optimization enabled:

cd behavioral-model
make clean
./autogen.sh
# -O3 enables compiler optimization; --disable-logging-macros and
# --disable-elogger remove most of the per-packet logging overhead
./configure 'CXXFLAGS=-g -O3' 'CFLAGS=-g -O3' --disable-logging-macros --disable-elogger --with-pi --with-thrift
make
sudo make install
sudo ldconfig

I ran a few simple functional tests with the simple_switch_grpc binary built by the commands above, and they all passed.

Did you try something different than the above?

If so, I would recommend trying exactly the steps above, and if you get an error, please describe exactly the steps you followed and exactly what error you got. Saving ALL of the output of the commands above is recommended, but since the output can be quite long, uploading the file to some other place, e.g. a personal Github repo of yours, and linking to it from here, is preferable to copying and pasting many pagefuls of text into a comment.
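For example, one way to save everything while still seeing it on the terminal (the log file name is just a placeholder):

./install-p4dev-v6.sh 2>&1 | tee install-v6.log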

Johnny-dai-git commented 1 year ago

Hi, Andy

Thank you so much for your reply! I carefully read your message, ran several experiments on the testbed, and found some interesting things.

The good news is that I can now successfully compile the software on my testbed, and both the simple_switch and simple_switch_grpc instances work fine; packets go through the bmv2 instance successfully. However, some strange phenomena showed up, which I will explain with several experiment results.

To start, I pulled a new, clean Ubuntu 20.04 image, modified the install-p4dev-v6.sh script to use "./configure 'CXXFLAGS=-g -O3' 'CFLAGS=-g -O3' --disable-logging-macros --disable-elogger --with-pi --with-thrift", and ran the installation.

Experiment One: After the compilation succeeded, I ran the test shown at https://github.com/p4lang/behavioral-model/blob/main/docs/performance.md. Under the Mininet experiment, the throughput was 187 Mb/s.

However, my project does not run on Mininet. I need to use the bmv2 instance as a switch to steer packets between two physical machines; the topology is machine1 -> switch -> machine2. The switch runs a simple_switch_grpc instance with a simple IPv4 forwarding P4 program that contains only one table. I tested the speed between machine1 and machine2 with iperf, and the result was 3.17 Mb/s. (The raw throughput of the interface should be about 5 Gb/s.)
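For reference, the switch is launched roughly like this (the interface names, JSON file name, and address below are placeholders for my actual setup):

# bind switch ports 0 and 1 to the two physical NICs
sudo simple_switch_grpc -i 0@enp1s0f0 -i 1@enp1s0f1 ipv4_forward.json
# then measure end-to-end throughput through the switch
iperf -s                  # on machine2
iperf -c 10.0.0.2 -t 30   # on machine1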

Experiment Two: I recompiled the bmv2 instance with all the optimization flags removed, i.e. "./configure --with-pi --with-thrift", so the effect of the optimizations should disappear. After the compilation succeeded, I ran the test from https://github.com/p4lang/behavioral-model/blob/main/docs/performance.md again.

This time the result showed a throughput of 2 Mb/s.

After that I ran the experiment on the physical machines again. The result still showed around 3 Mb/s.

Based on these experiments, in my case the optimization flags only have an effect in the Mininet environment with a simple_switch instance; they have no effect in the physical-machine environment with a simple_switch_grpc instance.

I performed these experiments on CloudLab (https://www.cloudlab.us/) with a clean Ubuntu 20.04.5 LTS image.

Is there anything I can do to improve the performance on the physical machine?

Best Regards,

jafingerhut commented 1 year ago

I have little experience with trying to achieve good performance with bmv2, or analyzing performance issues with it.

Things that you could try checking:

When you are measuring the maximum packet forwarding performance you are able to achieve with BMv2, what does top say is the simple_switch / simple_switch_grpc CPU utilization? If it is less than 100% of one CPU core when forwarding from one input interface to one output interface, then something else besides the simple_switch process is likely the bottleneck, not simple_switch. For example, perhaps something about the way packet I/O is being done is the performance bottleneck instead.
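For example, a quick way to watch per-thread CPU usage of the switch process (one possible sketch; interactive top works as well):

top -H -p $(pgrep -f simple_switch_grpc | head -n 1)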

Johnny-dai-git commented 1 year ago

Hi, Andy

Thank you so much! I will have a look at it.

Is it possible that simple_switch and simple_switch_grpc differ in packet-processing performance?

Best Regards,

jafingerhut commented 1 year ago

I do not know of any reason why simple_switch and simple_switch_grpc should have any noticeable difference in their packet processing performance. The main difference between them that I know about is that simple_switch_grpc is able to accept table updates via the P4Runtime API message-passing interface, whereas simple_switch cannot.

Johnny-dai-git commented 1 year ago

Hi, Andy

I have tried this and observed the CPU utilization. While the bmv2 instance is steering packets, the CPU utilization fluctuates between 0% and 99%, staying under 30% most of the time. I guess this may explain why the throughput is so low.

Best Regards,

Johnny-dai-git commented 1 year ago

Hi, Andy

Are there any plans to make bmv2 a production-grade software application?

Best Regards,

jafingerhut commented 1 year ago

No one I know of has any plans to make bmv2 a production-grade packet processing device, intended for use in production environments. That could change tomorrow, of course, but I do not expect it to change.

The DPDK back end for P4 is intended to be used in performance-sensitive situations. It probably does not have as many optimizations as it will have a year or two from now, but it is being developed with performance in mind from day one.

Johnny-dai-git commented 1 year ago

Hi, Andy

Thank you so much for your suggestion! It may still be useful for us to improve the performance of bmv2 on a bare-metal machine; in some cases, we could do evaluations without running on a Tofino switch, for faster debugging and development.

In my case, I may need to do all the evaluation and experiments on a Tofino platform instead of on a public resource like cloudlab.us.

Best Regards,

antoninbas commented 1 year ago

@Johnny-dai-git I know you closed this issue, but I wanted to add that what you are observing (low throughput with no change even when you use the correct build flags) can sometimes be explained by checksum issues. This can be confirmed by capturing traffic with Wireshark. The general recommendation is to 1) disable all checksum offloads on the NIC, 2) ensure that you compute the checksum correctly in P4.
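For example, something along these lines (the interface name is a placeholder, and the exact set of offloads varies by NIC and driver):

# disable checksum and segmentation offloads on the NIC
sudo ethtool -K eno1 rx off tx off tso off gso off gro off
# capture with verbose output; look for packets flagged with incorrect checksums
sudo tcpdump -i eno1 -vvv -c 20 tcp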

Johnny-dai-git commented 1 year ago

Hi, Antonin

Thank you so much for your reply! I understand the scenario you are describing; I actually encountered this problem before. On the CloudLab platform, I have to recompute the IPv4 and TCP checksums in the P4 program; otherwise, some packets will not go through the bmv2 instance.

However, the P4 program I wrote does recompute all the TCP and IPv4 checksums. Also, to verify this, I tried controlling the packet rate of iperf: if the rate is low enough, all packets go through the bmv2 instance; otherwise, some packets disappear while going through it. The rate-limited runs looked roughly like the sketch below.
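A minimal version of those runs, with a placeholder address and rates:

# UDP at a fixed rate; the server side reports the loss percentage
iperf -s -u                         # on the receiving machine
iperf -c 10.0.0.2 -u -b 1M -t 30    # low rate: no packets lost
iperf -c 10.0.0.2 -u -b 50M -t 30   # higher rate: packets start to disappear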

I do not think a checksum issue is causing this problem.

Best Regards,

Johnny-dai-git commented 1 year ago

Since this problem has not been dealt with, it is better for me to leave it open.

Best Regards,

hkgb77 commented 1 year ago

Hi,

If this helps anyone: I tried running simple_switch_grpc with two hosts (H1 <-> switch <-> H2); the only tables my P4 program has match on the L2 MAC address and do an IPv4 LPM lookup. I ran some tests with iperf3, with the following results: without the configure options suggested in this post, I see packet loss beyond 7 Mbps; with the configure options changed to disable logging (as per Andy's response, repeated below), I could reach 55-56 Mbps without any loss.
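For reference, the logging-disabling configure invocation from Andy's comment above:

./configure 'CXXFLAGS=-g -O3' 'CFLAGS=-g -O3' --disable-logging-macros --disable-elogger --with-pi --with-thrift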

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days