sdnfv / openNetVM

A high performance container-based NFV platform from GW and UCR.
http://sdnfv.github.io/onvm/

manager fails with flent, poor performance of bridge NF as compared to l2fwd #245

Closed · rohit-mp closed this 4 years ago

rohit-mp commented 4 years ago

Bug Report

Current Behavior

The onvm manager stops processing packets after some time while running flent tests, and the performance of the bridge NF is low compared to l2fwd.

Steps to reproduce

The following script, with lines 30-39 commented out, was used to reproduce this behavior: https://gist.github.com/archit-p/2ddfca67fb691140b02c87a5bb56fc04

The topology of the setup:

+------------------------+                                   +------------------------+
|                        |                                   |                        |
|                        +- eth-gs-0 ------------- eth-sg-0 -+                        |
|       Generator        |                                   |          SUT           |
|                        |                                   |                        |
|                        +- eth-gs-1 ------------- eth-sg-1 -+                        |
|                        |                                   |                        |
+------------------------+                                   +------------------------+

On SUT:
Command to run the manager: ./onvm/go.sh 0,1,2 3 0xf8 -s stdout
NFs run: Bridge NF: ./go.sh 1
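
For completeness, before launching the manager it helps to confirm that the two SUT ports are actually bound to a DPDK-compatible driver. This is standard dpdk-devbind usage from the DPDK tree; the PCI addresses and the igb_uio driver below are placeholders for whatever the SUT actually uses:

```sh
# Run from the DPDK source tree on the SUT; addresses/driver are examples only
./usertools/dpdk-devbind.py --status
# If eth-sg-0/eth-sg-1 are still on the kernel driver, bind them, e.g.:
sudo ./usertools/dpdk-devbind.py --bind=igb_uio 0000:03:00.0 0000:03:00.1
```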

On Generator: sudo ./setup-flent.sh
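
The full setup is in the gist above, but at a high level the generator side boils down to running flent from the client namespace against 10.0.0.2 through the SUT. A minimal sketch, assuming the script sets up client/server namespaces on the generator with netserver listening at 10.0.0.2 (the exact steps live in setup-flent.sh):

```sh
# Rough sketch only; the real setup is done by setup-flent.sh
sudo ip netns exec server netserver              # netperf endpoint that flent drives
sudo ip netns exec client ping -c 3 10.0.0.2     # sanity check: path through the SUT is up
sudo ip netns exec client flent tcp_nup -l 60 -H 10.0.0.2 -t onvm-bridge
```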

Before running flent on the generator, I'm able to ping 10.0.0.2 from the client namespace. On running the script, I observe that after a few seconds the manager stops processing all packets, i.e., rx_pps drops to zero, and after the script finishes I'm no longer able to ping 10.0.0.2 from the client namespace either.
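
When it gets into this state, a few standard checks from the client namespace can help narrow down where the path breaks (nothing onvm-specific here; the namespace name client is from the setup above):

```sh
# Run on the generator once the manager's rx_pps has dropped to zero
sudo ip netns exec client ping -c 3 -W 1 10.0.0.2   # is the path through the SUT still alive?
sudo ip netns exec client ip neigh show             # did the ARP entry for 10.0.0.2 go stale/FAILED?
sudo ip netns exec client ip -s link                # are the client-side tx counters still increasing?
```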

Further, on uncommenting lines 30-35, I see that some flent tests complete (the manager still stops processing packets, but now only occasionally rather than every time), yet the observed bandwidth is capped at around 350 Mbps, whereas running DPDK l2fwd on the SUT instead of onvm reaches up to 10 Gbps (as seen with a 10G NIC). [Refer to the results attached below]
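
For reference, the l2fwd baseline is the stock DPDK example; a typical two-port invocation looks like the sketch below (the core list and port mask are illustrative, not necessarily the exact values used for the attached plot):

```sh
# Path depends on how the DPDK examples were built; -p 0x3 enables ports 0 and 1
sudo ./examples/l2fwd/build/l2fwd -l 1-2 -n 4 -- -p 0x3
```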

Edit: I also tried out the l2 switch NF and observed similar behaviour to that described above.

Environment

Additional context/Screenshots

Results for performance issue:

Single TCP flow with openNetVM: tcp_nup_1flow

Single TCP flow with l2fwd: l2fwd-1up

@archit-p

twood02 commented 4 years ago

Interesting. We haven't tested with flent before. We will talk about this in our meeting today and see if we can reproduce it.

kevindweb commented 4 years ago

@rohit-mp thanks for testing this. Flent seems like an interesting tool. I've seen this issue a few times before with Pktgen last year but wasn't able to replicate it for testing. When ping can no longer connect, what do you need to do to bring the interface back up? Is it as simple as unbinding the interface, or does it take an entire reboot? Was there any relevant debug output from the ONVM manager or the bridge NF when it ran into these issues?

rohit-mp commented 4 years ago

I've seen this issue a few times before in Pktgen last year but wasn't able to replicate it for testing.

I'm now seeing the same issue with Pktgen as well: Pktgen shows a tx rate of 1000 Mbps, but the manager shows 0 rx_pps (on a 1G NIC).
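
Roughly how the traffic was generated, using standard pktgen-dpdk runtime commands (the EAL arguments and core/port mapping are placeholders rather than the exact ones used):

```sh
# Launch pktgen-dpdk from wherever it was built (placeholder EAL args/mapping),
# then issue commands at the Pktgen> prompt:
sudo ./pktgen -l 0-2 -n 4 -- -P -m "[1:2].0"
# Pktgen> set 0 size 64
# Pktgen> set 0 rate 100
# Pktgen> start 0
```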

When ping cannot connect, what do you need to do to bring up the interface again, is it as simple as unbinding the interface, or an entire reboot?

Restarting the manager resolves the issue for me with both Pktgen and Flent.

Was there any relevant debug output from ONVM manager or bridge when it ran into these issues?

There was no debug output from either when this happened.

rohit-mp commented 4 years ago

I was trying to work out why this happened and realized that the dpdk plot above was taken with dpdk 20.05. On trying onvm with dpdk 20.05 from the dpdk-vsn-update branch, I see that onvm performs as expected, and I no longer see the issue of the manager stopping packet processing either.

onvm with dpdk 20.05 with 1 TCP flow: 1flow-onvm-dpdk20

To confirm that the issue was with dpdk 18.11 and not onvm, I tried l2fwd from dpdk 18.11 and encountered the same issue as before. It would be good if someone else could verify this.
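
For anyone who wants to try the same comparison, switching to the dpdk-vsn-update branch and re-syncing DPDK is roughly the following (this assumes the usual submodule layout; the exact rebuild steps may differ on your setup):

```sh
# In the openNetVM repo root; layout and rebuild steps are assumptions
git checkout dpdk-vsn-update
git submodule sync && git submodule update --init
git -C dpdk describe --tags       # confirm the submodule now points at a 20.05 tag
# then rebuild DPDK and onvm as usual before re-running the manager and NFs
```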

kevindweb commented 4 years ago

I think we're good to close this issue?