sdnfv / openNetVM

A high performance container-based NFV platform from GW and UCR.
http://sdnfv.github.io/onvm/

Duplicated flush in `onvm_pkt_process_rx_batch` #281

Closed. JackKuo-tw closed this issue 3 years ago.

JackKuo-tw commented 3 years ago

Bug Report

Current Behavior

The function onvm_pkt_process_rx_batch calls two functions, onvm_pkt_enqueue_nf and onvm_pkt_flush_all_nfs, and both of them ultimately call onvm_pkt_flush_nf_queue, which could lower performance.

Expected behavior/code

Call it once only.

Possible Solution

Because onvm_pkt_process_tx_batch also calls onvm_pkt_enqueue_nf but does NOT call any flush function, I think there are 3 possible solutions:

  1. Remove the onvm_pkt_flush_all_nfs call from onvm_pkt_process_rx_batch
  2. Add a control parameter to onvm_pkt_enqueue_nf that decides whether it calls onvm_pkt_flush_nf_queue (sketched below)
  3. Remove onvm_pkt_flush_nf_queue from onvm_pkt_enqueue_nf, and add onvm_pkt_flush_nf_queue to onvm_pkt_process_tx_batch
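
For concreteness, here is a toy sketch of solution 2. The names and types are illustrative only; the real onvm_pkt_enqueue_nf works on ONVM's internal buffer structures and rte_mbuf packets. The point is the extra flag that lets the rx path skip per-enqueue flushing and flush exactly once at the end of the batch:

/* Toy model of solution 2; not the real ONVM API. */
#include <stdio.h>

#define PACKET_READ_SIZE 32

struct nf_buf {
        int pkts[PACKET_READ_SIZE];
        int count;
};

static void
flush_nf_queue(struct nf_buf *b) {
        if (b->count == 0)
                return;
        printf("flush %d packets\n", b->count);
        b->count = 0;
}

/* flush_if_full is the control parameter proposed in solution 2. */
static void
enqueue_nf(struct nf_buf *b, int pkt, int flush_if_full) {
        b->pkts[b->count++] = pkt;
        if (flush_if_full && b->count == PACKET_READ_SIZE)
                flush_nf_queue(b);
}

int
main(void) {
        struct nf_buf b = { .count = 0 };
        int i;

        /* rx-style caller: a batch is at most PACKET_READ_SIZE packets,
         * so enqueue without flushing ... */
        for (i = 0; i < 24; i++)
                enqueue_nf(&b, i, 0);
        /* ... and flush exactly once at the end of the batch
         * (the job onvm_pkt_flush_all_nfs does today). */
        flush_nf_queue(&b);

        /* tx-style caller: there is no end-of-batch flush, so the enqueue
         * itself flushes whenever the buffer fills up; any leftover packets
         * simply wait for the next flush. */
        for (i = 0; i < 40; i++)
                enqueue_nf(&b, i, 1);
        return 0;
}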

BTW

I traced this code because I found that with multiple simple_forward NFs in a chain, the first one in the chain consumes much more CPU than the others, which is strange.

But I don't think fixing this issue will solve that problem; I haven't verified this guess yet.

JackKuo-tw commented 3 years ago

I have tried solution 1 (commenting out onvm_pkt_flush_all_nfs). The simple_forward CPU usage decreases from 62% to 50%!

Env

VNF chain: fwd -> fwd -> fwd -> firewall
ONVM: shared cores enabled
pktgen: 5.7 M pkt/s, UDP, 64 bytes

Performance

== CPU Usage ==

twood02 commented 3 years ago

Thanks @JackKuo-tw.

I believe the reason we have onvm_pkt_flush_all_nfs in onvm_pkt_process_rx_batch is that otherwise, under very low traffic, you might never flush out the packets. The other flushes only happen when a buffer is full, but what if that condition is never met?

Can you test to see how your solution behaves under a low traffic rate, for example just using a ping command to generate 1 packet a second?

In any case, it is interesting that you find this consuming a large amount of CPU. We have not done much profiling to find the main sources of overhead.

JackKuo-tw commented 3 years ago

@twood02 You're right: solution 1 holds back all packets whenever the buffer count stays below PACKET_READ_SIZE (default 32).

It's a trade-off between low latency and high throughput.

AFAIK, the following modification should achieve our goal by using a timeout-like counter:

  1. Move the onvm_pkt_flush_all_nfs() call from onvm_pkt.c to main.c
  2. Add delayed-flush code
    • If only a few packets arrive, each loop iteration is fast, so the counter reaches flush_threshold sooner than under heavy traffic and the buffered packets are still flushed promptly
    • If lots of packets arrive, the buffer reaches PACKET_READ_SIZE quickly and is flushed inside onvm_pkt_enqueue_nf
rx_thread_main(void *arg) {
        ...
!!      int flush_threshold = 0;      /* loop iterations since the last forced flush */
!!      int enable_flush_count = 0;   /* set once a batch has been enqueued to NFs */
        ...
        for (; worker_keep_running;) {
                /* Read ports */
                for (i = 0; i < ports->num_ports; i++) {
                        rx_count = rte_eth_rx_burst(ports->id[i], rx_mgr->id, pkts, PACKET_READ_SIZE);
                        ports->rx_stats.rx[ports->id[i]] += rx_count;

                        /* Now process the NIC packets read */
                        if (likely(rx_count > 0)) {
                                // If there is no running NF, we drop all the packets of the batch.
                                if (!num_nfs) {
                                        onvm_pkt_drop_batch(pkts, rx_count);
                                } else {
                                        onvm_pkt_process_rx_batch(rx_mgr, pkts, rx_count);
!!                                      enable_flush_count = 1;  /* arm the periodic flush */
                                }
                        }
                }
!!              /* Periodically flush all NF queues so that partially filled
!!               * buffers still drain under low traffic. */
!!              if (enable_flush_count && flush_threshold++ == 1000) {
!!                  onvm_pkt_flush_all_nfs(rx_mgr, NULL);
!!                  flush_threshold = 0;
!!                  enable_flush_count = 0;
!!              }
        }
        ...
}

The flush threshold (1000 above) can be defined as a macro so that users can tune it.
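
Applied to the loop above, it would look something like this (RX_FLUSH_PERIOD is just a placeholder name, not an existing ONVM macro):

/* Hypothetical tunable (placeholder name): number of rx-loop iterations
 * between forced flushes of all NF queues. */
#define RX_FLUSH_PERIOD 1000
...
if (enable_flush_count && flush_threshold++ == RX_FLUSH_PERIOD) {
        onvm_pkt_flush_all_nfs(rx_mgr, NULL);
        flush_threshold = 0;
        enable_flush_count = 0;
}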

In my test, the CPU usage is as low as with solution 1 (my CPU is an i7-7920X).

Regarding the additional latency: I inserted gettimeofday() into the code above and raised the threshold to 100000000; the elapsed time for the counter to go from 0 to 100000000 is about 200~950 ms.

So if we adopt 1000 as the threshold, the extra delay is < 0.01 ms, which is a very small overhead but solves this problem.
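
For reference, a standalone sketch of that measurement. The loop body here does no real work, so the absolute numbers will be smaller than in the actual rx loop; the point is the scaling arithmetic from 100000000 iterations down to a threshold of 1000:

#include <stdio.h>
#include <sys/time.h>

#define ITERATIONS 100000000L

int
main(void) {
        struct timeval start, end;
        volatile long counter = 0;
        long i, elapsed_us;

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; i++)
                counter++;              /* stand-in for one pass of the rx loop */
        gettimeofday(&end, NULL);

        elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L +
                     (end.tv_usec - start.tv_usec);
        printf("%ld iterations took %ld us\n", (long)counter, elapsed_us);

        /* Scale down to the proposed threshold: extra delay per forced flush. */
        printf("threshold 1000 -> at most %.4f ms extra delay\n",
               elapsed_us / 1000.0 / (ITERATIONS / 1000));
        return 0;
}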

JackKuo-tw commented 3 years ago

@twood02 Even though the NFs (compared to the manager) use the same strategy of flushing the buffer twice, their CPU usage does not rise. I examined the batch size of each NF every 100K batches for a few seconds to figure out why. Here is a sample of batch sizes:

FWD1: 32, 24,  8, 24,  8, 24,  8, 24, 32,  8, 32
FWD2: 32, 32 ,32,  8, 32, 32, 32, 32
FWD3: 32, 32, 32, 24, 32, 32, 32, 24
FWD4: 32, 32, 32, 24, 32,  8, 32, 32
FW:   32, 24, 32, 32, 32, 32, 32, 32

We can see that the batch sizes are uneven for the NF at the beginning of the chain, and as packets travel further down the chain they get much more even.

I think the key reason is the longer processing time, which causes the buffer to accumulate more packets. This concept is like the "timespan" in the SIGCOMM paper "Microscope", which means the time between when the first packet and the last packet leave the NF.
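
For completeness, the batch-size sampling above was done roughly like this (standalone and illustrative; in practice the counter sits wherever the NF lib hands the NF a burst of packets):

#include <inttypes.h>
#include <stdio.h>

static uint64_t batch_counter;

/* Call once per batch the NF receives; print a sampled size every 100K batches. */
static void
sample_batch_size(uint16_t nb_pkts) {
        batch_counter++;
        if (batch_counter % 100000 == 0)
                printf("batch %" PRIu64 ": size %u\n", batch_counter, nb_pkts);
}

int
main(void) {
        uint64_t i;

        /* Fake 300K batches of size 32 just to show the sampling cadence. */
        for (i = 0; i < 300000; i++)
                sample_batch_size(32);
        return 0;
}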

JackKuo-tw commented 3 years ago

I think this is a necessary evil, so I will not open a PR for this code.

The modification in this issue can be tuned by anyone who needs it.