sdnfv / openNetVM

A high performance container-based NFV platform from GW and UCR.
http://sdnfv.github.io/onvm/
Other
263 stars 135 forks source link

Question: Correct way to update stats in NFs #192

Closed rohit-mp closed 4 years ago

rohit-mp commented 4 years ago

Hello team,

I was trying to implement a token bucket NF (code can be found here: https://github.com/rohit-mp/openNetVM/tree/token_bucket/examples/simple_forward_tb). I referred to the scaling example to use the advanced rings mode API and made a few changes.

In my current implementation, I see that txps is sometimes more than rxps for my NF which just forwards or drops packets. So could I know what the right way is to update the stats for an NF?

Thank you for your time!

twood02 commented 4 years ago

Can you clarify exactly which variables you are trying to look at when seeing this behavior?

I'm not certain, but it sounds like this might be a race condition issue -- the RX/TX counts may be being updated by the NF manager running on a different core, so if you read RX first, then read TX, it is not all one atomic operation and both counts may have changed in that time. But I don't remember right now whether the stats are updated by the manager threads or only be the NF itself. Maybe @dennisafa or @koolzz knows more.

rohit-mp commented 4 years ago

I'm looking at rx_pps and tx_pps (which I think is stats.rx and stats.tx) when observing this behavior. Screenshot from 2020-03-08 22-46-10 (Please ignore the unchanged NF_TAG. Just realized that there's a limit on the length)

I am able to update the stats from the NF, but am not sure if it is updated by default in the NF or manager.

koolzz commented 4 years ago

@rohit-mp can you please check with onvm mgr verbose mode, pps seems off to me as the tonf stats are the same.

Glancing at your code it looks like you're properly updating stats

rohit-mp commented 4 years ago

@koolzz I was trying out for different values of tb_depth (depth of my token bucket, translates to max number of packets I process in one burst dequeue operation considering availability of tokens),

tb_depth = 10 Screenshot from 2020-03-08 23-56-54

tb_depth = 100 Screenshot from 2020-03-08 23-58-01

tb_depth = 1000 Screenshot from 2020-03-09 00-00-18

I notice that the output is as expected for tb_depth=10 but for larger values tx increases by a lot. Not sure why this happens.

Sorry for asking too specific an issue of my implementation over here.

twood02 commented 4 years ago

@rohit-mp we will try to take a look at this tomorrow in our group meeting (this will be helpful for some of our new students to learn from). Can you be sure the code in your repository is something we can test to see the bug?

rohit-mp commented 4 years ago

Thanks @twood02 ! I just tested the same in another system and saw the same outputs.

WilliamMaa commented 4 years ago

@rohit-mp we believe this problem happened because we are also updating the stats automatically somewhere else. We tested it by commenting out the part where you manually update the stats in the main loop and the result appears to be normal. Can you try the same thing and see if it works out?

rohit-mp commented 4 years ago

@WilliamMaa I suspected the same too and tried commenting out line 285 and the results are:

tb_depth = 10 Screenshot from 2020-03-10 15-56-56

tb_depth = 100 Screenshot from 2020-03-10 15-56-11

I don't think 0 is a valid value in the first case since 10 packets are continuously being processed. Am I missing something?

Also, why is it being manually updated again in scaling_example if it is also being updated automatically elsewhere?

dennisafa commented 4 years ago

@WilliamMaa I suspected the same too and tried commenting out line 285 and the results are:

tb_depth = 10 Screenshot from 2020-03-10 15-56-56

tb_depth = 100 Screenshot from 2020-03-10 15-56-11

I don't think 0 is a valid value in the first case since 10 packets are continuously being processed. Am I missing something?

Also, why is it being manually updated again in scaling_example if it is also being updated automatically elsewhere?

Hi @rohit-mp . We do update the stats automatically in the tx_thread In this case, the tx stats will only update when the PACKET_READ_SIZE amount is hit. With a tb_depth of 10, you are only processing batches of 10 packets at a time. The tx_thread is responsible for updating the source NF's stats, which doesn't happen unless that size is hit. The speed_tester NF flushes its own rx queue, so it is still able to to receive and transmit packets. We'll try to make it so that the tx_thread updates stats even when the batch size is less than PACKET_READ_SIZE, in this case 32. If you use a tb_depth >= 32 you will not experience this problem. Also, we will update our scaling API - it does not seem correct to update stats ourselves. For future reference, stats should not be updated manually in advanced rings.

rohit-mp commented 4 years ago

Ahh alright. Just verified that tb_depth >= 32 works as expected. I'll set a minimum limit to tb_depth for now. Thanks for the help!

For future reference, stats should not be updated manually in advanced rings.

But I still need to update stats.act_drop and stats.tx_drop right? I can't seem to verify if not updating tx_drop will be right, but drop count remains 0 when I don't update act_drop.

dennisafa commented 4 years ago

I can verify that stats.tx_drop is working properly. By setting the meta->action value, the tx thread will update the stats properly. The reason that the drop stat isn't updating is because you are only enqueueing tx_batch_size onto the tx ring. So when the tx_thread dequeues those packets and decides what to do with them, the packets that you labeled as meta->action = ONVM_NF_ACTION_DROP are never found. To fix this, treat the "dropped" packets legitimately until the tx_thread decides to actually discard them:

for(i = tb_depth; i < nb_pkts; i++) {
         meta = onvm_get_pkt_meta((struct rte_mbuf *)pkts[i]);
         meta->action = ONVM_NF_ACTION_DROP;
         pktsTX[tx_batch_size++] = pkts[i];
}

When these packets are evaluated by the TX thread, stats will be properly updated. We plan on revamping our advanced rings API to make this a little bit more clear.

rohit-mp commented 4 years ago

Thanks for the inputs @dennisafa . I've updated my code and everything seems to be working fine now.

As an additional request, could we have more examples on using the advanced rings API? It would help a lot in understanding.

Also, the Advanced Ring Manipulation docs says "Additionally, the NF is responsible for maintaining its own statistics" which needs to be updated.

dennisafa commented 4 years ago

Thanks for the inputs @dennisafa . I've updated my code and everything seems to be working fine now.

As an additional request, could we have more examples on using the advanced rings API? It would help a lot in understanding.

Also, the Advanced Ring Manipulation docs says "Additionally, the NF is responsible for maintaining its own statistics" which needs to be updated.

Definitely, we will update our advanced rings mode and create more examples in the process. I will be opening a separate issue to reflect these needed changes shortly.