sdnfv / openNetVM

A high performance container-based NFV platform from GW and UCR.
http://sdnfv.github.io/onvm/
Other
263 stars 136 forks source link

Reuse NF instance IDs #92

Closed koolzz closed 5 years ago

koolzz commented 5 years ago

Instead of stopping when we reach MAX_NFS, wrap back to the initial instance ID starting value.

Summary:

Reuse instance IDs of old NFs that have terminated. I've initially implemented an inline function for the while loop so we don't use the code twice, but I revised this as I think the 2 loops with comments just look cleaner.

Usage: I tested with decreasing MAX_NFS number to 4, seems to work didn't test all the small details yet.

This PR includes
Resolves issues
Breaking API changes
Internal API changes
Usability improvements 🎉
Bug fixes
New functionality 👍
New NF/onvm_mgr args
Changes to starting NFs
Dependency updates
Web stats updates

Merging notes:

TODO before merging :

Test Plan:

Try to break this.

Review:

TBA

onvm commented 5 years ago

In response to PR creation

CI Message

Your results will arrive shortly

onvm commented 5 years ago

In response to PR creation

CI Message

Run successful see results: [Results from nimbnode30] Median TX pps for Speed Tester: 35202395

Linter Passed

kevindweb commented 5 years ago

Sorry for my late response, but I found a pretty big issue. Replicated by setting MAX_NFS to any number greater than 1. Then try to run more nfs than that number. Tested with MAX_NFS = 2 and MAX_NFS = 4. When I get this error (correct): Screen Shot 2019-04-12 at 11 32 22 AM

I assume nothing more bad to happen, it should have done that. But, when I try to terminate the first NF initialized (speed_tester: ./go.sh 1 -d 1 -c 16000), it does not respond. Screen Shot 2019-04-12 at 11 32 34 AM

Screen Shot 2019-04-12 at 11 33 23 AM I can explain more about the process I did later if needed.

koolzz commented 5 years ago

This is indeed a weird issue, I'll check later. Any ideas why this happens?

dennisafa commented 5 years ago

This is indeed a weird issue, I'll check later. Any ideas why this happens?

I am looking into the code - I will keep you updated if I find something funky

Edit: I've been trying to replicate this bug again; but I'm having trouble. I follow @kevindweb steps, and assure that the ID we are trying to kill is proper by inserting a print statement:

Screen Shot 2019-04-13 at 8 42 03 AM

However I can easily kill the first NF running, and I am assuring that the new reuse ID function is being used. Additionally, I am making sure to test with setting MAX_NFs to 4 right away, instead of setting to 2 first then attempting to replicate.

Exact steps: 1) Checkout reuse instance ids 2) Limit MAX_NFs to 4, make 3) Run manager 4) Run 4 speed_tester NF's, getting an error code on running the 4th 5) Attempting to kill first NF initialized (working properly, in my cases)

koolzz commented 5 years ago

Agree with @dennisafa I can't seem to reproduce the behavior, @kevindweb are there any key steps we're missing here?

kevindweb commented 5 years ago

Dennis told me he was able to reproduce it the same way I explained. First, I changed MAX_NFS to 2, and started 2 speed_testers, this created an error with the first initialized speed_tester. Then I did the same thing with MAX_NFS=4. I don't know exactly why this happens though.

koolzz commented 5 years ago

Did you make clean && make the NFs?

kevindweb commented 5 years ago

I just tested it the same way with cleaning and making /examples and /onvm again

koolzz commented 5 years ago

Not sure why I can't reproduce, I'll take a look at the lab

dennisafa commented 5 years ago

Dennis told me he was able to reproduce it the same way I explained. First, I changed MAX_NFS to 2, and started 2 speed_testers, this created an error with the first initialized speed_tester. Then I did the same thing with MAX_NFS=4. I don't know exactly why this happens though.

Strangely, this issue pops up only every once in a while. In a few cases, it'll occur on the second or third NF initialized. Could this be an issue unrelated to this specific PR? I would understand if the instance_id variable for the first NF was somehow corrupted when getting the error code (because then we couldn't clean up properly) but it's valid every time I'm testing it..

koolzz commented 5 years ago

@dennisafa @kevindweb maybe you guys where onto something. I found a nasty bug regarding this PR. When running different NFs and reusing the old instance_id they would segfault on pthread_join. After some gdb debugging and being very confused I have found the reason. We never cleaned up old function pointers in the onvm_nf struct. For example: We run speed tester on instance id 1. Then we run a few other NFs, stop speed tester, and start a new simple_forward NF which after a wrap around gets instance id 1. This would segfault because the simple forward would try to call the nf_setup function from the old speed_tester setup. This would not segfault, but it would corrupt our memory badly. I'm fixing this and bundling it into a memory cleanup pr and I'll assign you guys for review as it might have been related to issues you were seeing.

koolzz commented 5 years ago

@onvm go for it

onvm commented 5 years ago

@onvm go for it

CI Message

Your results will arrive shortly

onvm commented 5 years ago

@onvm go for it

CI Message

Run successful see results: [Results from nimbnode30] Median TX pps for Speed Tester: 35204617

Linter Failed

examples/aes_decrypt/aes.h:176: #endif line should be "#endif // _AESH" [build/header_guard] [5] Total errors found: 1 examples/aes_encrypt/aes.h:185: #endif line should be "#endif // _AESH" [build/header_guard] [5] Total errors found: 1 examples/flow_table/flow_table.h:63: #endif line should be "#endif // _FLOW_TABLEH" [build/header_guard] [5] Total errors found: 1 examples/flow_table/msgbuf.h:71: #endif line should be "#endif // _MSGBUFH" [build/header_guard] [5] Total errors found: 1 examples/flow_table/openflow.h:969: #endif line should be "#endif // _OPENFLOWH" [build/header_guard] [5] examples/flow_table/openflow.h:50: Using deprecated casting style. Use static_cast(...) instead [readability/casting] [4] examples/flow_table/openflow.h:569: Extra space before ( in function call [whitespace/parens] [4] examples/flow_table/openflow.h:634: Extra space before ( in function call [whitespace/parens] [4] examples/flow_table/openflow.h:771: Extra space before ( in function call [whitespace/parens] [4] examples/flow_table/openflow.h:804: Extra space before ( in function call [whitespace/parens] [4] examples/flow_table/openflow.h:865: Extra space before ( in function call [whitespace/parens] [4] examples/flow_table/openflow.h:926: Extra space before ( in function call [whitespace/parens] [4] Total errors found: 8 examples/flow_table/sdn.c:334: { should almost always be at the end of the previous line [whitespace/braces] [4] examples/flow_table/sdn.c:365: Should have a space between // and comment [whitespace/comments] [4] Total errors found: 2 examples/flow_table/sdn.h:104: #endif line should be "#endif // _SDNH" [build/header_guard] [5] Total errors found: 1 examples/flow_table/sdn_pkt_list.h:122: #endif line should be "#endif // _SDN_PKT_LISTH" [build/header_guard] [5] Total errors found: 1 examples/flow_table/setupconn.h:53: #endif line should be "#endif // _SETUPCONNH" [build/header_guard] [5] Total errors found: 1 examples/speed_tester/speed_tester.c:364: Lines should be <= 120 characters long [whitespace/line_length] [5] Total errors found: 1 onvm/onvm_mgr/main.c:109: Lines should be <= 120 characters long [whitespace/line_length] [5] Total errors found: 1 onvm/onvm_mgr/onvm_init.c:107: { should almost always be at the end of the previous line [whitespace/braces] [4] onvm/onvm_mgr/onvm_init.c:114: { should almost always be at the end of the previous line [whitespace/braces] [4] onvm/onvm_mgr/onvm_init.c:116: { should almost always be at the end of the previous line [whitespace/braces] [4] Total errors found: 3 onvm/onvm_mgr/onvm_pkt.c:68: Are you taking an address of a cast? This is dangerous: could be a temp var. Take the address before doing the cast, rather than after [runtime/casting] [4] Total errors found: 1 onvm/onvm_nflib/onvm_common.h:366: #endif line should be "#endif // _ONVM_COMMONH" [build/header_guard] [5] Total errors found: 1 onvm/onvm_nflib/onvm_config_common.h:208: #endif line should be "#endif // _ONVM_CONFIG_COMMONH" [build/header_guard] [5] Total errors found: 1 onvm/onvm_nflib/onvm_msg_common.h:61: #endif line should be "#endif // _ONVM_MSG_COMMONH" [build/header_guard] [5] Total errors found: 1 onvm/onvm_nflib/onvm_nflib.c:523: Line ends in whitespace. Consider deleting these extra spaces. [whitespace/end_of_line] [4] Total errors found: 1 onvm/onvm_nflib/onvm_pkt_common.c:98: Are you taking an address of a cast? This is dangerous: could be a temp var. Take the address before doing the cast, rather than after [runtime/casting] [4] Total errors found: 1 onvm/onvm_nflib/onvm_sc_common.h:70: #endif line should be "#endif // _ONVM_SC_COMMONH" [build/header_guard] [5] onvm/onvm_nflib/onvm_sc_common.h:70: Should have a space between // and comment [whitespace/comments] [4] Total errors found: 2 onvm/onvm_nflib/onvm_sc_mgr.h:79: #endif line should be "#endif // _ONVM_SC_MGRH" [build/header_guard] [5] Total errors found: 1