Closed koolzz closed 5 years ago
In response to PR creation
Your results will arrive shortly
In response to PR creation
Run successful see results: [Results from nimbnode30] Median TX pps for Speed Tester: 35202395
Sorry for my late response, but I found a pretty big issue. To replicate it, set MAX_NFS to any number greater than 1, then try to run more NFs than that number. Tested with MAX_NFS = 2 and MAX_NFS = 4. When I get this error (correct):
I assume nothing else bad should happen at that point, since the error is the expected behavior. But when I try to terminate the first NF initialized (speed_tester: ./go.sh 1 -d 1 -c 16000), it does not respond.
I can explain more about the process I did later if needed.
This is indeed a weird issue, I'll check later. Any ideas why this happens?
I am looking into the code - I will keep you updated if I find something funky
Edit: I've been trying to replicate this bug again, but I'm having trouble. I followed @kevindweb's steps and made sure that the ID we are trying to kill is correct by inserting a print statement:
However, I can easily kill the first NF running, and I have confirmed that the new ID reuse function is being used. Additionally, I made sure to test with MAX_NFS set to 4 right away, instead of setting it to 2 first and then attempting to replicate.
Exact steps:
1) Checkout reuse instance ids
2) Limit MAX_NFS to 4, make
3) Run manager
4) Run 4 speed_tester NFs, getting an error code on running the 4th
5) Attempt to kill the first NF initialized (working properly, in my cases)
Agree with @dennisafa, I can't seem to reproduce the behavior. @kevindweb, are there any key steps we're missing here?
Dennis told me he was able to reproduce it the same way I explained. First, I changed MAX_NFS to 2 and started 2 speed_testers; this created an error with the first initialized speed_tester. Then I did the same thing with MAX_NFS=4. I don't know exactly why this happens though.
Did you make clean && make the NFs?
I just tested it the same way after cleaning and rebuilding /examples and /onvm again.
Not sure why I can't reproduce, I'll take a look at the lab
Strangely, this issue pops up only every once in a while. In a few cases, it occurs on the second or third NF initialized. Could this be an issue unrelated to this specific PR? I would understand if the instance_id variable for the first NF was somehow corrupted when getting the error code (because then we couldn't clean up properly), but it's valid every time I test it.
@dennisafa @kevindweb maybe you guys were onto something. I found a nasty bug regarding this PR. When running different NFs and reusing an old instance_id, they would segfault on pthread_join. After some gdb debugging and a lot of confusion, I found the reason: we never cleaned up old function pointers in the onvm_nf struct.
For example: we run speed_tester on instance ID 1. Then we run a few other NFs, stop speed_tester, and start a new simple_forward NF, which after a wrap-around gets instance ID 1. This would segfault because simple_forward would try to call the nf_setup function left over from the old speed_tester. When it did not segfault, it would still corrupt our memory badly. I'm fixing this and bundling it into a memory cleanup PR, and I'll assign you guys for review as it might be related to the issues you were seeing.
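A minimal sketch of the stale function pointer problem described above, and one way to avoid it. The struct and the names nf_slot, nf_slot_start, nf_slot_cleanup and nf_slot_has_setup are hypothetical stand-ins, since the real onvm_nf handling is not shown in this thread; only nf_setup, MAX_NFS and speed_tester come from the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NFS 4 /* small value, as used for testing in this thread */

/* Hypothetical, simplified stand-in for the onvm_nf struct. */
struct nf_slot {
        uint16_t instance_id;
        int (*nf_setup)(struct nf_slot *nf); /* optional per-NF callback */
};

struct nf_slot nfs[MAX_NFS];

/* Start an NF in a slot. Only callbacks the NF provides are written,
 * so anything left over from a previous occupant would survive. */
void
nf_slot_start(struct nf_slot *nf, uint16_t id, int (*setup)(struct nf_slot *)) {
        assert(nf != NULL);
        nf->instance_id = id;
        if (setup != NULL)
                nf->nf_setup = setup;
}

/* Clear the whole slot on shutdown, including function pointers,
 * so the next NF that reuses this instance ID starts from zero. */
void
nf_slot_cleanup(struct nf_slot *nf) {
        assert(nf != NULL);
        memset(nf, 0, sizeof(*nf));
}

/* Returns 1 if the slot currently holds a setup callback, 0 otherwise. */
int
nf_slot_has_setup(const struct nf_slot *nf) {
        return nf->nf_setup != NULL;
}

/* Example callback, standing in for what speed_tester would register. */
int
speed_tester_setup(struct nf_slot *nf) {
        (void)nf;
        return 0;
}
```

Without the nf_slot_cleanup call between two occupants of the same slot, a second NF started with no setup callback would still see the first NF's pointer, which is the reuse hazard described in the comment above.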
@onvm go for it
@onvm go for it
Your results will arrive shortly
@onvm go for it
Run successful see results: [Results from nimbnode30] Median TX pps for Speed Tester: 35204617
examples/aes_decrypt/aes.h:176: #endif line should be "#endif // _AESH" [build/header_guard] [5]
Total errors found: 1
examples/aes_encrypt/aes.h:185: #endif line should be "#endif // _AESH" [build/header_guard] [5]
Total errors found: 1
examples/flow_table/flow_table.h:63: #endif line should be "#endif // _FLOW_TABLEH" [build/header_guard] [5]
Total errors found: 1
examples/flow_table/msgbuf.h:71: #endif line should be "#endif // _MSGBUFH" [build/header_guard] [5]
Total errors found: 1
examples/flow_table/openflow.h:969: #endif line should be "#endif // _OPENFLOWH" [build/header_guard] [5]
examples/flow_table/openflow.h:50: Using deprecated casting style. Use static_cast
Instead of stopping when we reach MAX_NFS, wrap back to the initial instance ID starting value.
Summary:
Reuse instance IDs of old NFs that have terminated. I initially implemented an inline function for the while loop so we wouldn't duplicate the code, but I dropped it because I think the 2 loops with comments just look cleaner.
Usage: I tested by decreasing MAX_NFS to 4. It seems to work, but I haven't tested all the small details yet.
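A rough sketch of the two-loop wrap-around described in the summary, not the code from this PR. The names slot_in_use, next_free_instance_id, nf_start and nf_stop are hypothetical, and a starting instance ID of 1 is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NFS 4              /* small value, as used for testing */
#define STARTING_INSTANCE_ID 1 /* assumed starting value */

bool slot_in_use[MAX_NFS];
uint16_t next_instance_id = STARTING_INSTANCE_ID;

/* Returns a free instance ID, or MAX_NFS if every slot is occupied. */
uint16_t
next_free_instance_id(void) {
        uint16_t id;

        /* First pass: IDs at or above the current counter. */
        while (next_instance_id < MAX_NFS) {
                id = next_instance_id++;
                if (!slot_in_use[id])
                        return id;
        }

        /* Wrap back to the starting value instead of giving up. */
        next_instance_id = STARTING_INSTANCE_ID;

        /* Second pass: IDs freed by NFs that have terminated. */
        while (next_instance_id < MAX_NFS) {
                id = next_instance_id++;
                if (!slot_in_use[id])
                        return id;
        }

        return MAX_NFS;
}

/* Claim a slot for a new NF; returns MAX_NFS on failure. */
uint16_t
nf_start(void) {
        uint16_t id = next_free_instance_id();
        if (id < MAX_NFS)
                slot_in_use[id] = true;
        return id;
}

/* Release a slot so its ID can be handed out again. */
void
nf_stop(uint16_t id) {
        assert(id < MAX_NFS);
        slot_in_use[id] = false;
}
```

Under these assumptions, MAX_NFS = 4 leaves three usable IDs, so the fourth nf_start call fails, and stopping the NF with ID 1 makes that ID available to the next NF.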
Merging notes:
TODO before merging:
Test Plan:
Try to break this.
Review:
TBA