viniarck closed this PR 1 year ago.
Hello @viniarck. That seems very good to me. Just a question: CONSISTENCY_MIN_VERDICT_INTERVAL decreases from 60*2 to 60 instead of increasing, right? Same for STATS_INTERVAL, it went from 60 to 30.
@gretelliz, that's correct if you're comparing with the default values of of_core and flow_manager. The rationale here is to use a value that shouldn't be too low and/or too close to the default one (or a value that is also reasonable for production). We could also use the default values; however, I initially set them a bit lower, but not too low, to allow these parameters to be individually tested in new additional tests. Prior to this PR, having a lower value was also exercising more variables, but then we had other false positives too, so we're trying to find a balance.
I see. Thanks @viniarck, got it.
Thanks for your review @gretelliz.
I'll add more test results executions here in the next days as we've discussed on our weekly meeting.
Fine @viniarck, I was thinking about this. It will be great. I approved this PR, but I think what Italo commented is important. I am actually working with the CONSISTENCY_MIN_VERDICT_INTERVAL parameter for an issue of mef_eline.
Here's a summary of test executions on GitLab CI:
9/11 (81%) successful executions. Accumulated failed tests: test_031_on_switch_restart_kytos_should_recreate_flows
9/12 (75%) successful executions. Accumulated failed tests: test_031_on_switch_restart_kytos_should_recreate_flows, test_040_replace_action_flow, test_030_modify_match, test_085_create_and_remove_ten_circuit_concurrently
I also ran more executions, but they timed out in the GitLab execution queue, so I didn't include them in the results.
I reran test_031_on_switch_restart_kytos_should_recreate_flows locally 150 times; all iterations passed:
iter 100
+ for i in {1..150}
+ echo 'iter 150'
+ python3 -m pytest tests/ --timeout=300 -k test_031_on_switch_restart_kytos_should_recreate_flows
============================= test session starts ==============================
collected 191 items / 190 deselected / 1 selected
tests/test_e2e_21_flow_manager.py . [100%]
========== 1 passed, 190 deselected, 34 warnings in 92.72s (0:01:32) ===========
+ tail -f
Conclusions so far:
STATS_INTERVAL and CONSISTENCY_MIN_VERDICT_INTERVAL have helped to minimize certain consistency check false positives. Suggestions:
1) Stick with the higher CONSISTENCY_MIN_VERDICT_INTERVAL to minimize the variables that can result in false positives, especially since the success rate of test executions has improved; keep using the reconnect_switches method to stay compatible with the existing time.sleep calls. Map new test cases to exclusively exercise the consistency check, and only start trying to reduce the interval once the e2e nightly tests are rarely failing (maybe at most every 15 days?).
2) Introduce pytest-rerunfailures to retry tests when assert errors happen, while still showing in the reports that the test was retried. That way, for potential flaky tests or non-deterministic bugs, we'll have more evidence of the situation, and the team can keep observing this, collecting more logs, and eventually fixing any remaining issues. Retries on tests aren't ideal, but all things considered, I believe it'll be beneficial in terms of the effort to analyze and fix whatever needs fixing in the future.
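If we go with suggestion 2, pytest-rerunfailures can be enabled globally from the pytest config. A minimal sketch, assuming we only want to retry on assert errors (the exact flags we end up adopting may differ):

```ini
[pytest]
# Rerun a failed test once, but only when the failure is an AssertionError,
# so crashes and setup errors still fail immediately
# (--only-rerun requires pytest-rerunfailures >= 9.0).
addopts = --reruns 1 --only-rerun AssertionError
```

The rerun still shows up in the terminal summary as `1 rerun`, which keeps the flakiness visible in the reports.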
I've left seven more e2e test executions in the CI with pytest-rerunfailures; let's see how it goes. I'll post another update once I have results. If you have other suggestions, let me know, cc'ing @italovalcy @gretelliz.
Here's a follow-up on yesterday's seven CI executions: they all passed, and in total two test cases were rerun:
For the reruns, I'll now explore other formatting options, since we need the failed assertions here as well.
=========================== rerun test summary info ============================
RERUN tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_040_replace_action_flow
= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings, 1 rerun in 10203.85s (2:50:03) =
=========================== rerun test summary info ============================
RERUN tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_050_add_action_flow
= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings, 1 rerun in 10195.23s (2:49:55) =
I've run seven more CI executions; they all passed, and in total there were only 2 reruns:
Here's an example of a rerun; I've updated the conftest report hook to also report the errors when it reruns:
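For context, here is a hedged sketch of what such a conftest tweak could look like; this is not the PR's exact code, just one way to surface rerun errors. pytest-rerunfailures marks the failed attempt's report with outcome "rerun", so a pytest_runtest_logreport hook can print its traceback instead of dropping it. The `format_rerun` helper name is hypothetical, factored out so it can be exercised without running pytest.

```python
# Hedged sketch, not the actual conftest.py from this PR: surface the
# traceback of rerun attempts, which pytest-rerunfailures reports with
# outcome "rerun".

def format_rerun(report):
    """Return a printable summary for a rerun report, or None for others."""
    if getattr(report, "outcome", None) != "rerun":
        return None
    return f"RERUN {report.nodeid}\n{report.longrepr}"

def pytest_runtest_logreport(report):  # standard pytest hook name
    summary = format_rerun(report)
    if summary is not None:
        print(summary)
```

This keeps the normal summary line intact while adding the assertion details for each retried attempt.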
------------------------------- start/stop times -------------------------------
rerun: 0
tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_030_modify_match: 2022-11-15,01:13:44.375720 - 2022-11-15,01:14:09.344504
self = <tests.test_e2e_20_flow_manager.TestE2EFlowManager object at 0x7fe21d3cfb20>
def test_030_modify_match(self):
> self.modify_match()
tests/test_e2e_20_flow_manager.py:384:
flows_s1 = s1.dpctl('dump-flows')
> assert len(flows_s1.split('\r\n ')) == BASIC_FLOWS + 1, flows_s1
E AssertionError: cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
E
E assert 1 == (3 + 1)
E + where 1 = len([' cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'])
E + where [' cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'] = <built-in method split of str object at 0x7fe21c0bab90>('\r\n ')
E + where <built-in method split of str object at 0x7fe21c0bab90> = ' cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'.split
tests/test_e2e_20_flow_manager.py:380: AssertionError
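The assertion above counts flows by splitting the `dpctl dump-flows` output on `'\r\n '`, which yields one chunk per flow entry; the failure happened because only 1 flow was present when BASIC_FLOWS + 1 = 4 were expected. A minimal reproduction of the counting trick, using two fabricated flow entries rather than real switch output:

```python
# Minimal reproduction of the counting in the assertion above: dump-flows
# output separates flow entries with '\r\n ', so splitting on that separator
# yields one chunk per flow. Both entries below are made up for illustration.
sample = (
    " cookie=0xab00000000000001, table=0, priority=1000,"
    "dl_type=0x88cc actions=CONTROLLER:65535\r\n "
    "cookie=0xab00000000000002, table=0, priority=10 actions=output:1\r\n"
)

flows = sample.split("\r\n ")
print(len(flows))  # → 2, one chunk per flow entry
```

When convergence hasn't finished, fewer entries are present and the length check fails exactly as in the traceback above.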
=========================== rerun test summary info ============================
RERUN tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_030_modify_match
= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings, 1 rerun in 10179.29s (2:49:39) =
I'll go ahead and merge this one. I've just synced with Italo about the remainder of the discussion that was open.
Appreciated the reviews.
Fixes #172

- Increased CONSISTENCY_MIN_VERDICT_INTERVAL to 60 (half the value we have by default), so the consistency check will only kick in at least this interval after an OF handshake. This helps to avoid premature consistency check runs that would lead to slower convergence and false positive asserts during convergence (if we have any asserts going on during convergence).
- Decreased STATS_INTERVAL to 30 to minimize the overhead of FlowStats messages and decrease a bit of IO/CPU on OvS instances. This interval was intentionally low initially to avoid waiting too long for flows to be installed again and to trigger the consistency check. Since the consistency check is triggered after the initial handshake/flow stats reply (https://github.com/kytos-ng/flow_manager/pull/115), we can also trigger it by forcing a TCP reconnection by setting the controller on the switches again; check out the reconnect_switches() helper method that has been implemented and used on test cases where the consistency check was expected to have run at that point. In the future, I think we could also have an endpoint on flow_manager to force the consistency check to run on demand, but we probably won't need it in the short term.
- Removed the xfail from test_005_install_flow since it's been fixed.
- Removed the parametrize('execution_number', range(10)) repeated executions, since the tests marked with it haven't been flaky.
- Kept the existing time.sleep calls to not introduce new variables in the meantime, since there are other issues, like issue #173, that haven't been fixed yet. It saved 10 mins because of the parametrize repetition that was removed.

I tested this branch with this flow_manager branch on AmLight's GitLab CI, and it's passing:

= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings in 10140.69s (2:49:00) =