viniarck closed this PR 1 year ago.
Hello @viniarck. That seems very good to me. Just a question: CONSISTENCY_MIN_VERDICT_INTERVAL decreases from 60*2 to 60 instead of increasing, right? Same for STATS_INTERVAL, it went from 60 to 30.
@gretelliz, that's correct if you're comparing with the default values of of_core and flow_manager. The rationale here is to use a value that shouldn't be too low and/or too close to the default one (or a value that is also reasonable for production). We could also use the default values; however, I initially set them a bit lower, but not too low, to allow these parameters to be individually tested in new additional tests. Prior to this PR, having a lower value was also exercising more variables, but then we had other false positives too, so we're trying to find a balance.
I see. Thanks @viniarck, got it.
Thanks for your review @gretelliz.
I'll add more test results executions here in the next days as we've discussed on our weekly meeting.
Fine @viniarck, I was thinking about this. It will be great. I approved this PR, but I think what Italo commented is important. I am actually working with the CONSISTENCY_MIN_VERDICT_INTERVAL parameter for an issue of mef_eline.
Here's a summary of test executions on GitLab CI:
9/11 (81%) successful executions. Accumulated failed tests: test_031_on_switch_restart_kytos_should_recreate_flows
9/12 (75%) successful executions. Accumulated failed tests: test_031_on_switch_restart_kytos_should_recreate_flows, test_040_replace_action_flow, test_030_modify_match, test_085_create_and_remove_ten_circuit_concurrently
I also ran more executions, but they timed out in the GitLab execution queue, so I didn't include them in the results.
I reran test_031_on_switch_restart_kytos_should_recreate_flows locally 150 times; all iterations passed:
iter 100
+ for i in {1..150}
+ echo 'iter 150'
+ python3 -m pytest tests/ --timeout=300 -k test_031_on_switch_restart_kytos_should_recreate_flows
============================= test session starts ==============================
collected 191 items / 190 deselected / 1 selected
tests/test_e2e_21_flow_manager.py . [100%]
========== 1 passed, 190 deselected, 34 warnings in 92.72s (0:01:32) ===========
+ tail -f
Conclusions so far:
STATS_INTERVAL and CONSISTENCY_MIN_VERDICT_INTERVAL have helped to minimize certain consistency check false positives. Suggestions:
1) Stick with the higher CONSISTENCY_MIN_VERDICT_INTERVAL to minimize the variables that can result in false positives, especially since the success rate of test executions has improved; keep using the reconnect_switches method to stay compatible with the existing time.sleep calls. Map new test cases to exclusively exercise the consistency check, and only start trying to reduce the interval once the e2e nightly tests are rarely failing (maybe at most every 15 days?).
2) Introduce pytest-rerunfailures to retry tests when assert errors happen, while still showing in the reports that the test was retried. That way, for potential flaky tests or non-deterministic bugs, we'll have more evidence of the situation, and the team can keep observing this, collecting more logs, and eventually fixing any remaining issues. Retries on tests aren't ideal, but all things considered, I believe it'll be beneficial in terms of the effort to analyze and fix whatever needs fixing in the future.
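If we go with suggestion 2, pytest-rerunfailures can be enabled globally from the pytest config. A minimal sketch, assuming we only want to retry on assert errors (the exact flags we end up adopting may differ):

```ini
[pytest]
# Rerun a failed test once, but only when the failure is an AssertionError,
# so crashes and setup errors still fail immediately
# (--only-rerun requires pytest-rerunfailures >= 9.0).
addopts = --reruns 1 --only-rerun AssertionError
```

The rerun still shows up in the terminal summary as `1 rerun`, which keeps the flakiness visible in the reports.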
I've left seven more e2e test executions in the CI with pytest-rerunfailures; let's see how it goes. I'll post another update once I have results. If you have other suggestions, let me know, cc'ing @italovalcy @gretelliz.
Here's a follow-up on yesterday's seven CI executions: they all passed, and in total two test cases were rerun:
For the reruns, I'll now explore other formatting options, since we need the failed assertions here as well.
=========================== rerun test summary info ============================
RERUN tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_040_replace_action_flow
= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings, 1 rerun in 10203.85s (2:50:03) =
=========================== rerun test summary info ============================
RERUN tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_050_add_action_flow
= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings, 1 rerun in 10195.23s (2:49:55) =
I've run seven more CI executions; they all passed, and in total there were only 2 reruns:
Here's an example of a rerun; I've updated the conftest report hook to also report the errors when it reruns:
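For context, here is a hedged sketch of what such a conftest tweak could look like; this is not the PR's exact code, just one way to surface rerun errors. pytest-rerunfailures marks the failed attempt's report with outcome "rerun", so a pytest_runtest_logreport hook can print its traceback instead of dropping it. The `format_rerun` helper name is hypothetical, factored out so it can be exercised without running pytest.

```python
# Hedged sketch, not the actual conftest.py from this PR: surface the
# traceback of rerun attempts, which pytest-rerunfailures reports with
# outcome "rerun".

def format_rerun(report):
    """Return a printable summary for a rerun report, or None for others."""
    if getattr(report, "outcome", None) != "rerun":
        return None
    return f"RERUN {report.nodeid}\n{report.longrepr}"

def pytest_runtest_logreport(report):  # standard pytest hook name
    summary = format_rerun(report)
    if summary is not None:
        print(summary)
```

This keeps the normal summary line intact while adding the assertion details for each retried attempt.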
------------------------------- start/stop times -------------------------------
rerun: 0
tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_030_modify_match: 2022-11-15,01:13:44.375720 - 2022-11-15,01:14:09.344504
self = <tests.test_e2e_20_flow_manager.TestE2EFlowManager object at 0x7fe21d3cfb20>
def test_030_modify_match(self):
> self.modify_match()
tests/test_e2e_20_flow_manager.py:384:
flows_s1 = s1.dpctl('dump-flows')
> assert len(flows_s1.split('\r\n ')) == BASIC_FLOWS + 1, flows_s1
E AssertionError: cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
E
E assert 1 == (3 + 1)
E + where 1 = len([' cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'])
E + where [' cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'] = <built-in method split of str object at 0x7fe21c0bab90>('\r\n ')
E + where <built-in method split of str object at 0x7fe21c0bab90> = ' cookie=0xab00000000000001, duration=14.147s, table=0, n_packets=8, n_bytes=336, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'.split
tests/test_e2e_20_flow_manager.py:380: AssertionError
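The assertion above counts flows by splitting the `dpctl dump-flows` output on `'\r\n '`, which yields one chunk per flow entry; the failure happened because only 1 flow was present when BASIC_FLOWS + 1 = 4 were expected. A minimal reproduction of the counting trick, using two fabricated flow entries rather than real switch output:

```python
# Minimal reproduction of the counting in the assertion above: dump-flows
# output separates flow entries with '\r\n ', so splitting on that separator
# yields one chunk per flow. Both entries below are made up for illustration.
sample = (
    " cookie=0xab00000000000001, table=0, priority=1000,"
    "dl_type=0x88cc actions=CONTROLLER:65535\r\n "
    "cookie=0xab00000000000002, table=0, priority=10 actions=output:1\r\n"
)

flows = sample.split("\r\n ")
print(len(flows))  # → 2, one chunk per flow entry
```

When convergence hasn't finished, fewer entries are present and the length check fails exactly as in the traceback above.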
=========================== rerun test summary info ============================
RERUN tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_030_modify_match
= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings, 1 rerun in 10179.29s (2:49:39) =
I'll go ahead and merge this one. I've just synced with Italo about the remainder of the discussion that was open.
Appreciated the reviews.
Fixes #172

- Increased CONSISTENCY_MIN_VERDICT_INTERVAL to 60 (half the value we have by default), so the consistency check will only kick in at least this interval after an OF handshake. This helps to avoid premature consistency check runs that would lead to slower convergence and false positive asserts during convergence (if we have any asserts going on during convergence).
- Decreased STATS_INTERVAL to 30 to minimize the overhead of FlowStats messages and decrease a bit of IO/CPU on OvS instances. This interval was intentionally low initially to avoid waiting too long for flows to be installed again and to trigger the consistency check. Since the consistency check is triggered after the initial handshake/flow stats reply (https://github.com/kytos-ng/flow_manager/pull/115), we can also trigger it by forcing a TCP reconnection by setting the controller on the switches again; check out the reconnect_switches() helper method that has been implemented and used on test cases where the consistency check was expected to have run at that point. In the future, I think we could also have an endpoint on flow_manager to force the consistency check to run on demand, but we probably won't need it in the short term.
- Removed the xfail from test_005_install_flow since it's been fixed.
- Removed the parametrize('execution_number', range(10)) repeated executions, since the tests marked with it haven't been flaky.
- Kept the existing time.sleep calls to not introduce new variables in the meantime, since there are other issues, like issue #173, that haven't been fixed yet. It saved 10 mins because of the parametrize repetition that was removed.

I tested this branch with this flow_manager branch on AmLight's GitLab CI, and it's passing:

= 164 passed, 2 skipped, 18 xfailed, 7 xpassed, 758 warnings in 10140.69s (2:49:00) =