Fix flaky UT - Githubissues

mshitrit commented 5 months ago

Why we need this PR

improve pipeline stability for shorter dev cycles

Changes made

There was a race condition in the test where in some case the daemonset was created twice, it was fixed to make sure it will only happen once.

Which issue(s) this PR fixes

Test plan

openshift-ci[bot] commented 5 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

mshitrit commented 5 months ago

/test 4.15-openshift-e2e

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/medik8s/self-node-remediation/blob/main/OWNERS)~~ [clobrano,mshitrit] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

clobrano commented 4 months ago

Giving a chance to get more reviews /hold

mshitrit commented 4 months ago

createDsBeforeConfig should not be needed. It should be sufficient to define order of code execution by level of nesting of BeforeEach and JustBeforeEach

That's a nice idea, but the inner JustBeforeEach that belongs to the particular test in which I want to create the DS before the config runs after the outer JustBeforeEach which is shared with multiple tests so I'm not sure how to do it.

I don't see the value of isSkipCleanup.

In the cleanup process we verify that the delete was successful, we can't do it if no resources were created.

And directly related, what's the value of AfterEach checking the existence of the resources, before deleting them?

At that stage I know the resources were set up to be created but since the creation is async we need to verify they were created, otherwise we run into the risk of resources being created after the delete was executed.

Did you try to just add the verify delete is done part to the old DeferCleanup()? That should ensure that no ds exists when the operator update tests run...

It's possible to run the AfterEach code in DeferCleanup, in this case I preferred using an "AfterEach" block because I think it has slightly better readability, DeferCleanup creates another nesting and adds a bit of complication to the BeforeEach so I prefer to use it with a relative short cleanup code. It's definitely something I consider very minor so DeferCleanup is fine by me as well.

slintes commented 4 months ago

createDsBeforeConfig should not be needed. It should be sufficient to define order of code execution by level of nesting of BeforeEach and JustBeforeEach

That's a nice idea, but the inner JustBeforeEach that belongs to the particular test in which I want to create the DS before the config runs after the outer JustBeforeEach which is shared with multiple tests so I'm not sure how to do it.

ooohhhh.... that's a very fair point, I remembered it the wrong around. However, that made me wonder how the test ever succeeds! And revisiting the code, the problem isn't that we run into a race condition which fails the test sometimes, the problem is that it's coincidence that the test succeeds most times! Let me explain:

the outer JustBeforeEach creates the config
reconciling the config triggers ds creation
this competes with the ds creation of the inner JustBeforeEach

This can easily be demonstrated: just add a one second sleep at the end of the outer JustBeforeEach, that gives the config reconcile time to always create the ds before the inner JustBeforeEach creates the ds. The test always fails now.

Solution: make the code which is in the inner JustBeforeEach a function, and call it in the inner BeforeEach of both tests. That way the "inner" ds is always created before the one created by the "outer" config reconcile :)

In the cleanup process we verify that the delete was successful, we can't do it if no resources were created.

I don't see a need for this. Cleanup should ensure that resources don't exist after cleanup, no matter if they exist before. So you can just ignore DoesNotExist errors (don't remember the exact name). Let's keep it simple :)

At that stage I know the resources were set up to be created but since the creation is async we need to verify they were created

fair point, such verification should happen in test setup or execution though, not cleanup IMHO.

It's possible to run the AfterEach code in DeferCleanup, in this case I preferred using an "AfterEach" block because I think it has slightly better readability

I don't have a strong preference here as well. For more complex cleanup AfterEach() is probably easier to read, for just deleting a resource it's nice to have the cleanup code right after the creation code IMHO. However, my point here was that you added something valuable. And that could have been added to the existing DeferCleanUp(), in contrast to the bigger refactoring :)

mshitrit commented 4 months ago

the problem isn't that we run into a race condition which fails the test sometimes, the problem is that it's coincidence that the test succeeds most times!

Not splitting hair about the terminology, since I assume we mean the same thing here. We can probably agree that there is an issue with timing.

just add a one second sleep

I'm not a fan of adding a fixed sleep to solve timing issues (not to say I never do it, but I try to avoid doing so). I think it's harder to maintain and very dependent on system performance (which may vary) and it'll also cause a delay in all the rest of the tests.

Solution: make the code which is in the inner JustBeforeEach a function, and call it in the inner BeforeEach of both tests. That way the "inner" ds is always created before the one created by the "outer" config reconcile :)

I'm having a hard time to follow, IIUC the inner JustBeforeEach code Just verifies the ds is created, so it can't run before (i.e as part of a BeforeEach block) the outer JustBeforeEach code creates the config/ds. Maybe you can give an example with a code snippet ?

I don't see a need for this. Cleanup should ensure that resources don't exist after cleanup, no matter if they exist before. So you can just ignore DoesNotExist errors (don't remember the exact name). Let's keep it simple :)

I think there is value for checking because we verify that the test is going as expected which means we'll catch errors earlier and improve maintainability, I do agree though that there is a trade-off with readability.

Another reason is: that in case we are agnostic in the cleanup phase to whether resources were created or not, we risk the resources being created after we've already assumed we've deleted them in the cleanup.

And that could have been added to the existing DeferCleanUp()

The exiting DeferCleanUp() was located in a more nested JustBeforeEach block so it excluded some tests which are now part of the cleanup.

slintes commented 4 months ago

the problem isn't that we run into a race condition which fails the test sometimes, the problem is that it's coincidence that the test succeeds most times!

Not splitting hair about the terminology, since I assume we mean the same thing here. We can probably agree that there is an issue with timing.

sure, my point just is that the problem is more that it succeeds most times, not that it fails sometimes. The failure exposes a serious issue in the test setup

just add a one second sleep

I'm not a fan of adding a fixed sleep to solve timing issues (not to say I never do it, but I try to avoid doing so). I think it's harder to maintain and very dependent on system performance (which may vary) and it'll also cause a delay in all the rest of the tests.

as written, it can demonstrate the actual issue. I didn't want to propose adding a timeout permanently (it makes the test always fail). The proposal was to add the sleep on your machine so you can reproduce the issue always, and not just sometimes

Solution: make the code which is in the inner JustBeforeEach a function, and call it in the inner BeforeEach of both tests. That way the "inner" ds is always created before the one created by the "outer" config reconcile :)

I'm having a hard time to follow, IIUC the inner JustBeforeEach code Just verifies the ds is created, so it can't run before (i.e as part of a BeforeEach block) the outer JustBeforeEach code creates the config/ds. Maybe you can give an example with a code snippet ?

I was explaining the issue we have in the current test code, not in the PR code. When you look at the current code, my explanation of the issue should make sense, does it?

I don't see a need for this. Cleanup should ensure that resources don't exist after cleanup, no matter if they exist before. So you can just ignore DoesNotExist errors (don't remember the exact name). Let's keep it simple :)

I think there is value for checking because we verify that the test is going as expected which means we'll catch errors earlier and improve maintainability, I do agree though that there is a trade-off with readability.

sure, but verifying "that the test is going as expected" is very clearly a task for test setup ((Just)BeforeEach()) or execution (It()), not cleanup ((Just)AfterEach())!

Another reason is: that in case we are agnostic in the cleanup phase to whether resources were created or not, we risk the resources being created after we've already assumed we've deleted them in the cleanup.

On one hand: as stated above, the test setup or test execution should verify that resources we expect to be created are created already. On the other hand you have a point indeed in this specific case: if we test the config only and don't care about ds, we might miss cleanup of the ds in case the test finishes faster than the ds is created. However, IMHO a Consistently with 1s for the ds deletion would be nicer than that boolean flag + Eventually for the getting the ds. Then we don't need to care about setting that flag for each test or not.

And that could have been added to the existing DeferCleanUp()

The exiting DeferCleanUp() was located in a more nested JustBeforeEach block so it excluded some tests which are now part of the cleanup.

fair point

btw, looking at the later tests, there are more things to improve:

"SNRC defaults": move config setup and creation into a BeforeEach()
"Wrong self node remediation config CR": can be removed, we don't allow multiple configs anymore

mshitrit commented 4 months ago

as written, it can demonstrate the actual issue. I didn't want to propose adding a timeout permanently

you are correct of course, I apologize for misunderstanding :+1:

I was explaining the issue we have in the current test code, not in the PR code. When you look at the current code, my explanation of the issue should make sense, does it?

I think I get what you mean, IIUC basically the BeforeEach of every test will trigger it's own resource creation (please correct me if I got it wrong again). The downside of this approach IMO is that we don't utilize the a shared block (BeforeEach/JustBeforeEach) for several tests, so each Test needs to have a BeforeEach block of it's own, I think the more common pattern is to create the resource in a Joined JustBeforeEach block and only in case it's necessary use individual BeforeEach block to modify the resources prior to creation.

IMHO a Consistently with 1s for the ds deletion would be nicer than that boolean flag + Eventually for the getting the ds. Then we don't need to care about setting that flag for each test or not.

I think the downside of this approach that it would create false positives test results. For example let's say that we delete and use a Consistently block to verify that the ds is gone: in case the timing was off (and the delete executed before the ds was created) the test will fail. With the current approach it'll not happen and the ds will be successfully deleted.

btw, looking at the later tests, there are more things to improve:

In general agree it can be improved, ATM I prefer not to have the discussion about the best way to it in this PR.

slintes commented 4 months ago

Maybe code can better explain what I mean 🤷🏼‍♂️ https://github.com/medik8s/self-node-remediation/pull/208

mshitrit commented 4 months ago

Maybe code can better explain what I mean 🤷🏼‍♂️

208

Thanks for sharing that code, I understand better what you mean now, it's definitely an interesting approach I haven't considered.

I think though that there are couple of things to consider with this approach:

At the first block (in case we didn't get the DS in the test) it is still needs to be cleaned since not getting it in the test does not mean it wasn't scheduled to be created, by skipping the clean in case it isn't found we risk the ds being created after the cleanup phase.
Using this approach we will always wait for the maximum time (2 seconds in this case) in order to verify the ds deleted (vs the other approach were we just wait until the resource is actually deleted)
Because we always wait for the maximum time it would be an issue choosing a higher "safer" value (for example 5 or 10 seconds) for the verification. That means we need to settle for a lower wait time that might make the test less stable on resource heavy environment, comparing to the other approach where using a higher "maximum wait time" isn't an issue because it'll be applied only when necessary.
I think it's more intuitive to understand the approach of
a) I've found the resource b) I've delete the resource c) I wait to make sure it's deleted Vs a) Delete the resource b) make sure it doesn't exist c) keep doing that to make sure it's actually deleted.
I do agree though that this is my subjective point of view and others my have a different opinion which is no less valid.

slintes commented 4 months ago

At the first block (in case we didn't get the DS in the test) it is still needs to be cleaned

That's what it's doing, not?

we will always wait for the maximum time

Yes, but with the advantage of not having to care about cleanup by setting bools in the test code

make the test less stable on resource heavy environment

This is just a unit test with the fake envtest api server. We just need to be sure that the config was reconciled and it created the ds. I don't expect that we need a higher duration than 2 seconds. I would even go with 1 second only.

mshitrit commented 4 months ago

That's what it's doing, not?

Not the way I understand the code. In case the DS wasn't set in the test (i.e ds.GetName() == "") and it isn't found (for example because it wasn't created yet) we will immediately exit the Consistently block on the return statement without waiting for the 2 seconds period.

I don't expect that we need a higher duration than 2 seconds

I wish I was that optimistic as well, but from what I've seen so far (there where several PRs in the past where I had to increase wait times) there are situations where our unit tests needs more time. I've found it to be particular true for Test container build phase which seems more flaky.

Yes, but with the advantage of not having to care about cleanup by setting bools in the test code

Slowing all the tests down in order to avoid a variable isn't a good tradeoff IMO.

slintes commented 4 months ago

That's what it's doing, not?

Not the way I understand the code. In case the DS wasn't set in the test (i.e ds.GetName() == "") and it isn't found (for example because it wasn't created yet) we will immediately exit the Consistently block on the return statement without waiting for the 2 seconds period.

No, the return is the same as leaving the function at the end, it does not change the behaviour of Consistently. The function will be called again after 500ms until the 2s is reached. I think it's not possible to leave Consitently early in any way, that's the whole point of Consistently in contrast to Eventually

mshitrit commented 4 months ago

No, the return is the same as leaving the function at the end, it does not change the behaviour of Consistently

Cool, didn't know that, thanks for the explanation :+1:

mshitrit commented 4 months ago

/retest

medik8s / self-node-remediation

Fix flaky UT #204

Why we need this PR

Changes made

Which issue(s) this PR fixes

Test plan

208