medik8s / self-node-remediation

Automatic repair for unhealthy Kubernetes nodes
https://www.medik8s.io/
Apache License 2.0
45 stars 17 forks source link

Fix flaky UT #204

Closed mshitrit closed 4 months ago

mshitrit commented 5 months ago

Why we need this PR

improve pipeline stability for shorter dev cycles

Changes made

There was a race condition in the test where in some case the daemonset was created twice, it was fixed to make sure it will only happen once.

Which issue(s) this PR fixes

Test plan

openshift-ci[bot] commented 5 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

mshitrit commented 5 months ago

/test 4.15-openshift-e2e

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/medik8s/self-node-remediation/blob/main/OWNERS)~~ [clobrano,mshitrit] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
clobrano commented 4 months ago

Giving a chance to get more reviews /hold

mshitrit commented 4 months ago
  • createDsBeforeConfig should not be needed. It should be sufficient to define order of code execution by level of nesting of BeforeEach and JustBeforeEach

That's a nice idea, but the inner JustBeforeEach that belongs to the particular test in which I want to create the DS before the config runs after the outer JustBeforeEach which is shared with multiple tests so I'm not sure how to do it.

  • I don't see the value of isSkipCleanup.

In the cleanup process we verify that the delete was successful, we can't do it if no resources were created.

And directly related, what's the value of AfterEach checking the existence of the resources, before deleting them?

At that stage I know the resources were set up to be created but since the creation is async we need to verify they were created, otherwise we run into the risk of resources being created after the delete was executed.

  • Did you try to just add the verify delete is done part to the old DeferCleanup()? That should ensure that no ds exists when the operator update tests run...

It's possible to run the AfterEach code in DeferCleanup, in this case I preferred using an "AfterEach" block because I think it has slightly better readability, DeferCleanup creates another nesting and adds a bit of complication to the BeforeEach so I prefer to use it with a relative short cleanup code. It's definitely something I consider very minor so DeferCleanup is fine by me as well.

slintes commented 4 months ago

createDsBeforeConfig should not be needed. It should be sufficient to define order of code execution by level of nesting of BeforeEach and JustBeforeEach

That's a nice idea, but the inner JustBeforeEach that belongs to the particular test in which I want to create the DS before the config runs after the outer JustBeforeEach which is shared with multiple tests so I'm not sure how to do it.

ooohhhh.... that's a very fair point, I remembered it the wrong around. However, that made me wonder how the test ever succeeds! And revisiting the code, the problem isn't that we run into a race condition which fails the test sometimes, the problem is that it's coincidence that the test succeeds most times! Let me explain:

This can easily be demonstrated: just add a one second sleep at the end of the outer JustBeforeEach, that gives the config reconcile time to always create the ds before the inner JustBeforeEach creates the ds. The test always fails now.

Solution: make the code which is in the inner JustBeforeEach a function, and call it in the inner BeforeEach of both tests. That way the "inner" ds is always created before the one created by the "outer" config reconcile :)

In the cleanup process we verify that the delete was successful, we can't do it if no resources were created.

I don't see a need for this. Cleanup should ensure that resources don't exist after cleanup, no matter if they exist before. So you can just ignore DoesNotExist errors (don't remember the exact name). Let's keep it simple :)

At that stage I know the resources were set up to be created but since the creation is async we need to verify they were created

fair point, such verification should happen in test setup or execution though, not cleanup IMHO.

It's possible to run the AfterEach code in DeferCleanup, in this case I preferred using an "AfterEach" block because I think it has slightly better readability

I don't have a strong preference here as well. For more complex cleanup AfterEach() is probably easier to read, for just deleting a resource it's nice to have the cleanup code right after the creation code IMHO. However, my point here was that you added something valuable. And that could have been added to the existing DeferCleanUp(), in contrast to the bigger refactoring :)

mshitrit commented 4 months ago

the problem isn't that we run into a race condition which fails the test sometimes, the problem is that it's coincidence that the test succeeds most times!

Not splitting hair about the terminology, since I assume we mean the same thing here. We can probably agree that there is an issue with timing.

just add a one second sleep

I'm not a fan of adding a fixed sleep to solve timing issues (not to say I never do it, but I try to avoid doing so). I think it's harder to maintain and very dependent on system performance (which may vary) and it'll also cause a delay in all the rest of the tests.

Solution: make the code which is in the inner JustBeforeEach a function, and call it in the inner BeforeEach of both tests. That way the "inner" ds is always created before the one created by the "outer" config reconcile :)

I'm having a hard time to follow, IIUC the inner JustBeforeEach code Just verifies the ds is created, so it can't run before (i.e as part of a BeforeEach block) the outer JustBeforeEach code creates the config/ds. Maybe you can give an example with a code snippet ?

I don't see a need for this. Cleanup should ensure that resources don't exist after cleanup, no matter if they exist before. So you can just ignore DoesNotExist errors (don't remember the exact name). Let's keep it simple :)

I think there is value for checking because we verify that the test is going as expected which means we'll catch errors earlier and improve maintainability, I do agree though that there is a trade-off with readability.

Another reason is: that in case we are agnostic in the cleanup phase to whether resources were created or not, we risk the resources being created after we've already assumed we've deleted them in the cleanup.

And that could have been added to the existing DeferCleanUp()

The exiting DeferCleanUp() was located in a more nested JustBeforeEach block so it excluded some tests which are now part of the cleanup.

slintes commented 4 months ago

the problem isn't that we run into a race condition which fails the test sometimes, the problem is that it's coincidence that the test succeeds most times!

Not splitting hair about the terminology, since I assume we mean the same thing here. We can probably agree that there is an issue with timing.

sure, my point just is that the problem is more that it succeeds most times, not that it fails sometimes. The failure exposes a serious issue in the test setup

just add a one second sleep

I'm not a fan of adding a fixed sleep to solve timing issues (not to say I never do it, but I try to avoid doing so). I think it's harder to maintain and very dependent on system performance (which may vary) and it'll also cause a delay in all the rest of the tests.

as written, it can demonstrate the actual issue. I didn't want to propose adding a timeout permanently (it makes the test always fail). The proposal was to add the sleep on your machine so you can reproduce the issue always, and not just sometimes

Solution: make the code which is in the inner JustBeforeEach a function, and call it in the inner BeforeEach of both tests. That way the "inner" ds is always created before the one created by the "outer" config reconcile :)

I'm having a hard time to follow, IIUC the inner JustBeforeEach code Just verifies the ds is created, so it can't run before (i.e as part of a BeforeEach block) the outer JustBeforeEach code creates the config/ds. Maybe you can give an example with a code snippet ?

I was explaining the issue we have in the current test code, not in the PR code. When you look at the current code, my explanation of the issue should make sense, does it?

I don't see a need for this. Cleanup should ensure that resources don't exist after cleanup, no matter if they exist before. So you can just ignore DoesNotExist errors (don't remember the exact name). Let's keep it simple :)

I think there is value for checking because we verify that the test is going as expected which means we'll catch errors earlier and improve maintainability, I do agree though that there is a trade-off with readability.

sure, but verifying "that the test is going as expected" is very clearly a task for test setup ((Just)BeforeEach()) or execution (It()), not cleanup ((Just)AfterEach())!

Another reason is: that in case we are agnostic in the cleanup phase to whether resources were created or not, we risk the resources being created after we've already assumed we've deleted them in the cleanup.

On one hand: as stated above, the test setup or test execution should verify that resources we expect to be created are created already. On the other hand you have a point indeed in this specific case: if we test the config only and don't care about ds, we might miss cleanup of the ds in case the test finishes faster than the ds is created. However, IMHO a Consistently with 1s for the ds deletion would be nicer than that boolean flag + Eventually for the getting the ds. Then we don't need to care about setting that flag for each test or not.

And that could have been added to the existing DeferCleanUp()

The exiting DeferCleanUp() was located in a more nested JustBeforeEach block so it excluded some tests which are now part of the cleanup.

fair point

btw, looking at the later tests, there are more things to improve:

mshitrit commented 4 months ago

as written, it can demonstrate the actual issue. I didn't want to propose adding a timeout permanently

you are correct of course, I apologize for misunderstanding :+1:

I was explaining the issue we have in the current test code, not in the PR code. When you look at the current code, my explanation of the issue should make sense, does it?

I think I get what you mean, IIUC basically the BeforeEach of every test will trigger it's own resource creation (please correct me if I got it wrong again). The downside of this approach IMO is that we don't utilize the a shared block (BeforeEach/JustBeforeEach) for several tests, so each Test needs to have a BeforeEach block of it's own, I think the more common pattern is to create the resource in a Joined JustBeforeEach block and only in case it's necessary use individual BeforeEach block to modify the resources prior to creation.

IMHO a Consistently with 1s for the ds deletion would be nicer than that boolean flag + Eventually for the getting the ds. Then we don't need to care about setting that flag for each test or not.

I think the downside of this approach that it would create false positives test results. For example let's say that we delete and use a Consistently block to verify that the ds is gone: in case the timing was off (and the delete executed before the ds was created) the test will fail. With the current approach it'll not happen and the ds will be successfully deleted.

btw, looking at the later tests, there are more things to improve:

In general agree it can be improved, ATM I prefer not to have the discussion about the best way to it in this PR.

slintes commented 4 months ago

Maybe code can better explain what I mean 🤷🏼‍♂️ https://github.com/medik8s/self-node-remediation/pull/208

mshitrit commented 4 months ago

Maybe code can better explain what I mean 🤷🏼‍♂️

208

Thanks for sharing that code, I understand better what you mean now, it's definitely an interesting approach I haven't considered.

I think though that there are couple of things to consider with this approach:

slintes commented 4 months ago

At the first block (in case we didn't get the DS in the test) it is still needs to be cleaned

That's what it's doing, not?

we will always wait for the maximum time

Yes, but with the advantage of not having to care about cleanup by setting bools in the test code

make the test less stable on resource heavy environment

This is just a unit test with the fake envtest api server. We just need to be sure that the config was reconciled and it created the ds. I don't expect that we need a higher duration than 2 seconds. I would even go with 1 second only.

mshitrit commented 4 months ago

That's what it's doing, not?

Not the way I understand the code. In case the DS wasn't set in the test (i.e ds.GetName() == "") and it isn't found (for example because it wasn't created yet) we will immediately exit the Consistently block on the return statement without waiting for the 2 seconds period.

I don't expect that we need a higher duration than 2 seconds

I wish I was that optimistic as well, but from what I've seen so far (there where several PRs in the past where I had to increase wait times) there are situations where our unit tests needs more time. I've found it to be particular true for Test container build phase which seems more flaky.

Yes, but with the advantage of not having to care about cleanup by setting bools in the test code

Slowing all the tests down in order to avoid a variable isn't a good tradeoff IMO.

slintes commented 4 months ago

That's what it's doing, not?

Not the way I understand the code. In case the DS wasn't set in the test (i.e ds.GetName() == "") and it isn't found (for example because it wasn't created yet) we will immediately exit the Consistently block on the return statement without waiting for the 2 seconds period.

No, the return is the same as leaving the function at the end, it does not change the behaviour of Consistently. The function will be called again after 500ms until the 2s is reached. I think it's not possible to leave Consitently early in any way, that's the whole point of Consistently in contrast to Eventually

mshitrit commented 4 months ago

No, the return is the same as leaving the function at the end, it does not change the behaviour of Consistently

Cool, didn't know that, thanks for the explanation :+1:

mshitrit commented 4 months ago

/retest