tests/integration: Fix flaky `TestScaleoutClusterSuite`

anoopcs9 commented 2 years ago

TestScaleoutClusterSuite fails more frequently with the following error:

=== RUN   TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
    reconcile_test.go:151:
        Error Trace:    reconcile_test.go:151
        Error:          Not equal:
                        expected: 3
                        actual  : 2
        Test:           TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
        Messages:       Clustersize not as expected

Above check is to make sure that number of replicas within StatefulSet reflects the updated SmbShare.Spec.Scaling.MinClusterSize. But an immediate check on StatefulSet.Spec.Replicas might not always give us the desired(updated) value.

Therefore we retry this check within a brief 3 seconds timeout on account of any delay in field update. In addition, we at least wait for the existence of extra pods corresponding to updated replica count. Considering the increased overall test time we further raise the timeout from 20m to 30m.

anoopcs9 commented 2 years ago

Hm.. we were already at the verge of test timeout(20m) and now with this change possibility is higher as we can see with the very first run.

=== RUN   TestIntegration/reconciliation/scaleoutCluster
    reconcile_test.go:142: test ID: gtxg2j
=== RUN   TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
=== RUN   TestIntegration/resourceUpdate
=== RUN   TestIntegration/resourceUpdate/SmbShareUpdateSuite
    resource_update_test.go:78: test ID: 7v842d
=== RUN   TestIntegration/resourceUpdate/SmbShareUpdateSuite/TestEditReadOnly
    resource_update_test.go:149: checking smbclient login to share
    resource_update_test.go:186: Setting readonly=true for SmbShare samba-operator-system/tshare1-7v842d
    resource_update_test.go:227: checking smbclient write to share
    resource_update_test.go:186: Setting readonly=false for SmbShare samba-operator-system/tshare1-7v842d
panic: test timed out after 20m0s

phlogistonjohn commented 2 years ago

Hm.. we were already at the verge of test timeout(20m) and now with this change possibility is higher as we can see with the very first run.

On thursday I was saying that I noticed the non-clustered runs were taking a while when I had some networking issues the suite was timing out. I have a change to increase both by 10 mins (to 20 minutes and 30 minutes). e038c17159b785f6ddec89d8aba6aadb17f7f4a4 is clearly not yet ready for prime time, but feel free to take and adapt it for this PR if you want to.

anoopcs9 commented 2 years ago

Hm.. we were already at the verge of test timeout(20m) and now with this change possibility is higher as we can see with the very first run.

On thursday I was saying that I noticed the non-clustered runs were taking a while when I had some networking issues the suite was timing out. I have a change to increase both by 10 mins (to 20 minutes and 30 minutes). e038c17 is clearly not yet ready for prime time, but feel free to take and adapt it for this PR if you want to.

I added the change for increasing timeout for clustered test runs from 20m to 30m.

anoopcs9 commented 2 years ago

I guess the test already had that flaw though. Maybe what we really want is (pseudocode):
updateSmbShare();
ctx2 := contextWithTimeout()
poll(ctx, func() {
  l, err := StatefulSets.List(...)
  checkStatefulSet(l)
})
require.NoError(waitForPodExist(ctx, s), "smb server pod does not exist")
What do you think?

Ok, sounds reasonable.

anoopcs9 commented 2 years ago

Networking issues on infra where CentOS CI is hosted: https://lists.centos.org/pipermail/ci-users/2022-August/004605.html

Rook setup is most likely unsuccessful due to the above outage. But here is a new make check failure from golangci-lint:

internal/resources/metrics.go:14:2: could not import k8s.io/apimachinery/pkg/types (-: could not load export
data: cannot import "k8s.io/apimachinery/pkg/types" (unstable iexport format version 2, just rebuild compiler
and std library), export data is newer version - update tool) (typecheck)
    "k8s.io/apimachinery/pkg/types"
    ^

phlogistonjohn commented 2 years ago

Wow, that's quite the error. Oddly, that's not what I see when I look in the CI logs. Rather, I see:

go: downloading google.golang.org/protobuf v1.27.1
go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.1
golangci-lint installed in /home/runner/work/samba-operator/samba-operator/.bin
/home/runner/work/samba-operator/samba-operator/.bin/golangci-lint -c .golangci.yaml run ./...
! gofmt -e -s -l . | sed 's,^,formatting error: ,' | grep 'go$'
make: *** [Makefile:214: check-format] Error 1
formatting error: tests/integration/reconcile_test.go
Error: Process completed with exit code 2.

phlogistonjohn commented 2 years ago

I looked at the patch, and the CI is "right" there are formatting issues in your patch (spacing in the struct fields in the poll.Prober. I'm wondering if the other error is a local issue on your setup with that linter?

anoopcs9 commented 2 years ago

I looked at the patch, and the CI is "right" there are formatting issues in your patch (spacing in the struct fields in the poll.Prober.

Running gofmt with -d reported the following diff:

--- tests/integration/reconcile_test.go.orig    2022-08-30 12:49:55.592909058 +0530
+++ tests/integration/reconcile_test.go 2022-08-30 12:49:55.592909058 +0530
@@ -202,7 +202,7 @@
        ctx, smbShare)
    require.NoError(err)

-   ctx2, cancel := context.WithTimeout(s.defaultContext(), 3 * time.Second)
+   ctx2, cancel := context.WithTimeout(s.defaultContext(), 3*time.Second)
    defer cancel()
    s.Require().NoError(poll.TryUntil(ctx2, &poll.Prober{
        RetryInterval: time.Second,

-d, if present with older versions, can better pin point the formatting error.

I'm wondering if the other error is a local issue on your setup with that linter?
internal/resources/metrics.go:14:2: could not import k8s.io/apimachinery/pkg/types (-: could not load export
data: cannot import "k8s.io/apimachinery/pkg/types" (unstable iexport format version 2, just rebuild compiler
and std library), export data is newer version - update tool) (typecheck)
"k8s.io/apimachinery/pkg/types"
^
I could see few issues reported online and it has to do with the installed versions of Go and golanci-lint. I encountered above error on a system with following versions installed:

# go version
go version go1.18.4 linux/amd64

# .bin/golangci-lint --version
golangci-lint has version v1.43.0 built from (unknown, mod sum: "h1:SLwZFEmDgopqZpfP495zCtV9REUf551JJlJ51Ql7NZA=") on (unknown)

Our install script restrict golanci-lint at v1.43.0 and GitHub CI runs on Go 1.16. In order to get rid of this new error I did an update to latest golangci-lint.

phlogistonjohn commented 2 years ago

GitHub CI runs on Go 1.16

Oh, we should definitely update that at some point soonish.

anoopcs9 commented 2 years ago

/test centos-ci/sink-clustered/mini-k8s-1.24

anoopcs9 commented 2 years ago

LGTM. I was fine with it being a separate commit rather than a full blown seperate PR, but either way this now looks fine AFAIAC.

Ah..my bad !

samba-in-kubernetes / samba-operator

tests/integration: Fix flaky `TestScaleoutClusterSuite` #242