samba-in-kubernetes / samba-operator

An operator for Samba as a service on PVCs in Kubernetes
Apache License 2.0

integration test: use unique names for SmbShare resources #225

Closed phlogistonjohn closed 2 years ago

phlogistonjohn commented 2 years ago

Depends on: #224

Fixes: #222

To help reduce possible conflicts between the tests, which re-use the same YAML files over and over as test sources, it's better to have unique names for the resources we create. Unique names also tie resources back to the tests that created them, and help avoid name conflicts between test cases and across test runs.

Because we want unique names, we also have to remove some of the more "static" configuration input parameters. This is also a good cleanup: we should not assume the operator will name the server resource after the share. In the future, the name should be derived from the serverGroup status field, but we don't want to do too much in these patches.
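
The naming scheme described above can be sketched in Go. This is a minimal illustration, not the operator's actual test code; `generateTestID` and `uniqueShareName` are hypothetical names for the pattern of deriving a per-suite random suffix and appending it to each SmbShare name:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// generateTestID produces a short random suffix for one test suite.
// (Hypothetical sketch; the real suite may derive its ID differently.)
func generateTestID() string {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// uniqueShareName appends the test ID so each suite gets its own
// SmbShare resource name, avoiding collisions between test cases
// and across test runs, and making it easy to see which test
// created a given resource.
func uniqueShareName(base, testID string) string {
	return base + "-" + testID
}

func main() {
	id := generateTestID()
	fmt.Println(uniqueShareName("smbshare1", id))
}
```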

phlogistonjohn commented 2 years ago

Trying to make all the names unique was too difficult, so in order to get some improvements rather than none, I just made the SmbShare names unique. The other resources are referred to by name in the YAML files, and short of building or reusing a template system, I thought changing them would still be too much of a code change. I also like that the YAML files in the tests/files directory are usable as-is and can therefore serve as working examples outside the test cases; I didn't want to lose that property.
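
The suffix mechanism can be sketched as follows. `applySuffix` is a hypothetical analog of the suite's `createFromFilesWithSuffix()` helper; it operates on a decoded Kubernetes object, represented here as the generic map a YAML decoder would produce, leaving the YAML files on disk untouched and usable as-is:

```go
package main

import "fmt"

// applySuffix rewrites metadata.name on a decoded object so the
// resource is created under a unique, test-specific name while the
// source YAML keeps its plain example name. (Hypothetical sketch,
// not the operator's real helper.)
func applySuffix(obj map[string]interface{}, suffix string) {
	meta, ok := obj["metadata"].(map[string]interface{})
	if !ok {
		return
	}
	if name, ok := meta["name"].(string); ok {
		meta["name"] = name + "-" + suffix
	}
}

func main() {
	// Stand-in for an SmbShare decoded from tests/files.
	obj := map[string]interface{}{
		"kind":     "SmbShare",
		"metadata": map[string]interface{}{"name": "smbshare1"},
	}
	applySuffix(obj, "x7f2")
	fmt.Println(obj["metadata"].(map[string]interface{})["name"])
}
```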

phlogistonjohn commented 2 years ago

Annoyingly, the CentOS CI job is marked as failed but the actual tests passed:

/TestAvailModeUnchanged (0.11s)
        --- PASS: TestIntegration/reconciliation/scaleoutCluster (41.51s)
            --- PASS: TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite (0.08s)
PASS
ok      github.com/samba-in-kubernetes/samba-operator/tests/integration 980.713s

VS

configmap "samba-operator-controller-cfg" deleted
service "samba-operator-controller-manager-metrics-service" deleted
deployment.apps "samba-operator-controller-manager" deleted
client_loop: send disconnect: Broken pipe
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for :Building remotely : True
Logical operation result is TRUE
Running script  : # cico-node-done-from-ansible.sh
# A script that releases nodes from a SSID file written by
SSID_FILE=${SSID_FILE:-$WORKSPACE/cico-ssid}

for ssid in $(cat ${SSID_FILE})
do
    cico -q node done $ssid
done

[samba_sink-mini-k8s-1.23-clustered] $ /bin/sh -xe /tmp/jenkins2061305518328134222.sh
+ SSID_FILE=/tmp/workspace/samba_sink-mini-k8s-1.23-clustered/cico-ssid
++ cat /tmp/workspace/samba_sink-mini-k8s-1.23-clustered/cico-ssid
+ for ssid in $(cat ${SSID_FILE})
+ cico -q node done a3acc1ae

POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Setting status of 64763d62d3aeec6c0d26ffab8151862339fa2907 to FAILURE with url https://jenkins-samba.apps.ocp.ci.centos.org/job/samba_sink-mini-k8s-1.23-clustered/444/ and message: 'Build finished. '
Using context: centos-ci/sink-clustered/mini-k8s-1.23
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any
Sending email to: anoopcs@samba.org
Finished: FAILURE

^ @anoopcs9

anoopcs9 commented 2 years ago

> configmap "samba-operator-controller-cfg" deleted
> service "samba-operator-controller-manager-metrics-service" deleted
> deployment.apps "samba-operator-controller-manager" deleted
> client_loop: send disconnect: Broken pipe
> Build step 'Execute shell' marked build as failure

That's interesting. I'm seeing it for the first time, and it looks like make delete-deploy hung and never completed, because the overall job took 8h 53m :-/ to complete (as a failure, due to some ssh connection timeout or so). I'll keep an eye on it. If needed we can take the other route (similar to the rook teardown) where we directly destroy the cluster without proper cleanup of resources.

phlogistonjohn commented 2 years ago

It failed the same way again - a successful run of the golang tests but a broken pipe during cleanup. I am sure that these changes make the golang part of the tests take longer to run. @anoopcs9 do you think that perhaps the longer runtime is causing the overall test run to timeout and fail?

anoopcs9 commented 2 years ago

> It failed the same way again - a successful run of the golang tests but a broken pipe during cleanup. I am sure that these changes make the golang part of the tests take longer to run. @anoopcs9 do you think that perhaps the longer runtime is causing the overall test run to timeout and fail?

In the last 25 runs (including nightly runs) there were 2 occurrences where it failed after a successful run of the actual golang tests. The first time it completed within 980.713s, which is comparatively long but within the 20m limit. The second time it took around 1058.948s (still within the 20m limit). In both cases make delete-deploy after the test run hung for a very long time and eventually the ssh connection timed out.

Thus, even with these proposed changes for increased cleanup wait time, the golang tests are completing within 20m (so far). Let me try to reproduce the situation (including these changes) locally, and then we can take a decision.

dpulls[bot] commented 2 years ago

:tada: All dependencies have been resolved!

anoopcs9 commented 2 years ago

/test centos-ci/sink-clustered/mini-k8s-1.24

anoopcs9 commented 2 years ago

> Thus, even with these proposed changes for increased cleanup wait time, golang tests are getting completed within 20m(so far). Let me try to reproduce the situation(including these changes) locally and will take a decision.

In my first attempt, the test suite with these changes timed out after 20m on the machine where I tried running it 😕. But interestingly, once the test suite has run successfully, the hang during cleanup looks as follows:

PASS
ok      github.com/samba-in-kubernetes/samba-operator/tests/integration 936.897s
yq not found in PATH, checking /root/samba-operator/.bin
controller-gen not found in PATH, checking /root/samba-operator/.bin
/root/samba-operator/.bin/controller-gen "crd:trivialVersions=true,crdVersions=v1" rbac:roleName=manager-role webhook \
    paths="./..." output:crd:artifacts:config=config/crd/bases
YQ=/root/samba-operator/.bin/yq /root/samba-operator/hack/yq-fixup-yamls.sh  /root/samba-operator/config
kustomize not found in PATH, checking /root/samba-operator/.bin
/root/samba-operator/.bin/kustomize build config/default | kubectl delete -f -
namespace "samba-operator-system" deleted
customresourcedefinition.apiextensions.k8s.io "smbcommonconfigs.samba-operator.samba.org" deleted
customresourcedefinition.apiextensions.k8s.io "smbsecurityconfigs.samba-operator.samba.org" deleted
customresourcedefinition.apiextensions.k8s.io "smbshares.samba-operator.samba.org" deleted
role.rbac.authorization.k8s.io "samba-operator-leader-election-role" deleted
clusterrole.rbac.authorization.k8s.io "samba-operator-manager-role" deleted
clusterrole.rbac.authorization.k8s.io "samba-operator-metrics-reader" deleted
clusterrole.rbac.authorization.k8s.io "samba-operator-proxy-role" deleted
rolebinding.rbac.authorization.k8s.io "samba-operator-leader-election-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "samba-operator-manager-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "samba-operator-proxy-rolebinding" deleted
configmap "samba-operator-controller-cfg" deleted
service "samba-operator-controller-manager-metrics-service" deleted
deployment.apps "samba-operator-controller-manager" deleted

and it waits (for what?). I can confirm that our namespace, along with any other resources we create under samba-operator-system, is absent. Any out-of-the-box thoughts?

phlogistonjohn commented 2 years ago

Maybe we should use the --v=N option to get more info out of kubectl?

anoopcs9 commented 2 years ago

> Maybe we should use the --v=N option to get more info out of kubectl?

I figured out the reason for the hang during cleanup. I was wrong earlier: I could still see the smbshare resource listed by kubectl api-resources. Further investigation led me to realize that a particular resource from TestMountPathPermissions was not getting cleaned up via deleteFromFiles() because it had been created using createFromFilesWithSuffix().

diff --git a/tests/integration/path_permissions.go b/tests/integration/path_permissions.go
index d1cb8f5..66be684 100644
--- a/tests/integration/path_permissions.go
+++ b/tests/integration/path_permissions.go
@@ -79,7 +79,7 @@ func (s *MountPathPermissionsSuite) SetupSuite() {

 func (s *MountPathPermissionsSuite) TearDownSuite() {
        ctx := s.defaultContext()
-       deleteFromFiles(ctx, s.Require(), s.tc, s.smbShareSources)
+       deleteFromFilesWithSuffix(ctx, s.Require(), s.tc, s.smbShareSources, s.testID)
        deleteFromFiles(ctx, s.Require(), s.tc, s.commonSources)
 }

With the above additional change I could successfully complete the entire test procedure.
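
The failure mode can be illustrated with a toy model: a resource created under a suffixed name is invisible to a teardown that deletes the bare name, so the leftover SmbShare keeps the cleanup waiting. The `cluster` type below is a stand-in for the real cluster state, not operator code:

```go
package main

import "fmt"

// cluster is a toy stand-in for the set of resources present in the
// test cluster, keyed by name.
type cluster map[string]bool

func (c cluster) create(name string) { c[name] = true }
func (c cluster) delete(name string) { delete(c, name) }

func main() {
	c := cluster{}
	testID := "x7f2"
	// SetupSuite created the share with a suffix...
	c.create("smbshare1-" + testID)
	// ...but TearDownSuite deleted the bare name, which never
	// existed, so the suffixed SmbShare survived and kept the
	// cleanup waiting.
	c.delete("smbshare1")
	fmt.Println(len(c)) // prints 1: the suffixed resource remains
}
```

Pairing createFromFilesWithSuffix() with deleteFromFilesWithSuffix(), as in the diff above, restores the symmetry.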

anoopcs9 commented 2 years ago

Yay.. CI passed !

phlogistonjohn commented 2 years ago

:hattip: