kubevirt / kubevirtci

Contains cluster definitions and client tools to quickly spin up and destroy ephemeral and scalable k8s and ocp clusters for testing
Apache License 2.0

bug: Apply the NFS-CSI namespace first thing #1255

Closed aerosouund closed 2 months ago

aerosouund commented 2 months ago

What this PR does / why we need it:

Walking the manifests directory applies resources that depend on the namespace before the namespace itself exists, so provisioning fails. This PR ensures the namespace is the first thing created. It also allows more time for the test volume to become ready.
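Roughly, the intended apply order looks like this. This is an illustrative sketch only, not the exact diff: the file name `manifests/namespace.yaml` and the `f` / `o.client.Apply` helpers are assumptions that mirror the snippets quoted later in this thread.

    // Sketch: apply the namespace manifest first, then walk the remaining manifests.
    const namespaceManifest = "manifests/namespace.yaml" // assumed file name

    yamlData, err := f.ReadFile(namespaceManifest)
    if err != nil {
        return err
    }
    if err := o.client.Apply(yamlData); err != nil {
        return err
    }

    return fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        // The namespace was already applied above; skip it here.
        if d.IsDir() || filepath.Ext(path) != ".yaml" || path == namespaceManifest {
            return nil
        }
        data, err := f.ReadFile(path)
        if err != nil {
            return err
        }
        return o.client.Apply(data)
    })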

Checklist

This checklist is not enforced, but it is a reminder of items that could be relevant to every PR. Approvers are expected to review this list.

kubevirt-bot commented 2 months ago

Hi @aerosouund. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
dhiller commented 2 months ago

/test all

xpivarc commented 2 months ago

/retest-required

brianmcarey commented 2 months ago

/test all

aerosouund commented 2 months ago

@dhiller I highly agree that our current approach needs improvement. However, I believe the original bash approach worked because the lines were arranged properly in terms of what depends on what.

For example:

    # CRDS
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/snapshot.storage.k8s.io_volumesnapshots.yaml
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/snapshot.storage.k8s.io_volumesnapshotclasses.yaml

    # CLUSTER ROLES
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/rbac-snapshot-controller.yaml
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/setup-snapshot-controller.yaml

    # NAMESPACE
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/common.yaml

    # MORE CRDS
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/crds.yaml

    # DEPENDENT RESOURCES
    kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/operator.yaml

And this ordering was carried over to the original Go implementation:

    manifests := []string{
        "manifests/snapshot.storage.k8s.io_volumesnapshots.yaml",
        "manifests/snapshot.storage.k8s.io_volumesnapshotcontents.yaml",
        "manifests/snapshot.storage.k8s.io_volumesnapshotclasses.yaml",
        "manifests/rbac-snapshot-controller.yaml",
        "manifests/setup-snapshot-controller.yaml",
        "manifests/common.yaml",
        "manifests/crds.yaml",
        "manifests/operator.yaml",
        "manifests/cluster-test.yaml",
        "manifests/pool-test.yaml",
    }

    for _, manifest := range manifests {
        yamlData, err := f.ReadFile(manifest)
        if err != nil {
            return err
        }
        if err := o.client.Apply(yamlData); err != nil {
            return err
        }
    }

But since we changed it to this:

    err := fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        if !d.IsDir() && filepath.Ext(path) == ".yaml" {
            yamlData, err := f.ReadFile(path)
            if err != nil {
                return err
            }
            yamlDocs := bytes.Split(yamlData, []byte("---\n"))
            for _, yamlDoc := range yamlDocs {
                if len(yamlDoc) == 0 {
                    continue
                }

                obj, err := k8s.SerializeIntoObject(yamlDoc)
                if err != nil {
                    logrus.Info(err.Error())
                    continue
                }
                if err := o.client.Apply(obj); err != nil {
                    return fmt.Errorf("error applying manifest %s", err)
                }
            }
        }
        return nil
    })

We lost the ordering of the manifests. I apologize for not having noticed this myself either, since I depended on the unit test to validate the change, but the unit test uses the fake Kubernetes client, which doesn't perform this dependency check:

    var (
        k8sClient k8s.K8sDynamicClient
        opt       *cephOpt
    )

    BeforeEach(func() {
        r := k8s.NewReactorConfig("create", "cephblockpools", CephReactor)
        k8sClient = k8s.NewTestClient(r)
        opt = NewCephOpt(k8sClient)
    })

    It("should execute NfsCsiOpt successfully", func() {
        err := opt.Exec()
        Expect(err).NotTo(HaveOccurred())
    })

And so I suggest we go back to the original implementation as a potential solution.

aerosouund commented 2 months ago

> Another problem is related to updates of the manifests: if we are splitting the original yamls we are complicating things for people trying to apply updates to them, since they need to re-split them and adjust them - before they could just throw them into the folder and the update would be done. This also should work as it did before.

Happy to enhance this part as well, but can you please clarify more? The splitting doesn't happen at the level of how we store the manifests, it happens during execution of an apply call, so it's perfectly safe to group multiple manifests as one without worrying about how they are stored.

The reason why I grouped the Ceph CRDs was to avoid having to embed every single CRD file and apply it before the rest of the resources, just from a code cleanliness perspective. But it was perfectly valid to keep them as they are.

dhiller commented 2 months ago

> Another problem is related to updates of the manifests: if we are splitting the original yamls we are complicating things for people trying to apply updates to them, since they need to re-split them and adjust them - before they could just throw them into the folder and the update would be done. This also should work as it did before.
>
> Happy to enhance this part as well, but can you please clarify more? The splitting doesn't happen at the level of how we store the manifests, it happens during execution of an apply call, so it's perfectly safe to group multiple manifests as one without worrying about how they are stored.
>
> The reason why I grouped the Ceph CRDs was to avoid having to embed every single CRD file and apply it before the rest of the resources, just from a code cleanliness perspective. But it was perfectly valid to keep them as they are.

I was referring to this PR - in this you extract the namespace definition from the original yaml, which is a modification that people trying to update the target manifest need to reapply.

aerosouund commented 2 months ago

@dhiller I have further findings about this. I was curious why this issue appeared for Ceph and NFS CSI only, while it should also have appeared for Prometheus and CNAO.

As it turns out this code

    err := fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error)

will not walk the files in a random order, but rather in lexical (alphabetical) order. And so we can understand why it didn't happen for Prometheus and CNAO: the namespace file in Prometheus is named 0namespace-namespace.yaml, and for CNAO the manifests are:

    crd.yaml
    namespace.yaml
    network-addons-config-example.cr.yaml
    operator.yaml
    whereabouts.yaml

all the dependent resources have names that start with letters after n. So another solution is to name the file containing the namespace with a 0 at the beginning, or something similar. Doing this means we don't need to order the manifests explicitly, and we don't need to embed and apply single manifests before others (the solution that this PR currently implements).
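
A quick self-contained check of that ordering (not part of this PR, just illustrating that fs.WalkDir walks an embedded or in-memory filesystem in lexical order; the file names here are made up):

    package main

    import (
        "fmt"
        "io/fs"
        "testing/fstest"
    )

    func main() {
        // Hypothetical manifest names, only to demonstrate the walk order.
        fsys := fstest.MapFS{
            "manifests/operator.yaml":   {},
            "manifests/0namespace.yaml": {},
            "manifests/crds.yaml":       {},
        }
        _ = fs.WalkDir(fsys, "manifests", func(path string, d fs.DirEntry, err error) error {
            if err != nil || d.IsDir() {
                return err
            }
            fmt.Println(path)
            return nil
        })
        // Output:
        // manifests/0namespace.yaml
        // manifests/crds.yaml
        // manifests/operator.yaml
    }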

Let me know what you think

aerosouund commented 2 months ago

> I was referring to this PR - in this you extract the namespace definition from the original yaml, which is a modification that people trying to update the target manifest need to reapply.

I see. Based on the findings I shared, some form of manifest ordering is needed. Let me know which approach works best; in my opinion, renaming the namespace file to start with a 0 is the best one, since it allows us to keep all the benefits accrued from walking the directory.

dhiller commented 2 months ago

> @dhiller I have further findings about this. I was curious why this issue appeared for Ceph and NFS CSI only, while it should also have appeared for Prometheus and CNAO.
>
> As it turns out this code
>
>     err := fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error)
>
> will not walk the files in a random order, but rather in lexical (alphabetical) order.

Great to hear that it's deterministic :smile:

> And so we can understand why it didn't happen for Prometheus and CNAO: the namespace file in Prometheus is named 0namespace-namespace.yaml, and for CNAO the manifests are:
>
>     crd.yaml
>     namespace.yaml
>     network-addons-config-example.cr.yaml
>     operator.yaml
>     whereabouts.yaml
>
> all the dependent resources have names that start with letters after n. So another solution is to name the file containing the namespace with a 0 at the beginning, or something similar. Doing this means we don't need to order the manifests explicitly, and we don't need to embed and apply single manifests before others (the solution that this PR currently implements).
>
> Let me know what you think

Sounds good to me.

aerosouund commented 2 months ago

Awesome, I'll push code shortly.

aerosouund commented 2 months ago

@dhiller @brianmcarey Code updated according to our conversation. Let's run tests again please :)

dhiller commented 2 months ago

/test all

kubevirt-bot commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhiller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubevirt/kubevirtci/blob/main/OWNERS)~~ [dhiller]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
kubevirt-bot commented 2 months ago

@aerosouund: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| check-provision-k8s-1.28 | db9e4ee1a664402855117a81d25245473c2ee60d | link | true | /test check-provision-k8s-1.28 |
| check-up-kind-1.30-vgpu | fc6917c9559eb951f872238d6d553b1eee500989 | link | false | /test check-up-kind-1.30-vgpu |
| check-up-kind-1.27-vgpu | fc6917c9559eb951f872238d6d553b1eee500989 | link | false | /test check-up-kind-1.27-vgpu |
Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

dhiller commented 2 months ago

@aerosouund I see an error being logged for check-provision-manager here: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1255/check-provision-manager/1828740557837438976#1:build-log.txt%3A237

dhiller commented 2 months ago

/hold for investigation of the provision-manager check

aerosouund commented 2 months ago

@dhiller I am not sure this is an actual error; I saw this on several of my merged PRs, and when retested they passed. It was actually passing on this PR before Brian's lgtm.

dhiller commented 2 months ago

> @dhiller I am not sure this is an actual error; I saw this on several of my merged PRs, and when retested they passed. It was actually passing on this PR before Brian's lgtm.

You are right - I overlooked that this is just logging of an error which is occurring during testing.

/unhold

kubevirt-commenter-bot commented 2 months ago

/retest-required

This bot automatically retries required jobs that failed/flaked on approved PRs. Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-commenter-bot commented 2 months ago

/retest-required

This bot automatically retries required jobs that failed/flaked on approved PRs. Silence the bot with an /lgtm cancel or /hold comment for consistent failures.