Closed aerosouund closed 2 months ago
Hi @aerosouund. Thanks for your PR.
PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.
I understand the commands that are listed here.
/test all
/retest-required
/test all
@dhiller Highly agree that our current approach needs improvement. However, I believe the original bash approach worked because the lines were arranged according to what depends on what, e.g.:
# CRDS
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
# CLUSTER ROLES
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/rbac-snapshot-controller.yaml
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/setup-snapshot-controller.yaml
# NAMESPACE
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/common.yaml
# MORE CRDS
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/crds.yaml
# DEPENDENT RESOURCES
kubectl --kubeconfig /etc/kubernetes/admin.conf create -f /tmp/ceph/operator.yaml
And this ordering was carried over into the original Go implementation:
manifests := []string{
    "manifests/snapshot.storage.k8s.io_volumesnapshots.yaml",
    "manifests/snapshot.storage.k8s.io_volumesnapshotcontents.yaml",
    "manifests/snapshot.storage.k8s.io_volumesnapshotclasses.yaml",
    "manifests/rbac-snapshot-controller.yaml",
    "manifests/setup-snapshot-controller.yaml",
    "manifests/common.yaml",
    "manifests/crds.yaml",
    "manifests/operator.yaml",
    "manifests/cluster-test.yaml",
    "manifests/pool-test.yaml",
}
for _, manifest := range manifests {
    yamlData, err := f.ReadFile(manifest)
    if err != nil {
        return err
    }
    if err := o.client.Apply(yamlData); err != nil {
        return err
    }
}
But since we changed it to this:
err := fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error) error {
    if err != nil {
        return err
    }
    if !d.IsDir() && filepath.Ext(path) == ".yaml" {
        yamlData, err := f.ReadFile(path)
        if err != nil {
            return err
        }
        yamlDocs := bytes.Split(yamlData, []byte("---\n"))
        for _, yamlDoc := range yamlDocs {
            if len(yamlDoc) == 0 {
                continue
            }
            obj, err := k8s.SerializeIntoObject(yamlDoc)
            if err != nil {
                logrus.Info(err.Error())
                continue
            }
            if err := o.client.Apply(obj); err != nil {
                return fmt.Errorf("error applying manifest %s", err)
            }
        }
    }
    return nil
})
We lost the ordering of the manifests. I apologize for not having noticed this myself either, since I depended on the unit test to validate the change. But the unit test uses the fake Kubernetes client, which doesn't perform this dependency check:
var (
    k8sClient k8s.K8sDynamicClient
    opt       *cephOpt
)

BeforeEach(func() {
    r := k8s.NewReactorConfig("create", "cephblockpools", CephReactor)
    k8sClient = k8s.NewTestClient(r)
    opt = NewCephOpt(k8sClient)
})

It("should execute NfsCsiOpt successfully", func() {
    err := opt.Exec()
    Expect(err).NotTo(HaveOccurred())
})
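(Purely for illustration, and not the project's actual k8s.NewTestClient helper: below is a rough sketch of the kind of check the fake client could do but currently doesn't. It uses client-go's fake dynamic client directly, and all the names in it are made up.)

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/runtime/schema"
    dynamicfake "k8s.io/client-go/dynamic/fake"
    k8stesting "k8s.io/client-go/testing"
)

func main() {
    // Hypothetical setup: a fake dynamic client with a reactor that rejects
    // namespaced objects whose namespace has not been created yet.
    client := dynamicfake.NewSimpleDynamicClient(runtime.NewScheme())
    createdNamespaces := map[string]bool{}

    client.PrependReactor("create", "*", func(action k8stesting.Action) (bool, runtime.Object, error) {
        create, ok := action.(k8stesting.CreateAction)
        if !ok {
            return false, nil, nil
        }
        // Remember namespaces as they get created.
        if action.GetResource().Resource == "namespaces" {
            if obj, ok := create.GetObject().(*unstructured.Unstructured); ok {
                createdNamespaces[obj.GetName()] = true
            }
            return false, nil, nil // let the default reactor store the object
        }
        // Fail the create if the target namespace does not exist yet.
        if ns := action.GetNamespace(); ns != "" && !createdNamespaces[ns] {
            return true, nil, errors.NewNotFound(schema.GroupResource{Resource: "namespaces"}, ns)
        }
        return false, nil, nil
    })

    fmt.Println("fake client with namespace-ordering reactor configured:", client != nil)
}

With something along these lines wired into the test client, the unit test would have surfaced the ordering regression instead of passing silently.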
And so I suggest we go back to the original implementation as a potential solution.
Another problem is related to updates of the manifests: if we are splitting the original YAMLs, we are complicating things for people trying to apply updates to them, since they need to re-split and adjust them. Before, they could just drop them into the folder and the update was done. This should also work as it did before.
Happy to enhance this part as well, but can you please clarify more? The splitting doesn't happen at the level of how we store the manifests; it happens during execution of an apply call. So it's perfectly safe to group multiple manifests as one without worrying about how they are stored.
The reason I grouped the Ceph CRDs is to avoid having to embed every single CRD file and apply it before the rest of the resources, purely from a code cleanliness perspective. But it was perfectly valid to keep them as they are.
I was referring to this PR: in it you extract the namespace definition from the original YAML, which is a modification that people trying to update the target manifest need to reapply.
@dhiller I have further findings about this. I was curious why this issue appeared only for Ceph and NFS CSI, while it should have also appeared for Prometheus and CNAO.
As it turns out, this code
err := fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error)
will not visit the files in a random order, but in lexical (alphabetical) order.
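(A minimal, self-contained sketch with made-up file names, just to illustrate the lexical ordering that io/fs documents for WalkDir:)

package main

import (
    "fmt"
    "io/fs"
    "testing/fstest"
)

func main() {
    // Made-up in-memory manifests directory.
    fakeFS := fstest.MapFS{
        "manifests/0namespace.yaml": {Data: []byte("ns")},
        "manifests/common.yaml":     {Data: []byte("common")},
        "manifests/operator.yaml":   {Data: []byte("operator")},
    }
    _ = fs.WalkDir(fakeFS, "manifests", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        if !d.IsDir() {
            // Visits manifests/0namespace.yaml first, then common.yaml, then operator.yaml.
            fmt.Println(path)
        }
        return nil
    })
}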
And so we can understand why it didn't happen for Prometheus and CNAO, since the namespace file in Prometheus is named 0namespace-namespace.yaml
and for CNAO:
crd.yaml
namespace.yaml
network-addons-config-example.cr.yaml
operator.yaml
whereabouts.yaml
all the dependent resources have names that start with letters after n. So another solution is to give the file containing the namespace a name that begins with 0, or something similar. Doing this means we don't need to order the manifests, and we don't need to embed and apply single manifests before others (the solution that this PR currently implements).
Let me know what you think
I see. Regarding the extracted namespace definition: based on the findings I shared, some form of manifest ordering is needed. Let me know which approach works best. In my opinion, renaming the namespace file to start with a 0 is the best option, since it lets us keep all the benefits gained from walking the directory.
Great to hear that it's deterministic :smile:
Sounds good to me.
Awesome, I'll push code shortly.
@dhiller @brianmcarey Code updated according to our conversation. Let's run the tests again please :)
/test all
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dhiller
The full list of commands accepted by this bot can be found here.
The pull request process is described here
@aerosouund: The following tests failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
check-provision-k8s-1.28 | db9e4ee1a664402855117a81d25245473c2ee60d | link | true | /test check-provision-k8s-1.28 |
check-up-kind-1.30-vgpu | fc6917c9559eb951f872238d6d553b1eee500989 | link | false | /test check-up-kind-1.30-vgpu |
check-up-kind-1.27-vgpu | fc6917c9559eb951f872238d6d553b1eee500989 | link | false | /test check-up-kind-1.27-vgpu |
@aerosouund I see an error being logged for check-provision-manager here: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1255/check-provision-manager/1828740557837438976#1:build-log.txt%3A237
/hold for investigation of the provision-manager check
@dhiller I am not sure this is an actual error; I saw it on several of my merged PRs, and when retested it passes. It was actually passing on this PR before Brian's lgtm.
You are right, I overlooked that this is just logging of an error that occurs during testing.
/unhold
/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs. Silence the bot with an /lgtm cancel or /hold comment for consistent failures.
What this PR does / why we need it:
Walking the directory results in resources that depend on the namespace being created before the namespace itself, and provisioning fails. This change ensures the namespace is the first thing created. It also allows more time for the test volume to become ready.
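For reference, a minimal sketch of the namespace-first idea (the helper, the callback, and the file names are made up; this is not the actual diff): apply the manifest that contains the namespace before walking the rest of the embedded directory, and rely on WalkDir's lexical order for everything else.

package main

import (
    "fmt"
    "io/fs"
    "path/filepath"
    "testing/fstest"
)

// applyInOrder applies the namespace manifest first, then the remaining
// .yaml files in the lexical order that fs.WalkDir provides.
func applyInOrder(f fs.FS, nsManifest string, apply func(path string, data []byte) error) error {
    nsData, err := fs.ReadFile(f, nsManifest)
    if err != nil {
        return err
    }
    if err := apply(nsManifest, nsData); err != nil {
        return err
    }
    return fs.WalkDir(f, "manifests", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        if d.IsDir() || filepath.Ext(path) != ".yaml" || path == nsManifest {
            return nil
        }
        data, err := fs.ReadFile(f, path)
        if err != nil {
            return err
        }
        return apply(path, data)
    })
}

func main() {
    // Hypothetical embedded manifests; common.yaml is the one creating the namespace.
    fakeFS := fstest.MapFS{
        "manifests/common.yaml":   {Data: []byte("namespace")},
        "manifests/operator.yaml": {Data: []byte("operator")},
    }
    _ = applyInOrder(fakeFS, "manifests/common.yaml", func(path string, _ []byte) error {
        fmt.Println("applying", path)
        return nil
    })
}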
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.