Closed: rm3l closed this 4 months ago
There are a couple of issues here, from my perspective:
- It is not clear what problem we are solving here, and so the "status" of this testing is not clear either. It refers to OCP versions, which are clearly based on certain K8s versions, but the tests run against K8s (sure), so they cannot be taken as OCP compatibility tests anyway.
I would ideally run those tests directly on OCP if it was as easy as a GitHub Action, but I couldn't find a simple way to spin up a minimal OCP cluster in a GitHub runner. So this uses the OCP versions we claim support for as baselines to determine the versions of K8s we could/should test against.
- It definitely brings some useful compatibility information for a given K8s X.Y version, while the Z version can basically be "any" (no breaking changes expected). Taking into account that K8s releases every 3-4 months, making this nightly sounds like overkill to me.
The variable here is the operator code (which can change frequently), which I think we should have real E2E tests for, but against the versions of K8s that we should support (this list won't change frequently either). There is also one useful part here in this Workflow IMO: ahead-of-time testing against versions of K8s that we do not support yet will help us catch potential issues in advance.
- In my opinion, running them nightly just overloads the Actions output (and compute resources :) ) with mostly the same information.
The Actions output is separated per job, i.e., per configuration in the matrix, so it won't change that much from the existing nightly workflow; there are just new jobs from the matrix to look at.
In terms of resources, all matrix configuration jobs run in parallel, so there is no impact on the overall duration of the whole Workflow. The current nightly workflow on main takes ~19m to run, and the Workflow from this PR took almost the same amount of time.
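To make this concrete, here is a minimal sketch of what such a K8s version matrix could look like with kind. The version values, the helm/kind-action usage, and the make target are illustrative placeholders, not the exact workflow from this PR:

```yaml
# Hypothetical sketch (jobs section only): the K8s versions, action, and
# make target are illustrative placeholders, not the exact matrix from this PR.
jobs:
  e2e-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false        # let the other K8s versions finish even if one fails
      matrix:
        kind_node_image:
          # versions derived from the OCP releases we claim support for...
          - kindest/node:v1.25.16
          - kindest/node:v1.26.15
          - kindest/node:v1.27.13
          # ...plus a newer one, tested ahead of time
          - kindest/node:v1.29.4
    steps:
      - uses: actions/checkout@v4
      - name: Create a kind cluster with the matrix K8s version
        uses: helm/kind-action@v1
        with:
          node_image: ${{ matrix.kind_node_image }}
      - name: Run the E2E suite against the cluster
        run: make test-e2e    # assuming such a target exists in the repo
```

With fail-fast disabled, a regression on one K8s version does not cancel or mask the results for the others.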
So, I would say it is a useful test, but we probably need a clearer definition of it. For example: as upstream pre-release tests that make sure we support the supportable K8s versions, run as part of the pre-release procedure rather than nightly.
The goal of the nightly Workflow is to prevent a last-minute rush before a release when you notice something does not work end-to-end, by regularly running a comprehensive suite of tests against the released image. This way, based on the status of your last nightly job, you can even release any good commit at any time.
Are you concerned about having to check the status too frequently?
You can set up tests on OpenShift CI to run against OpenShift clusters. There are lots of examples of how to do this in the github.com/openshift org; specifically, look at the https://github.com/openshift/release repository.
Yeah, I know, and that would be a potential next step (if we agree on the approach here). For now, this tries to leverage the existing GH Workflow we have for testing against vanilla K8s.
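For future reference, a test entry in openshift/release generally takes roughly the following shape. This is a hedged sketch only; the test names, cluster profile, workflow, and commands are placeholders, not an actual configuration for this operator:

```yaml
# Rough, hypothetical shape of a ci-operator test entry in openshift/release;
# every name and value below is a placeholder, not a real config for this repo.
tests:
- as: e2e-ocp
  steps:
    cluster_profile: aws      # provisions a throwaway OCP cluster
    workflow: ipi-aws         # standard install/teardown workflow
    test:
    - as: run-e2e
      from: src               # image built from the operator sources
      commands: make test-e2e # assuming such a target exists
      resources:
        requests:
          cpu: 100m
```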
Like I said, the compatibility tests (K8s and/or OCP if possible) are useful.
At the same time, in the absolute majority of cases they do not bring any useful information daily, so they will very quickly be perceived as "white noise". As a related example, look at the current state of the nightly tests: they have failed in many cases for known reasons, but we do not care about that because they are not that important, since pretty much everything is covered by the other (PR) flows. On the other hand, from the outside it looks like a broken environment. That's why I really doubt the usefulness of daily tests in our context(!) at all.
So, I see such tests as part of a (conscious) [pre/post]-release process/recommendation rather than a periodic job.
I do not really understand your "last-minute rush" concern; there is a time to focus on development and a time to focus on the release (a "code freeze" state in our case). Could you please be more specific about the cases you have in mind?
Like I said, the compatibility tests (K8s and/or OCP if possible) are useful.
At the same time, in the absolute majority of cases they do not bring any useful information daily, so they will very quickly be perceived as "white noise". As a related example, look at the current state of the nightly tests: they have failed in many cases for known reasons, but we do not care about that because they are not that important, since pretty much everything is covered by the other (PR) flows. On the other hand, from the outside it looks like a broken environment. That's why I really doubt the usefulness of daily tests in our context(!) at all.
PR tests currently do not run against a real cluster to make sure that everything works end-to-end. So I am not confident that a merged PR with the current tests passing guarantees that the deployed application is actually reachable (which is what matters to the end user, right?).
Take https://github.com/janus-idp/operator/pull/390 for example. All PR tests passed, but the nightly job that ran against it allowed us to detect, the day after, an issue with the latest tag of the Showcase application (because the deployed pod ended up crashing on a real cluster).
I would rather not accumulate a lot of changes and only run a manual E2E test a few days before a release.
So, I see such tests as part of a (conscious) [pre/post]-release process/recommendation rather than a periodic job.
Is there such a process in place? A recommendation is only useful if there is a way to enforce it, and a nightly job is a cheap and automated way to enforce that. ;)
I do not really understand your "last-minute rush" concern; there is a time to focus on development and a time to focus on the release (a "code freeze" state in our case). Could you please be more specific about the cases you have in mind?
If we want to be able to release at any time, for example, nightly tests will help.
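And the "cheap and automated" part boils down to a schedule trigger. A minimal sketch follows; the cron expression and the manual-trigger fallback are illustrative, not necessarily what the repository uses:

```yaml
# Hypothetical trigger block for such a nightly workflow; the actual cron
# expression used by the repository may differ.
on:
  schedule:
    - cron: '0 2 * * *'   # once a day, at 02:00 UTC
  workflow_dispatch: {}    # plus a manual trigger for ad-hoc runs
```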
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from gazarenkov. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
I see this compatibility testing as good to have, for sure, but:
The state of the current nightly tests in general is not satisfactory. They have not been stable for quite a long time and so look broken (#399). That's why extending them with any new functionality is clearly a no-go from my side.
I already stated that they have been failing (only for the past few days) because of https://github.com/janus-idp/operator/pull/390, which was recently merged. The issue (thankfully caught quickly by the nightly checks) is being tracked in this JIRA: https://issues.redhat.com/browse/RHIDP-3157. Before that, they had been quite stable. From time to time there are some infra issues due to the runner itself, but nothing critical per se.
Anyway, let's bring this discussion to one of our WG calls, to get more opinions on how to approach this.
/hold
Looks like we are talking about different issues.
Take a look here: https://github.com/janus-idp/operator/actions?page=2&query=event%3Aschedule
You can see that the majority of the nightly builds have failed. Here's the issue I created for it: https://github.com/janus-idp/operator/issues/399, please take a look. Do you mean something else when you say it is not a problem?
PR needs rebase.
Description
To help us detect potential issues in advance, this expands the E2E configuration matrix to test several versions of K8s:
Which issue(s) does this PR fix or relate to
—
PR acceptance criteria
rhdh-operator.csv.yaml file accordingly
How to test changes / Special notes to the reviewer
Example run here: https://github.com/rm3l/janus-idp-operator/actions/runs/9809691738
The failures are not related to the changes here, but to an issue with the default janus-idp/backstage-showcase image's latest tag being outdated; the recently merged https://github.com/janus-idp/operator/pull/390 only works in 1.2+. See https://redhat-internal.slack.com/archives/C04CUSD4JSG/p1720191475569659