Closed dprotaso closed 3 months ago
+1 to removing s390x/ppc64le jobs
Sorta related: I'll bring this up with the TOC - to consider even dropping s390x/ppc support in our releases.
I don't think our releases work on those architectures anyway - given the kourier/istio envoy images don't support them (https://explore.ggcr.dev/?image=envoyproxy%2Fenvoy%3Av1.29.0)
I'm sourcing some data from the mailing lists https://groups.google.com/g/knative-users/c/ORwp3KlFbds https://groups.google.com/g/knative-dev/c/D-UkD3xPtFA
We as part of enabling OpenShift Serverless for s390x and ppc64le architectures are actively working on knative upstream releases to keep them updated. The members working actively on this are @dilipgb (for s390x) and @valen-mascarenhas14 (for ppc64le).
With respect to the istio envoy images - we leverage the maistra/envoy packages (a midstream of the istio envoy packages) for testing knative functionality. There is active work happening on maintaining maistra/envoy for the s390x and ppc64le architectures.
@rishikakedia Were you able to upstream the changes required to support the s390x/ppc64le architectures to Envoy and Istio?
I believe IBM/RH are key maintainers of Istio(not sure about Envoy)
@upodroid we pick the maistra/envoy images that are needed for knative upstream and patch the code through our ci scripts before we run the tests (refer here: https://github.com/knative/infra/blob/main/prow/jobs_config/knative/serving.yaml#L186).
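As an illustration of the kind of image patching described here (the manifest path, image names, and tags below are placeholders, not the actual CI script), swapping the upstream Envoy image for a maistra/envoy build before running tests could be as simple as a `sed` substitution over the Kourier manifest:

```shell
# Hypothetical sketch only: substitute the upstream Envoy image in a
# Kourier manifest with a midstream maistra/envoy build before e2e tests.
# The manifest path, image names, and tags are illustrative placeholders.
MANIFEST="${TMPDIR:-/tmp}/kourier-sketch.yaml"

# Stand-in for the real third_party Kourier manifest.
printf 'image: docker.io/envoyproxy/envoy:v1.29.0\n' > "$MANIFEST"

# Patch step: point the deployment at the midstream image instead.
sed -i.bak \
  's|docker.io/envoyproxy/envoy:[^"]*|quay.io/maistra/envoy:latest|g' \
  "$MANIFEST"

grep 'image:' "$MANIFEST"   # now: image: quay.io/maistra/envoy:latest
```

The real scripts linked above are more involved, but the idea is the same: rewrite the image reference in the deployed manifests before the test run starts.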
Some tests are failing intermittently for Contour and Kourier, and we are also trying to debug those issues. It will take some more time for us to figure this out. For example, on s390x today the kourier job passed but it failed yesterday; similarly, the contour job ran successfully on Monday. We need some more time to fix these issues.
Also, when the cron schedules for latest and main conflict (when a release happens), we see a lot of failures in our CI because jobs compete for the same resources to run tests. We adjust the cron schedules to fix them.
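For illustration only (the job names and schedules below are made up, not the real serving.yaml entries), the fix is to stagger the crons so the main and release periodics never start in the same window:

```yaml
# Hypothetical Prow periodics sketch: stagger schedules so jobs for main
# and the latest release never compete for the same s390x/ppc64le capacity.
periodics:
  - name: s390x-e2e-tests-main          # placeholder name
    cron: "0 2 * * *"                   # 02:00 UTC daily
  - name: s390x-e2e-tests-release       # placeholder name
    cron: "0 6 * * *"                   # 06:00 UTC daily, four hours later
```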
@upodroid Recently, we've implemented significant changes to our testing infrastructure on the ppc64le side, including migrating all Knative workloads to a different workspace within IBM Cloud. As a result, modifications were necessary, such as updating secrets, adjusting cronjob timings, and refining ppc64le-specific scripts. These changes led to a few intermittent failures during the transition period, but we have since resolved them, and the system is now functioning smoothly.
I agree with @upodroid. This work would be better utilized when done on Istio/Envoy directly, by adding proper support for the P/Z architectures there.
Doing it on Knative level is always going to be chasing a moving target...
So, there are recent discussions started on having P/Z teams enabling upstream CI to publish images.
Seems like this openshift CI - https://github.com/openshift/release/tree/master/ci-operator/step-registry/servicemesh is being used to run e2e tests.
We are enabling istio/envoy under the hood of maistra/envoy for the s390x and ppc64le architectures. There are roadmap discussions to enable an OpenSSL-based Envoy for these architectures, to be compatible with upstream.
I agree with @upodroid
From my perspective none of our releases work on ppc/s390x without these patches. So I don't really see the utility of these jobs being in our CI from an OSS perspective. There's no benefit to end-users of Knative who consume the releases we produce.
We as part of enabling OpenShift Serverless for s390x and ppc64le architectures
Would it make more sense to add these tests to the RH/IBM midstream repos rather than here?
Here is the associated PR for enabling envoyproxy/envoy to be openssl based for s390x: https://github.com/envoyproxy/envoy-openssl/pull/128
Fyi, what you need to do is get s390x/ppc64le binaries added to https://github.com/envoyproxy/envoy/releases/tag/v1.29.0
FYI @upodroid I'd love to, but, Google dropped us from their CI platform, so we can't get boring-ssl support back -- hence @rishikakedia's mention of the openssl roadmap. (She's on the s390x side of the IBM house. I'm on the ppc64le side.)
For reference: https://github.com/envoyproxy/envoy/pull/28363
Given @valen-mascarenhas14's comment -- these issues seem to have been worked out on the ppc64le side. Does that mean there are no issues on Power? I'm trying both to understand the situation and to line up all the folks with who they are and who they're referring to when they say "we." :D
I read the envoy PR and the solution is to fix it properly in BoringSSL.
It seems patches do exist but you need to upstream them and give maintainer/vendor X real IBM hardware to test against those architectures.
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6435 https://github.com/linux-on-ibm-z/docs/wiki/Building-TensorFlow
All of this stuff needs to be upstreamed
Fyi, I don't have anything against the s390x/ppc64le platforms, but I have to repeat these important best practices (which might be followed, but that's not visible to me/the public).
I wonder how much hassle we'll go through for RISC-V when it becomes a thing in the future.
@upodroid: we did an internal assessment and we believe that https://github.com/envoyproxy/envoy-openssl should be enabled for the s390x and ppc64le architectures by the first half of 2024. So I suggest we revisit this knative upstream CI issue once that's available?
An idea: instead of removing the jobs, we could disable them (and also not do any upstream releases for those platforms) and reconsider enabling them when there are official Envoy ports for those architectures. We could set a date, say 2024-08-01, and if there is no P/Z port for Envoy by then, we remove the jobs completely.
@upodroid FYI: we use prow to trigger jobs but infra for testing is provided by P/Z teams by provisioning capacity on ibm cloud.
An idea: instead of removing the jobs, we could disable them (and also not do any upstream releases for those platforms) and reconsider enabling them when there are official Envoy ports for those architectures. We could set a date, say 2024-08-01, and if there is no P/Z port for Envoy by then, we remove the jobs completely.
Yeah this sounds good
@upodroid -- Google is the maintainer of boringssl, and they removed support for power explicitly (see https://github.com/google/boringssl/commit/7d2338d000eb1468a5bbf78e91854236e18fb9e4). I asked one of the maintainers about adding our hardware back. It's not just a matter of upstreaming, or giving them hardware. It's complicated, but they let people know not to rely on it in their README:
BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.
Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it.
So we're stuck between a rock and a hard place here because third parties are depending on it.
All that said, thanks to everyone for the consideration and flexibility.
If I'm understanding correctly, what prompted this issue is that it wasn't clear if these tests were being maintained. To my understanding they are being maintained, failures/flakes are being fixed, etc. although that maintenance may not have been communicated especially well. So given that, I don't think we need to drop them as long as they're being actively maintained.
To my understanding they are being maintained, failures/flakes are being fixed, etc.
Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.
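For context on how a failing test run can still show green: in a shell script without `set -e`, the exit status is simply that of the last command executed, so a successful cleanup step after failing tests makes the whole job report success. A minimal sketch of the failure mode and one fix (this is an illustration, not the actual Knative script):

```shell
# Minimal illustration (not the real CI script) of a masked exit status:
# without `set -e`, a function/script returns the status of its *last*
# command, so cleanup after failing tests yields 0.

fail_then_cleanup() {
  false                        # stand-in for failing e2e tests
  echo "tearing down cluster"  # succeeds, so the return status is 0
}

# One fix: capture the test status and re-raise it after cleanup.
fail_then_cleanup_fixed() {
  false                        # stand-in for failing e2e tests
  status=$?
  echo "tearing down cluster"
  return "$status"             # propagate the real test result
}
```

`fail_then_cleanup` returns 0 even though the "tests" failed, while `fail_then_cleanup_fixed` returns 1. Starting scripts with `set -euo pipefail` is the usual belt-and-braces version of the same fix.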
Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.
@dprotaso I can see all the tests are running & passing for ppc64le eventing tests (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#ppc64le-e2e-tests&width=20)
To my understanding they are being maintained, failures/flakes are being fixed, etc.
Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.
@dprotaso we are actively debugging the eventing job failures on s390x. Since we send two flags (--platform=linux/s390x --insecure-registry) in KO_FLAGS, the platform is not getting recognised (you can recreate the issue and check on your end if needed). If we send only --platform=linux/s390x, the tests run fine for all branches. We have a similar setup in release-1.11, where we send both flags in KO_FLAGS and it works fine. Hence it's taking time to troubleshoot and understand why this happens in releases later than 1.11.
Since we have a self-signed certificate on our registry, we need that flag. As @psschwei rightly pointed out, it's certainly a communication gap, and we will address it in the future.
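One hypothetical way a two-flag KO_FLAGS can break while a single flag works (purely an illustration, not a confirmed diagnosis of the job above): if the variable is expanded inside double quotes, the shell hands both flags to `ko` as a single argument, so `--platform` is never recognised as its own flag. The word-splitting difference is easy to demonstrate:

```shell
# Illustration of shell word-splitting with a multi-flag variable.
# count_args stands in for any CLI such as `ko`; it just reports argc.
count_args() { echo "$#"; }

KO_FLAGS="--platform=linux/s390x --insecure-registry"

count_args ${KO_FLAGS}     # unquoted: 2 separate flags -> prints 2
count_args "${KO_FLAGS}"   # quoted: 1 bogus argument    -> prints 1
```

In bash, an array sidesteps the ambiguity entirely: `ko_flags=(--platform=linux/s390x --insecure-registry)` and then `"${ko_flags[@]}"` always expands to exactly two arguments.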
@dprotaso I have moved the KO_DOCKER_REPO to IBM Cloud from self-hosted artifactory instance. Please approve the PR, this resolves the eventing issues we were facing. https://github.com/knative/infra/pull/351.
@dprotaso,
My understanding of the current status is:
Did I miss anything?
A reasonable path forward here is to keep testing on the two additional HW architectures and allow the teams to remove such gaps and ensure that the community can use Knative as is on the additional architectures.
We can reevaluate this in six months time to see the progress made.
I've talked to the productivity folks and they're in agreement with Roland's suggestion (here). PRs are out and are written in a way to make it easy to revert in the future.
Until the necessary dependencies support ppc/s390x out of the box, we're effectively testing code that no end-user can check out and run on their cluster without custom IBM patches. As @upodroid mentioned, these should be worked on in their respective projects.
Until then if these architectures are valuable for end-users it seems like IBM should create a Knative distribution for those architectures and we can link out to them on the Knative website. We do this for other vendors and their distributions.
For continuous testing you can take a look at Red Hat as an example - they have midstream repos for Knative and run their own prow instance. Given IBM already has prow clusters it would seem pretty incremental to host your own control plane - and use the resources in this and Red Hat repos as a guide.
@dprotaso IBM (as a HW vendor in this case) is not creating its own downstream distribution; it is supporting the use of the Knative distribution on additional HW architectures, just as it does with other OSes. Therefore, the midstream Red Hat example is not a good one.
As a community, it makes no sense to reject one architecture over another, especially when we have no good reason.
In this case, the community already decided in the past to support the additional architectures, and there are community users using Knative on these architectures today. So this is not a decision that can be taken lightly for a community to stop supporting or to stop testing such architectures on new releases.
Note that we always pride ourselves that one can use different parts of Knative independently. We as a community support open APIs and allow users to use what they need out of Knative. So the argument that one dependency of one piece of the entire Knative distribution is still under work for this architecture is not a reason to stop supporting community users on this architecture by stopping the release cycle.
I have added this to be discussed in the TOC.
(The PR @dprotaso was referring to is #357)
Yes, as @davidhadas mentioned, we at IBM are maintaining the knative enablement for the s390x and ppc64le architectures. If we have issues with test cases, we will work on priority to fix them. We will open a new issue to re-enable s390x and ppc64le; we need the knative community's support. Thanks @davidhadas @psschwei for your comments.
Apparently #356 slipped in although there is no agreement on this. I have started #360 to revert it until an agreement is reached.
Summary of today's TOC call:
The periodic CI jobs for P and Z architectures stay in place in current format
Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks
In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z.
Stretch goal: introduce an e2e test setup with Istio for Serving. Under the current limitation there's no coverage. In addition to the Envoy efforts, if there's an alternative open-source proxy that can be used with Istio, a new job covering that scenario should be introduced.
I've tried to capture main points from the discussion.
@knative/technical-oversight-committee @knative/productivity-wg-leads
In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z.
This falls into the v1.16 milestone - will circle back when that sprint starts.
I assume Istio is just one option. Kourier is another. Contour a third.
We aim to ensure that community users can install Knative with a corresponding OS networking layer supported by Knative. Such that users can follow Knative documentation to get up and running with Knative.
It is nice if all networking layers are supported but not necessary.
@pleia2 can you please check if Z or Power teams will be affected, or how they can take this in-house if needed? Just in case it affects plans for IBM Secure Service Container.
@xnox, you can contact the Z or Power teams via slack knative-s390x-ppc
@xnox Thanks for the heads up, I'll check internally with my teams at IBM, but I'll also follow the lead of @davidhadas here regarding the knative-s390x-ppc channel, since there are some key folks publicly engaged there from both Power and Z (I've also just joined)
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Do we still need to monitor this? Or can we close it?
/remove-lifecycle stale
Do we still need to monitor this?
Yes we should monitor this. I haven't really seen IBM follow through on their commitments from that TOC meeting back in Feb.
Or can we close it?
I figured it's worth providing you the full 6 months as was discussed. There's still a month left or so.
+1 for revisiting in Aug as discussed
Hi all, there are efforts going on to enable envoy-openssl for the IBM Z and IBM P platforms. You can see most of the work needed for the boringssl/openssl compat library is completed. A few more packages need to be worked on, and after that we will have envoy-openssl available on our platforms as well. Here are some PRs on the same: [1] https://github.com/envoyproxy/envoy-openssl/pull/166 [2] https://github.com/envoyproxy/envoy-openssl/pull/219 [3] https://github.com/envoyproxy/envoy/pull/34483
We also have efforts underway to publish Envoy (based on OpenSSL) IBM P/Z images alongside the x86 images. https://github.com/envoyproxy/envoy-openssl/issues/221
Installation documentation for IBM s390x / IBM Power: https://github.com/knative/docs/pull/6043
Following up here - I'm inclined to remove the jobs.
The main reason is that the jobs were failing for over two weeks and it was only noticed a day ago. Clearly it's not a priority, and I don't think the Knative OSS project should be running these tests on behalf of a vendor.
@dprotaso there were intermittent infra issues in IBM Cloud; because of the noise they created, it was noticed late that CI kept failing even after the infra issues were fixed. We are actively monitoring the jobs and continue to focus on keeping CI healthy.
At the moment there are users running knative on kubernetes on s390x, and if we remove the support it impacts those users too. One of the asks from these users was for documentation, and we have updated the docs as well. We are actively working to get envoy-openssl support for P/Z, and we are almost at the point where we just have to publish images.
At the moment there are users running knative on kubernetes on s390x and if we remove the support it impacts those users too.
We're not removing support - we're removing the CI jobs
We're not removing support - we're removing the CI jobs
@dprotaso I'm not really getting how the binaries/images will be delivered in the future if CI is removed. Are we going to deliver binaries without CI? Can you propose a call, so we can have a better discussion and understand each other's POV? What do you think?
I believe originally the ppc/s390x jobs were added to test knative on different architectures with hardware supplied by IBM.
I thought this was for the benefit of the IBM Code Engine folks. Confirming with @psschwei: Code Engine doesn't use these architectures (anymore?).
The other bit: we don't really have anyone looking at the tests and fixing them https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-contour-tests
Furthermore - it's not clear if users can even run Knative on s390x with OSS - e.g. the kourier & istio envoy images are only built for arm and amd64.
I'm thinking we should just drop testing these architectures, remove the prow jobs and inform IBM that we no longer need those prow clusters.