knative / infra

Home of Infra (Productivity) that hosts configs for prow and other infrastructure related things.
Apache License 2.0

Are s390x/ppc jobs still valuable? #344

Closed dprotaso closed 3 months ago

dprotaso commented 10 months ago

I believe originally the ppc/s390x jobs were added to test knative on different architectures with hardware supplied by IBM.

I thought this was for the benefit of the IBM CodeEngine folks. Confirming with @psschwei, CodeEngine doesn't use these architectures (anymore?).

The other bit is that we don't have anyone really looking at the tests and fixing them: https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-contour-tests

Furthermore - it's not clear if users can even run Knative on s390x with OSS - eg. kourier & istio envoy images are only arm and amd64.

I'm thinking we should just drop testing these architectures, remove the prow jobs and inform IBM that we no longer need those prow clusters.

upodroid commented 10 months ago

+1 to removing s390x/ppc64le jobs

dprotaso commented 10 months ago

Sorta related I'll bring this up with TOC - to even consider dropping s390x/ppc support in our releases.

I don't think our releases work on those arch's anyway - given kourier/istio envoy images don't support it (https://explore.ggcr.dev/?image=envoyproxy%2Fenvoy%3Av1.29.0)
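To make the architecture check concrete, here is a minimal Python sketch that lists the platforms advertised by an OCI image index and reports which wanted ones are missing. The `SAMPLE_INDEX` data, its digests, and the helper names are assumptions made up for illustration; they are not copied from a real registry response.

```python
import json

# Illustrative OCI image index. The digests and the platform list are
# invented for this sketch, not taken from any real registry.
SAMPLE_INDEX = json.loads("""
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
    {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}}
  ]
}
""")


def supported_platforms(index: dict) -> set[str]:
    """Return the 'os/arch' pairs advertised by an image index."""
    platforms = set()
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        os_name = platform.get("os")
        arch = platform.get("architecture")
        if os_name and arch:
            platforms.add(f"{os_name}/{arch}")
    return platforms


def missing_platforms(index: dict, wanted: list[str]) -> list[str]:
    """Return the wanted platforms that the index does not advertise."""
    available = supported_platforms(index)
    return [p for p in wanted if p not in available]


if __name__ == "__main__":
    print(sorted(supported_platforms(SAMPLE_INDEX)))
    print(missing_platforms(SAMPLE_INDEX, ["linux/amd64", "linux/s390x", "linux/ppc64le"]))
```

Feeding the function a real index fetched from a registry would answer the question for a specific image tag; the sample above only mirrors the situation described in this comment.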

I'm sourcing some data from the mailing lists https://groups.google.com/g/knative-users/c/ORwp3KlFbds https://groups.google.com/g/knative-dev/c/D-UkD3xPtFA

rishikakedia commented 10 months ago

We as part of enabling OpenShift Serverless for s390x and ppc64le architectures are actively working on knative upstream release to keep them updated. The members actively working on this are @dilipgb (for s390x) and @valen-mascarenhas14 (for ppc64le).

With respect to istio envoy images - we leverage maistra/envoy packages (midstream of istio envoy packages) for testing knative functionalities. There is active work happening on maintaining maistra/envoy for s390x and ppc64le architectures.

upodroid commented 10 months ago

@rishikakedia Were you able to upstream the changes required to support s390x/ppc64le architectures to Envoy and Istio?

I believe IBM/RH are key maintainers of Istio (not sure about Envoy).

dilipgb commented 10 months ago

@upodroid we pick the maistra/envoy images that are needed for knative upstream and patch the code through our ci scripts before we run the tests (refer here: https://github.com/knative/infra/blob/main/prow/jobs_config/knative/serving.yaml#L186).

Some tests are failing intermittently for contour and Kourier, and we are trying to debug those issues. It will take us some more time to figure this out. For example, on s390x the kourier job passed today but failed yesterday; similarly, contour ran successfully on Monday. We need some more time to fix these issues.

Also, when the cron schedules for latest and main conflict (when a release happens), we see a lot of failures in our CI because the jobs compete for the same resources to run tests. We may adjust the cron schedule to fix this.
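The schedule conflict described here can be checked mechanically. The following is a minimal Python sketch, assuming simple daily `M H * * *` cron expressions and a known approximate runtime per job. The job names, schedules, and runtimes are hypothetical and are not taken from the actual prow config.

```python
from itertools import combinations

# Hypothetical jobs: (name, daily cron expression, expected runtime in minutes).
JOBS = [
    ("s390x-serving-main", "0 8 * * *", 120),
    ("s390x-serving-latest-release", "30 9 * * *", 120),
    ("s390x-eventing-main", "0 13 * * *", 60),
]


def daily_start_minute(cron: str) -> int:
    """Convert a simple 'M H * * *' cron expression to minutes after midnight."""
    minute, hour, *rest = cron.split()
    if rest != ["*", "*", "*"]:
        raise ValueError(f"only daily 'M H * * *' schedules are handled: {cron!r}")
    return int(hour) * 60 + int(minute)


def overlapping_jobs(jobs):
    """Return pairs of job names whose run windows overlap on the same day.

    Windows that wrap past midnight are not handled in this sketch.
    """
    windows = []
    for name, cron, runtime in jobs:
        start = daily_start_minute(cron)
        windows.append((name, start, start + runtime))
    conflicts = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(windows, 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    for first, second in overlapping_jobs(JOBS):
        print(f"{first} overlaps with {second}")
```

Running a check like this whenever a release branch job is added would flag contention before it shows up as CI failures.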

valen-mascarenhas14 commented 9 months ago

@upodroid Recently, we've implemented significant changes to our testing infrastructure on the ppc64le side. This included migrating all Knative workloads to a different workspace within IBM Cloud. As a result, modifications were necessary, such as updating secrets, adjusting cronjob timings, and refining ppc64le-specific scripts. These changes led to a few failures during the transition period. However we have successfully addressed these issues, and the system is now functioning smoothly. Although we encountered some intermittent failures during the transition, we have diligently resolved them, ensuring that the platform is now performing as expected.

cardil commented 9 months ago

I agree with @upodroid. This work would be better utilized when done on Istio/Envoy directly, by adding proper support for P/Z architecture there.

Doing it on Knative level is always going to be chasing a moving target...

rishikakedia commented 9 months ago

So, discussions have recently started on having the P/Z teams enable upstream CI to publish images.

ghatwala commented 9 months ago

Seems like this openshift CI - https://github.com/openshift/release/tree/master/ci-operator/step-registry/servicemesh is being used to run e2e tests.

rishikakedia commented 9 months ago

We are enabling istio/envoy under the hood of maistra/envoy for s390x and ppc64le architectures. There is a roadmap discussion about enabling envoy based on openssl for these architectures so that it is compatible with upstream.

dprotaso commented 9 months ago

I agree with @upodroid

From my perspective none of our releases work on ppc/s390x without these patches. So I don't really see the utility of these jobs being in our CI from an OSS perspective. There's no benefit to end-users of Knative who consume the releases we produce.

We as part of enabling OpenShift Serverless for s390x and ppc64le architectures

Would it make more sense to add these tests to the RH/IBM midstream repos rather than here?

rishikakedia commented 9 months ago

Here is the associated PR for enabling envoyproxy/envoy to be openssl based for s390x: https://github.com/envoyproxy/envoy-openssl/pull/128

upodroid commented 9 months ago

Fyi, what you need to do is get s390x/ppc64le binaries added to https://github.com/envoyproxy/envoy/releases/tag/v1.29.0

clnperez commented 9 months ago

FYI @upodroid I'd love to, but, Google dropped us from their CI platform, so we can't get boring-ssl support back -- hence @rishikakedia's mention of the openssl roadmap. (She's on the s390x side of the IBM house. I'm on the ppc64le side.)

For reference: https://github.com/envoyproxy/envoy/pull/28363

Given @valen-mascarenhas14's comment -- these issues seem to be worked on the ppc64le side. Does that mean there are no issues on Power? I'm trying both to understand the situation and to line up all the folks with who they are and who they're referring to when they say "we." :D

upodroid commented 9 months ago

I read the envoy PR and the solution is to fix it properly in BoringSSL.

It seems patches do exist but you need to upstream them and give maintainer/vendor X real IBM hardware to test against those architectures.

https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6435 https://github.com/linux-on-ibm-z/docs/wiki/Building-TensorFlow

All of this stuff needs to be upstreamed

upodroid commented 9 months ago

Fyi, I don't have anything against the s390x/ppc64le platforms but I have to repeat these important best practices (which might be done but not visible to me/public).

I wonder how much hassle we'll go through with RISC-V when it becomes a thing in the future.

rishikakedia commented 9 months ago

@upodroid: we did an internal assessment and we believe that https://github.com/envoyproxy/envoy-openssl should be enabled for s390x and ppc64le architectures by the first half of 2024. So I suggest we discuss this issue of knative upstream CI after that is available?

rhuss commented 9 months ago

An idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official envoy ports for those archs? We can set a date, let's say 2024-08-01, and if there is no P/Z port for envoy by then, we can remove the jobs completely.

rishikakedia commented 9 months ago

@upodroid FYI: we use prow to trigger jobs but infra for testing is provided by P/Z teams by provisioning capacity on ibm cloud.

dprotaso commented 9 months ago

An idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official envoy ports for those archs? We can set a date, let's say 2024-08-01, and if there is no P/Z port for envoy by then, we can remove the jobs completely.

Yeah this sounds good

clnperez commented 9 months ago

@upodroid -- Google is the maintainer of boringssl, and they removed support for power explicitly (see https://github.com/google/boringssl/commit/7d2338d000eb1468a5bbf78e91854236e18fb9e4). I asked one of the maintainers about adding our hardware back. It's not just a matter of upstreaming, or giving them hardware. It's complicated, but they let people know not to rely on it in their README:

BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it. 

So we're stuck between a rock and a hard place here because third parties are depending on it.

All that said, thanks to everyone for the consideration and flexibility.

psschwei commented 9 months ago

If I'm understanding correctly, what prompted this issue is that it wasn't clear if these tests were being maintained. To my understanding they are being maintained, failures/flakes are being fixed, etc. although that maintenance may not have been communicated especially well. So given that, I don't think we need to drop them as long as they're being actively maintained.

dprotaso commented 9 months ago

To my understanding they are being maintained, failures/flakes are being fixed, etc.

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.
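One way to guard against a green run in which no tests actually executed is to decide the result from the log as well as from the exit code. The following is a minimal Python sketch, assuming the log contains `go test -v` style result lines such as `--- PASS: TestName (0.01s)`. The `verdict` function and its thresholds are hypothetical and are not part of the existing CI scripts.

```python
import re

# Matches verbose Go test result lines, including indented subtests.
RESULT_LINE = re.compile(r"^\s*--- (PASS|FAIL|SKIP): (\S+)", re.MULTILINE)


def verdict(exit_code: int, log_text: str, min_tests: int = 1) -> str:
    """Classify a run as 'pass', 'fail', or 'suspicious'.

    A run is 'suspicious' when the script exited with 0 but the log shows
    fewer passing tests than expected, which is what an overridden exit
    value would look like from the outside.
    """
    results = RESULT_LINE.findall(log_text)
    passed = sum(1 for status, _ in results if status == "PASS")
    failed = sum(1 for status, _ in results if status == "FAIL")
    if exit_code != 0 or failed:
        return "fail"
    if passed < min_tests:
        return "suspicious"
    return "pass"


if __name__ == "__main__":
    print(verdict(0, ""))
    print(verdict(0, "=== RUN   TestA\n--- PASS: TestA (0.01s)\n"))
```

Treating "suspicious" as a failure in the job wrapper would turn a silently skipped test run into a visible red result.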

valen-mascarenhas14 commented 9 months ago

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.

@dprotaso I can see all the tests are running & passing for ppc64le eventing tests (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#ppc64le-e2e-tests&width=20)

dilipgb commented 9 months ago

To my understanding they are being maintained, failures/flakes are being fixed, etc.

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.

@dprotaso we are actively debugging the eventing job failures on s390x. Since we are sending 2 flags (--platform=linux/s390x --insecure_registry) in KO_FLAGS, the platform is not getting recognised (I hope you can recreate the issue and check on your end if needed). If we send only --platform=linux/s390x, the tests run fine for all branches. We have a similar setup in release-1.11, where we send both flags in KO_FLAGS and it's working fine. Hence it's taking time to troubleshoot and understand why this is happening in releases later than 1.11.

Since we have a self-signed certificate on our registry, we need the flag to exist. As @psschwei rightly pointed out, it's certainly a communication gap and we will address it in future.
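The thread does not confirm the root cause, but one general way a multi-flag variable can stop a flag from being recognised is argument splitting. The following Python sketch only illustrates that mechanism, using the two flags quoted above with the platform flag spelled `--platform`. The `parse_flags` helper is a toy written for this example; it is not ko's parser and makes no claim about how the CI scripts pass KO_FLAGS.

```python
import shlex


def parse_flags(argv):
    """Parse '--name=value' and bare '--name' arguments into a dict."""
    flags = {}
    for arg in argv:
        if not arg.startswith("--"):
            continue
        name, sep, value = arg[2:].partition("=")
        flags[name] = value if sep else True
    return flags


ko_flags = "--platform=linux/s390x --insecure_registry"

# Case 1: the string is split on whitespace before being passed on,
# as an unquoted $KO_FLAGS expansion would do in a shell.
split_args = shlex.split(ko_flags)

# Case 2: the whole string arrives as a single argument,
# as a quoted "$KO_FLAGS" expansion would do.
single_arg = [ko_flags]

if __name__ == "__main__":
    print(parse_flags(split_args))
    print(parse_flags(single_arg))
```

In the second case the platform value carries the second flag as trailing text, so a tool comparing it against known platforms would not recognise it. Whether that is what happened here is an open question.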

dilipgb commented 9 months ago

@dprotaso I have moved the KO_DOCKER_REPO from the self-hosted Artifactory instance to IBM Cloud. Please approve the PR; this resolves the eventing issues we were facing. https://github.com/knative/infra/pull/351.

davidhadas commented 9 months ago

@dprotaso,

My understanding of the current status is:

  1. There are teams working on the 2 additional architectures that clearly ask for the tests to continue.
  2. They are committed to supporting these tests and ensuring Knative works on the additional architectures
  3. The cost of the additional hardware needed for testing is covered by IBM
  4. Significant parts of Knative can be used as is by the community on these additional HW architectures, but there are some identified gaps (envoy) that are presently being worked on by the teams.

Did I miss anything?

A reasonable path forward here is to keep testing on the two additional HW architectures and allow the teams to remove such gaps and ensure that the community can use Knative as is on the additional architectures.

We can reevaluate this in six months time to see the progress made.

dprotaso commented 9 months ago

I've talked to the productivity folks and they're in agreement with Roland's suggestion (here). PRs are out and are written in a way to make it easy to revert in the future.

Until the necessary dependencies support ppc/s390x out of the box we're effectively testing code that no end-user can check out and run on their cluster without custom IBM patches. Like @upodroid mentioned these should be worked on in their respective projects.

Until then if these architectures are valuable for end-users it seems like IBM should create a Knative distribution for those architectures and we can link out to them on the Knative website. We do this for other vendors and their distributions.

For continuous testing you can take a look at Red Hat as an example - they have midstream repos for Knative and run their own prow instance. Given IBM already has prow clusters it would seem pretty incremental to host your own control plane - and use the resources in this and Red Hat repos as a guide.

davidhadas commented 9 months ago

@dprotaso IBM (as a HW vendor in this case) is not creating its own downstream distribution, it is supporting the use of the Knative distribution on additional HW architectures, like it does with other OS. Therefore, the midstream RedHat example is not a good one.

As a community, it makes no sense to reject one architecture over another, especially when we have no good reason.

In this case, the community already decided in the past to support the additional architectures, and there are community users using Knative on these architectures today. So this is not a decision that can be taken lightly for a community to stop supporting or to stop testing such architectures on new releases.

Note that we always pride ourselves on the fact that one can use different parts of Knative independently, and that we as a community support open APIs and allow users to use what they need out of Knative. So the argument that one dependency of one piece of the entire Knative distribution is still being worked on for this architecture is not a reason to stop supporting community users on this architecture by stopping the release cycle.

I have added this to be discussed in the TOC.

(The PR @dprotaso was referring to is #357)

rishikakedia commented 9 months ago

Yes, as @davidhadas mentioned, we at IBM are maintaining the knative enablement for s390x and ppc64le architectures. If we have issues with test cases, we will prioritize fixing them. We will open a new issue to re-enable s390x and ppc64le, and we need the knative community's support. Thanks @davidhadas @psschwei for your comments.

davidhadas commented 9 months ago

Apparently #356 slipped in although there is no agreement on this. I have started #360 to revert it until an agreement is reached.

dsimansk commented 9 months ago

Summary of today's TOC call:

Stretch goal: introduce an e2e test setup with Istio for Serving. Currently there's no coverage. In addition to the Envoy efforts, if there's an alternative open source proxy that can be used for Istio, a new job should be introduced to cover that scenario.

I've tried to capture main points from the discussion.

@knative/technical-oversight-committee @knative/productivity-wg-leads

dprotaso commented 9 months ago

In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z.

This falls into the v1.16 milestone - will circle back when that sprint starts.

davidhadas commented 9 months ago

I assume Istio is just one option. Kourier is another. Contour a third.

We aim to ensure that community users can install Knative with a corresponding OS networking layer supported by Knative, so that users can follow the Knative documentation to get up and running with Knative.

It is nice if all networking layers are supported but not necessary.

xnox commented 9 months ago

@pleia2 can you please check if Z or Power teams will be affected, or how they can take this in-house if needed? Just in case it affects plans for IBM Secure Service Container.

davidhadas commented 9 months ago

@xnox, you can contact the Z or Power teams via slack knative-s390x-ppc

pleia2 commented 9 months ago

@xnox Thanks for the heads up, I'll check internally with my teams at IBM, but I'll also follow the lead of @davidhadas here regarding the knative-s390x-pcc channel, since there are some key folks publicly engaged there from both Power and Z (I've also just joined)

dilipgb commented 8 months ago

https://github.com/knative-extensions/eventing-kafka-broker/pull/3777

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

davidhadas commented 5 months ago

Do we still need to monitor this? Or can we close it?

davidhadas commented 5 months ago

/remove-lifecycle stale

dprotaso commented 5 months ago

Do we still need to monitor this?

Yes we should monitor this. I haven't really seen IBM follow through on their commitments from that TOC meeting back in Feb.

Or can we close it?

I figured it's worth providing you the full 6 months as was discussed. There's still a month left or so.

upodroid commented 5 months ago

+1 for revisiting in Aug as discussed

dilipgb commented 5 months ago

Hi all, there are efforts under way to enable envoy-openssl for the IBM Z and IBM P platforms. Most of the work needed for the boringssl and openssl compat library is complete. A few more packages need to be worked on, and after that we will have envoy-openssl available on our platforms as well. Here are some PRs for this: [1] https://github.com/envoyproxy/envoy-openssl/pull/166. [2] https://github.com/envoyproxy/envoy-openssl/pull/219 [3] https://github.com/envoyproxy/envoy/pull/34483

rishikakedia commented 5 months ago

We also have efforts on publishing Envoy (based on OpenSSL) IBM P/Z images alongside of x86 images. https://github.com/envoyproxy/envoy-openssl/issues/221

dilipgb commented 4 months ago

installation document for IBM s390x/ IBM Power https://github.com/knative/docs/pull/6043

dprotaso commented 4 months ago

Following up here - I'm inclined to remove the jobs.

Main reason is the jobs were failing for over two weeks and it was only noticed a day ago. Clearly it's not a priority and I don't think Knative OSS project should be running these tests on behalf of a vendor.
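A long unnoticed failure streak like this one can be detected automatically from job history. The following is a minimal Python sketch, assuming a list of per-day results is available from somewhere such as a testgrid export. The function names, the three-day threshold, and the dates used below are hypothetical.

```python
from datetime import date, timedelta


def failing_since(runs):
    """Return the date the current unbroken failing streak began, or None.

    `runs` is a list of (day, passed) tuples sorted oldest to newest.
    """
    start = None
    for day, passed in runs:
        if passed:
            start = None  # a passing run ends any earlier streak
        elif start is None:
            start = day  # first failure of a new streak
    return start


def needs_attention(runs, today, max_days=3):
    """True when the job has been failing for at least `max_days` days."""
    start = failing_since(runs)
    return start is not None and (today - start) >= timedelta(days=max_days)


if __name__ == "__main__":
    history = [
        (date(2024, 7, 1), True),
        (date(2024, 7, 2), False),
        (date(2024, 7, 3), False),
    ]
    print(failing_since(history))
    print(needs_attention(history, today=date(2024, 7, 16)))
```

Wiring a check like this into a notification would surface a streak within days rather than weeks.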

dilipgb commented 4 months ago

@dprotaso there were intermittent issues with infra in IBM Cloud. Because of the noise created by those infra issues, it was noticed late that CI kept failing even after the infra issues were fixed. We are actively monitoring the jobs and continue to focus on keeping CI healthy.

At the moment there are users running knative on kubernetes on s390x and if we remove the support it impacts those users too. One of the asks from these users was for documentation, and we have updated the docs as well. We are actively working to get envoy-openssl support for P/Z and we are almost at the point where we just have to publish images.

dprotaso commented 4 months ago

At the moment there are users running knative on kubernetes on s390x and if we remove the support it impacts those users too.

We're not removing support - we're removing the CI jobs

dilipgb commented 4 months ago

We're not removing support - we're removing the CI jobs

@dprotaso I'm not really getting how the binaries/images will be delivered in future if CI is removed. Are we going to deliver binaries without CI? Can you propose a call, so we can have a better discussion and understand each other's point of view? What do you think?