knative / infra

Home of Infra (Productivity) that hosts configs for prow and other infrastructure related things.

Are s390x/ppc jobs still valuable? #344

Closed dprotaso closed 3 months ago

dprotaso commented 10 months ago

I believe originally the ppc/s390x jobs were added to test knative on different architectures with hardware supplied by IBM.

I thought this was for the benefit of the IBM CodeEngine folks. I confirmed with @psschwei that CodeEngine doesn't use these architectures (anymore?).

The other bit is that we don't have anyone really looking at the tests and fixing them: https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-contour-tests

Furthermore, it's not clear if users can even run Knative on s390x with OSS, e.g. the kourier & istio envoy images are only arm and amd64.

I'm thinking we should just drop testing these architectures, remove the prow jobs and inform IBM that we no longer need those prow clusters.
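For anyone who wants to check the architecture point above for a given image, here is a minimal sketch. It assumes you already have the image's index JSON (for example, saved from `docker manifest inspect <image>`) and only reads the `manifests[].platform.architecture` field defined by the OCI image index format. The function name and the sample index are hypothetical and for illustration only; they are not real registry output for kourier or istio.

```python
import json


def supported_architectures(image_index: dict) -> set[str]:
    """Return the CPU architectures listed in an OCI image index / manifest list."""
    archs = set()
    for manifest in image_index.get("manifests", []):
        arch = manifest.get("platform", {}).get("architecture")
        # Attestation entries commonly report "unknown"; skip them.
        if arch and arch != "unknown":
            archs.add(arch)
    return archs


# Hypothetical image index, made up for illustration (not real registry output).
sample_index = json.loads("""
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux"}},
    {"platform": {"architecture": "unknown", "os": "unknown"}}
  ]
}
""")

# Architectures this thread is about that the sample image does not provide.
missing = {"s390x", "ppc64le"} - supported_architectures(sample_index)
print(sorted(missing))  # ['ppc64le', 's390x']
```

If `missing` is non-empty for an image a Knative install depends on, that image would have to be built or sourced separately for those architectures.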

dilipgb commented 4 months ago

https://github.com/knative/infra/pull/484

davidhadas commented 4 months ago

We're not removing support - we're removing the CI jobs

Please don't remove CI jobs. We have discussed this in the past and we can discuss it again. However, please do not remove CI jobs without involving the relevant teams and reaching an agreement.

dprotaso commented 4 months ago

@dprotaso I'm not really getting how the binaries/images are delivered in future if CI is removed.

Our release jobs produce the images and binaries. The s390x/ppc jobs this issue is about don't refer to that.

dprotaso commented 4 months ago

Please don't remove CI jobs. We have discussed this in the past and we can discuss it again. However, please do not remove CI jobs without involving the relevant teams and reaching an agreement.

We've already discussed this and have set some expectations that were documented in this comment: https://github.com/knative/infra/issues/344#issuecomment-1944174280

Going over the expectations:

1. The ask for respective maintainers/teams to stabilize the runs

As of this morning the tests are still unstable

2. Regularly Monitor the results and proactively fix or raise issues with Knative bits or infra

Tests were broken for 2+ weeks before the issue was surfaced

3. Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks

Didn't see further engagement in the Productivity working group unrelated to these jobs

4. In a 6-month period revisit the topic to check progress on Envoy changes required for Istio on P and Z. Goal is to have community releases of Knative usable on respective architectures with 3rd party dependencies available as well.

The goal hasn't been met. Patching the knative installation files with a redhat image that requires a RedHat account doesn't really meet the bar here.

Again we are not dropping support for s390x/ppc - we're dropping these random CI jobs

dilipgb commented 4 months ago

Please don't remove CI jobs. We have discussed this in the past and we can discuss it again. However, please do not remove CI jobs without involving the relevant teams and reaching an agreement.

We've already discussed this and have set some expectations that were documented in this comment: #344 (comment)

Going over the expectations:

1. The ask for respective maintainers/teams to stabilize the runs

[DP]: As of this morning the tests are still unstable

[DB]: Tests are failing because the IBM Z / IBM P serving images are not released for release-1.15. We are unable to reduce the noise because of that.

2. Regularly Monitor the results and proactively fix or raise issues with Knative bits or infra

[DP]: Tests were broken for 2+ weeks before the issue was surfaced

[DB]: The tests broke because of changes in Chainguard on July 16. We reported the issue on July 24. It took us approximately a week because there was some flakiness in the infra scripts, which I mentioned in Slack as well.

3. Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks

[DP]: Didn't see further engagement in the Productivity working group unrelated to these jobs

[DB]: We are expanding the functionality of Knative on s390x/ppc64le by enabling required packages from different communities, such as buildpacks, which is needed for Knative functions: https://github.com/buildpacks/lifecycle/pull/1142. Trust manager is enabled for Knative eventing tests: https://github.com/cert-manager/trust-manager/pull/315. At the moment we are trying to work with paketo-buildpacks to get multiarch support. We are also doing some analysis to enable keda officially. All of this tooling is necessary for running Knative on s390x.

4. In a 6-month period revisit the topic to check progress on Envoy changes required for Istio on P and Z. Goal is to have community releases of Knative usable on respective architectures with 3rd party dependencies available as well.

[DP]: The goal hasn't been met. Patching the knative installation files with a redhat image that requires a RedHat account doesn't really meet the bar here.

[DB]: This is a stopgap solution; we are waiting for the envoy-openssl release here: https://github.com/envoyproxy/envoy-openssl/issues/221. Once that is done we can support envoy without these images.

Again we are not dropping support for s390x/ppc - we're dropping these random CI jobs

upodroid commented 4 months ago

I concur with @dprotaso's assessment that s390x/ppc jobs should be removed.

  1. The jobs are too flaky and aren't passing consistently. For a while, they were fully broken for 2 months, which wasn't resolved until Dave mentioned that we were planning on removing the jobs.
  2. The ecosystem changes we asked for haven't been shipped yet.
  3. I don't see any contributions to non-s390x/ppc Knative Productivity issues.
  4. The project's CI cost has exceeded its budget and I'm adjusting job frequencies. Look at https://github.com/knative/infra/pull/494 and https://github.com/knative/hack/pull/389

https://testgrid.k8s.io/r/knative-own-testgrid/client#ppc64le-e2e-tests broken for 2 months till recently

https://testgrid.k8s.io/r/knative-own-testgrid/serving#ppc64le-kourier-tests flakes too frequently

https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-kourier-tests

valen-mascarenhas14 commented 4 months ago

  1. The jobs are too flaky and aren't passing consistently. For a while, they were fully broken for 2 months, which wasn't resolved until Dave mentioned that we were planning on removing the jobs.

@upodroid As previously mentioned by Dilip in the comments above, the failures were primarily due to the changes in Chainguard and infra-related problems. That was also why it took us some time to figure out the cause of these failures. These issues were communicated and discussed in the Slack group. Although the jobs experienced intermittent flakiness, we diligently debugged and resolved these issues as they arose. It's important to clarify that we were actively working on these issues and did not wait 2 months, until the potential removal of the jobs was suggested, to address them. Our efforts ensured that any disruptions were minimized and that fixes were implemented in a timely manner.

dilipgb commented 3 months ago

https://github.com/knative/infra/pull/495