kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

[Kubeflow 1.9] Distributions and Kubeflow 1.9 #2611

Closed rimolive closed 2 months ago

rimolive commented 8 months ago

This issue will be used to track progress and coordinate with distributions throughout the 1.9 release cycle.

While we hope all distros will manage to be ready when the KF 1.9 release is out, this is sometimes difficult to achieve. In this issue, we want to both keep track of the progress of distributions towards the KF 1.9 release and also know which of the distros will be working on KF 1.9 (testing during the distribution testing cycle) even if they can't meet the KF 1.9 deadline.

Tagging distribution owners identified from previous releases (Any new or missed distro owners, please comment on this issue)

| Distribution | Representative(s) | State |
| --- | --- | --- |
| AWS | @surajkota | not participating in 1.9 |
| Charmed Kubeflow | @DnPlas | participating in 1.9 |
| Google Cloud | @gkcalat, @zijianjoy, @Linchin | not participating in 1.9 |
| IBM IKS | @Tomcli, @yhwang | participating in 1.9 |
| Microsoft | | not participating in 1.9 |
| Nutanix | @johnugeorge, @nagar-ajay | participating in 1.9 |
| Red Hat OpenShift AI | @rimolive | participating in 1.9 |
| Oracle Cloud Infrastructure | @julioo | participating in 1.9 |
| DeployKF | @thesuperzapper | participating in 1.9 |
| VMWare | @liuqi, @xujinheng | participating in 1.9 |
| QBO | @alexeadem | participating in 1.9 |

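Please let us know if you'll be participating in the 1.9 release by answering the following questions:

  1. Are you planning on having your distro ready in sync with the KF 1.9 release?
  2. Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?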

Please note the release timelines are being discussed in kubeflow/manifests#2606.

cc @kubeflow/release-team @jbottum

ca-scribner commented 8 months ago

@rimolive can you remove @DnPlas from Charmed Kubeflow and replace her with myself? ty!

to your questions, for Charmed Kubeflow:

thesuperzapper commented 8 months ago

@rimolive deployKF will participate in 1.9, but it's not 100% clear exactly what that will look like.


Separately, given "Kubeflow on AWS" did not participate in 1.8, and announced they were no longer supporting their distribution in https://github.com/awslabs/kubeflow-manifests/issues/794, I think it's unlikely they will do 1.9?

Given this, I proposed moving them to "legacy" on the Kubeflow website on this PR https://github.com/kubeflow/website/pull/3641.

However, I also want to avoid confusion with users, because they might think that Kubeflow no longer supports AWS due to the "Kubeflow on AWS" name. So I also think we should merge https://github.com/kubeflow/website/pull/3643 at the same time, which tells users that "Kubeflow on XXXX" is just a name, and NOT the ONLY way to use Kubeflow on that platform.

yhwang commented 8 months ago

For IBM IKS:

Are you planning on having your distro ready in sync with the KF 1.9 release?

Yes

Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?

Yes

liuqi commented 8 months ago

For VMware Distro:

Are you planning on having your distro ready in sync with the KF 1.9 release?

Yes

Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?

Yes

alexeadem commented 8 months ago

For QBO Distro:

Are you planning on having your distro ready in sync with the KF 1.9 release?

Yes

Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?

Yes

tiansiyuan commented 7 months ago

For VMware Distro:

Are you planning on having your distro ready in sync with the KF 1.9 release?

Yes

Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?

Yes

rimolive commented 5 months ago

Calling all Distribution owners! I'm proud to announce our first Release Candidate for Kubeflow 1.9!

You can find the release details in the following URL:

https://github.com/kubeflow/manifests/releases/tag/v1.9.0-rc.0

We'll be working on another Release Candidate when we have Notebooks and KServe Models Webapp updated and ready for KF 1.9. We can use this issue to keep track of blocker issues for distributions while we work on fixing them.

cc @ca-scribner @yhwang @johnugeorge @nagar-ajay @thesuperzapper @liuqi @xujinheng @alexeadem @alex-treebeard

juliusvonkohout commented 5 months ago

We also have to update cert-manager, knative, istio, seldon, bentoml etc which will come in later RCs.

StefanoFioravanzo commented 4 months ago

@ca-scribner @yhwang @johnugeorge @nagar-ajay @thesuperzapper @liuqi @xujinheng @alexeadem @alex-treebeard Can you please acknowledge that you are aware of Kubeflow 1.9 RC0 and that the distribution testing phase has started? Please react with a thumbs up if everything is okay from your side and you are proceeding with testing.

thesuperzapper commented 4 months ago

deployKF is mostly waiting on the updates from Notebooks (https://github.com/kubeflow/kubeflow/issues/7453), but I am aware that a 1.9.0-RC0 was cut with other components.

alexeadem commented 4 months ago

What do we mean by '(around 1.28)' here: https://github.com/kubeflow/manifests/tree/v1.9.0-rc.0?tab=readme-ov-file#prerequisites

Is that v1.28.0 and v1.27.11?

I'm proceeding with the testing in QBO.

OK: Everything is looking good in QBO. Tested by doing a vector addition test.

Details:

git branch
* (HEAD detached at v1.9.0-rc.0)

In Kubernetes v1.28.0:

qbo get nodes kubeflow_v1_9_0_nvidia | jq .nodes[]?.image
"kindest/node:v1.28.0"
"kindest/node:v1.28.0"
"kindest/node:v1.28.0"

with NVIDIA GPU Operator

helm list -n gpu-operator
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/alex/.qbo/kubeflow_v1_9_0_nvidia.conf
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/alex/.qbo/kubeflow_v1_9_0_nvidia.conf
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator-1715634796 gpu-operator    1               2024-05-13 21:13:18.636880948 +0000 UTC deployed        gpu-operator-v24.3.0    v24.3.0 

And Kustomize

./kustomize version
v5.4.1

It looks like platform-agnostic-multi-user-pns is no longer available:

./kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-pns | kubectl apply -f -

as per https://github.com/kubeflow/pipelines/issues/5285

So I used the following instead. I'll update the QBOT installer for this version.

./kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl apply -f -
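The fallback between the two overlays mentioned above could be scripted roughly as follows. This is only a sketch: `pick_overlay` is a hypothetical helper (it is not part of kustomize or of kubeflow/manifests), and it assumes the command is run against a local checkout of the manifests repository.

```shell
#!/usr/bin/env bash
# pick_overlay is a hypothetical helper, not part of kustomize or kubeflow/manifests.
# Usage: pick_overlay <repo-root> <overlay-path>...
# Prints the first overlay path that exists as a directory under <repo-root>.
pick_overlay() {
  local root="$1"
  shift
  local candidate
  for candidate in "$@"; do
    if [ -d "$root/$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no matching overlay under $root" >&2
  return 1
}

# Example (run from the root of a manifests checkout):
#   overlay=$(pick_overlay . \
#     apps/pipeline/upstream/env/platform-agnostic-multi-user-pns \
#     apps/pipeline/upstream/env/platform-agnostic-multi-user) &&
#     ./kustomize build "$overlay" | kubectl apply -f -
```

Listing the older overlay first keeps an installer working on earlier tags while still picking up the multi-user overlay on v1.9.0-rc.0.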

This is what was deployed:

kubectl get pods --all-namespaces -o jsonpath="{..image}" | sed 's/ /\n/g' | sort | uniq
docker.io/istio/pilot:1.17.5
docker.io/istio/proxyv2:1.17.5
docker.io/kindest/kindnetd:v20220726-ed811e41
docker.io/kindest/local-path-provisioner:v0.0.22-kind.0
docker.io/kserve/kserve-controller:v0.12.1
docker.io/kserve/models-web-app:v0.10.0
docker.io/kubeflow/training-operator:v1-f8f7363
docker.io/kubeflowkatib/katib-controller:v0.17.0-rc.0
docker.io/kubeflowkatib/katib-db-manager:v0.17.0-rc.0
docker.io/kubeflowkatib/katib-ui:v0.17.0-rc.0
docker.io/kubeflownotebookswg/centraldashboard:v1.8.0
docker.io/kubeflownotebookswg/jupyter-scipy:v1.8.0
docker.io/kubeflownotebookswg/jupyter-web-app:v1.8.0
docker.io/kubeflownotebookswg/kfam:v1.8.0
docker.io/kubeflownotebookswg/notebook-controller:v1.8.0
docker.io/kubeflownotebookswg/poddefaults-webhook:v1.8.0
docker.io/kubeflownotebookswg/profile-controller:v1.8.0
docker.io/kubeflownotebookswg/pvcviewer-controller:v1.8.0
docker.io/kubeflownotebookswg/tensorboard-controller:v1.8.0
docker.io/kubeflownotebookswg/tensorboards-web-app:v1.8.0
docker.io/kubeflownotebookswg/volumes-web-app:v1.8.0
docker.io/library/mysql:8.0.29
docker.io/library/python:3.7
docker.io/metacontrollerio/metacontroller:v2.0.4
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:92967bab4ad8f7d55ce3a77ba8868f3f2ce173c010958c28b9a690964ad6ee9b
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:ebf93652f0254ac56600bedf4a7d81611b3e1e7f6526c6998da5dd24cdc67ee1
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:421aa67057240fa0c56ebf2c6e5b482a12842005805c46e067129402d1751220
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:bfa1dfea77aff6dfa7959f4822d8e61c4f7933053874cd3f27352323e6ecd985
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.2.0
gcr.io/ml-pipeline/cache-deployer:2.2.0
gcr.io/ml-pipeline/cache-server:2.2.0
gcr.io/ml-pipeline/frontend:2.2.0
gcr.io/ml-pipeline/metadata-envoy:2.2.0
gcr.io/ml-pipeline/metadata-writer:2.2.0
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.2.0
gcr.io/ml-pipeline/scheduledworkflow:2.2.0
gcr.io/ml-pipeline/viewer-crd-controller:2.2.0
gcr.io/ml-pipeline/visualization-server:2.2.0
gcr.io/ml-pipeline/workflow-controller:v3.4.16-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
ghcr.io/dexidp/dex:v2.36.0
kserve/kserve-controller:v0.12.1
kserve/models-web-app:v0.10.0
kubeflow/training-operator:v1-f8f7363
kubeflownotebookswg/jupyter-scipy:v1.8.0
mysql:8.0.29
nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0
nvcr.io/nvidia/gpu-operator:v24.3.0
nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubi8
nvcr.io/nvidia/k8s/container-toolkit:v1.15.0-ubuntu20.04
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
python:3.7
quay.io/jetstack/cert-manager-cainjector:v1.12.2
quay.io/jetstack/cert-manager-controller:v1.12.2
quay.io/jetstack/cert-manager-webhook:v1.12.2
quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
registry.k8s.io/coredns/coredns:v1.10.1
registry.k8s.io/etcd:3.5.9-0
registry.k8s.io/kube-apiserver:v1.28.0
registry.k8s.io/kube-controller-manager:v1.28.0
registry.k8s.io/kube-proxy:v1.28.0
registry.k8s.io/kube-scheduler:v1.28.0
registry.k8s.io/nfd/node-feature-discovery:v0.15.4
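In the listing above, the kubeflownotebookswg images are still tagged v1.8.0. A small filter like the one below could make such leftovers easier to spot in a long image list. It is a sketch only: `images_with_tag` is a hypothetical helper, and it assumes one image reference per line on stdin, as produced by the `kubectl ... | sed ... | sort | uniq` pipeline above.

```shell
#!/usr/bin/env bash
# images_with_tag is a hypothetical helper, not a kubectl feature.
# Reads image references (one per line) on stdin and prints, de-duplicated,
# those whose tag (the text after the last ":") equals $1.
# Digest references (name@sha256:...) never match, since their last field is a hash.
images_with_tag() {
  # Concatenating "" forces awk to compare the tag as a string, not a number.
  awk -F: -v tag="$1" '($NF "") == (tag "")' | sort -u
}

# Example against a live cluster:
#   kubectl get pods --all-namespaces -o jsonpath="{..image}" \
#     | sed 's/ /\n/g' | images_with_tag v1.8.0
```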

juliusvonkohout commented 4 months ago

@alexeadem please check the updated release notes: https://github.com/kubeflow/manifests/releases/tag/v1.9.0-rc.0. Officially, it is 1.27-1.29. And yes, we made emissary the default in 1.7 or 1.8.

DnPlas commented 4 months ago

Hi @rimolive @StefanoFioravanzo, a couple of things:

  1. Could I please ask to replace @ca-scribner with me as the distribution owner?
  2. We are aware that the distribution testing phase has started, but we have identified that components from the kubeflow/kubeflow repository are missing. Is this something coming in another RC? Is this planned?

rimolive commented 4 months ago

> Hi @rimolive @StefanoFioravanzo, a couple of things:
>
>   1. Could I please ask to replace @ca-scribner with me as the distribution owner?

Done

>   2. We are aware that the distribution testing phase has started, but we have identified that components from the kubeflow/kubeflow repository are missing. Is this something coming in another RC? Is this planned?

We decided to move on with rc0 because many components were upgraded, but there's a plan for rc1 with the remaining components. Is there a specific one you are expecting to test?

rimolive commented 4 months ago

Just an update: We have just released Kubeflow 1.9.0-rc.1, which includes all updates from the Notebooks WG, Istio 1.18.7 (targeting a full upgrade to 1.22 by the final release), and Model Registry 0.2.1-alpha. We ask all distributions to help test the new release and open issues so we can work with the Working Groups to fix them before the final release.

You can find the Release Notes on the releases page.

cc @ca-scribner @DnPlas @yhwang @johnugeorge @nagar-ajay @thesuperzapper @liuqi @xujinheng @alexeadem @alex-treebeard

nagar-ajay commented 4 months ago

Created an issue to track Nutanix distribution testing - https://github.com/nutanix/kubeflow-manifests/issues/21

rimolive commented 3 months ago

We are one week away from the Kubeflow 1.9.0-rc.2 release, which we plan to be the last release candidate before final. We really welcome any updates about distribution testing, including bug reports and anything the release team should pursue for rc.2 or final.

cc @ca-scribner @DnPlas @yhwang @johnugeorge @nagar-ajay @thesuperzapper @liuqi @xujinheng @alexeadem @alex-treebeard

DnPlas commented 3 months ago

We will start testing within the next two weeks and will keep you posted.

rimolive commented 3 months ago

Hello Distribution owners! Just wanted to announce the Kubeflow 1.9.0-rc.2 release. It's the last one before we go final. Please take a look at the Release Notes here and help us validate the manifests by posting a /lgtm comment in this issue.

cc @ca-scribner @DnPlas @yhwang @johnugeorge @nagar-ajay @thesuperzapper @liuqi @xujinheng @alexeadem @alex-treebeard

alexeadem commented 3 months ago

/lgtm

Tested in:

QBO: api:cloud-stage-4.3.0.7aba1d45
Kubeflow: 1.9.0-rc.2
Kubernetes: v1.29.4
NVIDIA GPU operator:

helm list -n gpu-operator
adable. This is insecure. Location: /home/alex/.qbo/kubeflow_v1_9_0_nvidia.conf
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator-1719811045 gpu-operator    1               2024-07-01 05:17:27.654568661 +0000 UTC deployed        gpu-operator-v24.3.0    v24.3.0 

Recording: https://youtu.be/-CrtjPsVbUY

rimolive commented 2 months ago

@liuqi @xujinheng Can you please confirm if you are testing Kubeflow 1.9.0-rc.2 manifests and let us know if it looks good?

rimolive commented 2 months ago

@yhwang Please let us know if you tested Kubeflow 1.9.0-rc.2 and it looks good.

yhwang commented 2 months ago

/lgtm

verified the 1.9.0-rc.2 on IKS using the following settings:

juliusvonkohout commented 2 months ago

Even better, test the 1.9 branch in general, because it will contain the final release https://github.com/kubeflow/manifests/commits/v1.9-branch/ and further fixes such as

xujinheng commented 2 months ago

Yes, we are currently testing Kubeflow 1.9.0-rc2. Once we complete our testing, we will post the results here to keep you informed.

nagar-ajay commented 2 months ago

/lgtm - verified workflows mentioned in the tracking issue. https://github.com/nutanix/kubeflow-manifests/issues/21

juliusvonkohout commented 2 months ago

> Yes, we are currently testing Kubeflow 1.9.0-rc2. Once we complete our testing, we will post the results here to keep you informed.

As mentioned above, rc.2 does not contain all fixes.

tiansiyuan commented 2 months ago

Failed to pull image "docker.io/kserve/models-web-app": no matching manifest for linux/arm64/v8 in the manifest list entries

On a macbook with M3 CPU using minikube start --cpus 8 --memory 8192 --kubernetes-version=v1.29 --driver=docker
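The pull error above suggests that the image's manifest list has no linux/arm64 entry. One way to check this before deploying is to inspect the manifest list, for example with `docker manifest inspect`. The snippet below is a rough sketch: `has_arch` is a hypothetical helper that scans that command's JSON output on stdin, and the `v0.10.0` tag in the usage comment is an assumption taken from the rc.0 image listing earlier in this thread, since the failing comment does not name a tag.

```shell
#!/usr/bin/env bash
# has_arch is a hypothetical helper, not a docker subcommand.
# Reads manifest-list JSON (as printed by `docker manifest inspect`) on stdin
# and succeeds if some entry declares the architecture given as $1.
has_arch() {
  grep -Eq "\"architecture\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Example (the tag is an assumption, see above):
#   docker manifest inspect docker.io/kserve/models-web-app:v0.10.0 \
#     | has_arch arm64 || echo "no arm64 build published for this tag"
```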

rimolive commented 2 months ago

@tiansiyuan This thread is exclusively for tracking work with the Kubeflow Distribution owners to test the 1.9 release. Please open an issue in https://github.com/kserve/models-web-app

rimolive commented 2 months ago

This is the current status of the Distribution Testing on July 10th:

| Distribution | Representative(s) | State |
| --- | --- | --- |
| Charmed Kubeflow | @DnPlas | Pending |
| IBM IKS | @Tomcli, @yhwang | LGTM |
| Nutanix | @johnugeorge, @nagar-ajay | LGTM |
| Red Hat OpenShift AI | @rimolive | Pending |
| Oracle Cloud Infrastructure | @julioo | Pending |
| DeployKF | @thesuperzapper | Pending |
| VMWare | @liuqi, @xujinheng | Pending |
| QBO | @alexeadem | LGTM |

We need your updates as quickly as possible: our release date is July 22nd, and we need time to act on any bug reports.

tiansiyuan commented 2 months ago

https://github.com/kserve/models-web-app/issues/88

Done.

juliusvonkohout commented 2 months ago

Hello, please retest with the 1.9 branch https://github.com/kubeflow/manifests/commits/v1.9-branch/ given the merge of https://github.com/kubeflow/manifests/pull/2795. Testing RC.2 is not enough.

juliusvonkohout commented 2 months ago

If no further bugs come up, i will synchronize any last-minute release tags from the other working groups on July 20-21 and do the change log and final release on July 22.

@rimolive if i do not get any more final releases/tags from the other WGs, i probably have to release as is on july 22. You can also decide as release manager that we cut RC.3 and delay the final release.

rimolive commented 2 months ago

@juliusvonkohout hold on, we need the remaining WGs to cut their final releases. We cannot release 1.9 with components in RC releases.

juliusvonkohout commented 2 months ago

> @juliusvonkohout hold on, we need the remaining WGs to cut their final releases. We cannot release 1.9 with components in RC releases.

To cite myself from a few messages above: "if i do not get any more final releases/tags from the other WGs, i probably have to release as is on July 22. You can also decide as release manager that we cut RC.3 and delay the final release."

I won't do anything today or while i am on vacation :-D As i said, July 20-22 is when I can do the remaining stuff. But you have to decide what we do if the final releases/tags from other WGs are not available on July 22. This could be the case and was the case in the previous releases. If so, the question is whether you want to release anyway, or cut an RC.3 on July 22 and delay the final release. Just think about it ;-)

rimolive commented 2 months ago

This is the current status of the Distribution Testing on July 15th:

| Distribution | Representative(s) | State |
| --- | --- | --- |
| Charmed Kubeflow | @DnPlas | Pending |
| IBM IKS | @Tomcli, @yhwang | LGTM |
| Nutanix | @johnugeorge, @nagar-ajay | LGTM |
| Red Hat OpenShift AI | @rimolive | Pending |
| Oracle Cloud Infrastructure | @julioo | Pending |
| DeployKF | @thesuperzapper | Pending |
| VMWare | @liuqi, @xujinheng | Pending |
| QBO | @alexeadem | LGTM |

We have had no changes in 5 days, and next week is the release date for 1.9. Please send us your updates so we can confirm that all distributions are good with the release.

thesuperzapper commented 2 months ago

@juliusvonkohout @rimolive @StefanoFioravanzo I have cut the final v1.9.0 tag for the kubeflow/kubeflow repo, feel free to sync the manifests for this tag into kubeflow/manifests.

DnPlas commented 2 months ago

Hey @rimolive, here is my latest update:

version: 1.9.0-rc.2 platform:

So far it is looking good, so for that version /lgtm.

rimolive commented 2 months ago

Hello,

This is the status for today July 22nd:

| Distribution | Representative(s) | State |
| --- | --- | --- |
| Charmed Kubeflow | @DnPlas | LGTM |
| IBM IKS | @Tomcli, @yhwang | LGTM |
| Nutanix | @johnugeorge, @nagar-ajay | LGTM |
| Red Hat OpenShift AI | @rimolive | LGTM |
| Oracle Cloud Infrastructure | @julioo | Pending |
| DeployKF | @thesuperzapper | Pending |
| VMWare | @liuqi, @xujinheng | Pending |
| QBO | @alexeadem | LGTM |

We see that the majority of distributions agree on the state of the release. Thank you so much to everyone involved in the testing. We'll keep accepting feedback on issues we can consider for 1.9 patch releases.

juliusvonkohout commented 2 months ago

Is someone here encountering this bug/PR ?

https://github.com/kubeflow/manifests/pull/2815 https://github.com/kubeflow/manifests/issues/2812 https://github.com/kubeflow/manifests/issues/2766

It has not been changed in 7 months https://github.com/kubeflow/manifests/commits/master/common/dex/base/config-map.yaml , but some users are complaining

alexeadem commented 2 months ago

> Is someone here encountering this bug/PR ?
>
> #2815 #2812 #2766
>
> It has not been changed in 7 months https://github.com/kubeflow/manifests/commits/master/common/dex/base/config-map.yaml , but some users are complaining

not in QBO