kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

Distributions: Readiness for 1.3 Kubeflow Release #1798

Closed yanniszark closed 2 years ago

yanniszark commented 3 years ago

Problem Statement

We have a 6-step plan for releasing Kubeflow 1.3 (https://github.com/kubeflow/manifests/issues/1777). This issue covers the 6th step: distributions use the instructions from wg-manifests on how to use the kustomizations and create distributions for 1.3.

| Distribution | Owners | Readiness for 1.3 |
| --- | --- | --- |
| Arrikto EKF | @kubeflow/arrikto | ✔️ |
| Arrikto MiniKF | @kubeflow/arrikto | ✔️ |
| Azure | @kubeflow/azure | ❌ |
| AWS | @kubeflow/aws | ✔️ |
| Charmed Kubeflow | @RFMVasconcelos | ✔️ |
| Google Cloud | @kubeflow/google | ✔️ |
| IBM | @kubeflow/ibm | ✔️ |
| Kubeflow on MicroK8s | @RFMVasconcelos | ✔️ |
| Kubeflow Operator | @kubeflow/red-hat | ✔️ |
| Kubeflow with Argo CD | @DavidSpek | ✔️ |
| kfctl_istio_k8s | | Support dropped, remove from docs |
| kfctl_istio_dex | | Support dropped, remove from docs |
| Openshift | @kubeflow/red-hat | ✔️ |

Distributions should have owners and update their process/docs for 1.3 installations.

Distributions can use the instructions for installing Kubeflow 1.3 components and common services, provided by wg-manifests, to perform their integration: https://github.com/kubeflow/manifests/tree/v1.3.0-rc.0#readme

I would like to ask all distribution owners to check in on this issue and confirm that they will support their distributions for Kubeflow 1.3.

Current Issues:

davidspek commented 3 years ago

Guess you can add Kubeflow On-Prem/ArgoFlow/Argo CD to this list with https://github.com/kubeflow-onprem/ArgoFlow and status Ready. /cc @jtfogarty @tbaums @RFMVasconcelos

nakfour commented 3 years ago

@yanniszark we plan to update the Openshift distribution, however not sure if one week will be enough, it will depend on the issues we run into.

rui-vas commented 3 years ago

Hi Yannis,

Thank you for leading this so effectively :)

For reference, I am only a small part of the Canonical team, @knkski @evilnick and @DomFleischmann are the stars :) I will create @kubeflow/canonical next week for simplicity in the future.

yanniszark commented 3 years ago

> Guess you can add Kubeflow On-Prem/ArgoFlow/Argo CD to this list with https://github.com/kubeflow-onprem/ArgoFlow and status Ready.

@DavidSpek sounds awesome! I'd love to include the Argoflow distribution, but I want to ask first. Distributions are usually backed by vendors, who have a commercial interest in supporting and maintaining them. This ensures a good user experience and sufficient support. Who will be the owners of Argoflow? Is Argoflow something that its owners plan to maintain and support throughout the release and in later releases?

> @yanniszark we plan to update the Openshift distribution, however not sure if one week will be enough, it will depend on the issues we run into.

Thanks for the reply @nakfour. I understand that issues may come up during testing, so let's communicate frequently on the status and decide accordingly. Could you also inform me if Red Hat is planning to support the Kubeflow Operator distribution? I believe Red Hat is the main stakeholder there, correct?

yanniszark commented 3 years ago

From Arrikto's side, we are dropping support for the kfctl_istio_dex distribution, so we should completely remove it from the docs. With the instructions provided by WG-Manifests, a user can deploy all Kubeflow components and common services, including Istio and Dex, using standard kustomize and kubectl. So, existing users should be able to just use these instructions instead.

davidspek commented 3 years ago

@yanniszark ArgoFlow just uses the upstream manifests (sometimes with a fix I have implemented), so the maintenance comes down to the upstream manifests and Argo CD itself working. Basically, it automates and simplifies the steps from the README in this repository. I don't think a distribution needs to be backed by a vendor; there can also be a community distribution. Having a commercial vendor doesn't necessarily mean there will be a good user experience, as I've seen a great many unanswered issues related to vendor distributions and their KfDef files.

I will be actively maintaining the ArgoFlow repository for this release, so you can list me as the owner. I think the on-prem working group wants to pick it up as well at some point.

nakfour commented 3 years ago

@yanniszark yes, we are using the KF Operator for our KF distribution in Open Data Hub on Openshift. We are interested in taking over the Kubeflow Operator distribution, but not at the moment. Maybe we can discuss this after KF 1.3 is released.

davidspek commented 3 years ago

@yanniszark I think it should be: Kubeflow with Argo CD. Also, status can be set to ready as it was built specifically for 1.3.

berndverst commented 3 years ago

FYI OIDC Auth Service Manifest is broken. PR here: https://github.com/kubeflow/manifests/pull/1805

davidspek commented 3 years ago

Cross-posting my comment here as well as it directly affects distributions.

@yanniszark After discussing with @Bobgy: the use of the VSCode and RStudio logos in the new Jupyter Web App, which allows spawning VSCode and RStudio notebook servers, might be a blocking issue that prevents distributions from including it. I'll be looking into a mitigation route for distributions to avoid potential problems here, but they will need to be notified about this.

@Bobgy also suggested bringing up that RStudio is licensed under the AGPL, so distributions might need to check whether they can include the example image with RStudio. The above mitigation will also address this problem, but it is something distributions might want to look into separately.

Mitigation: The PR that solves this problem was just merged in https://github.com/kubeflow/kubeflow/pull/5823. All references to the trademarks for VSCode and RStudio have been removed from the UI and code. A new ConfigMap was added to the deployment, which allows users to set whatever SVG logos (for the Spawner UI) and icons (for the index page) they want for each Notebook Server Type.

nakfour commented 3 years ago

@yanniszark bumped into first issue here https://github.com/kubeflow/manifests/issues/1810

Bobgy commented 3 years ago

The issue in kubeflow/kubeflow#5816 of using namePrefix in upstream manifests makes it harder to patch resources (the name we are expected to use when patching becomes confusing: sometimes we need the prefix and sometimes not). It is a common inconvenience for us, and I'd like to hear what others think about it.
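The patching difficulty can be sketched with a minimal pair of kustomizations. Everything named here is hypothetical (the `example-` prefix, the `controller` Deployment, and the file layout are illustrative and not taken from the upstream manifests); the sketch only shows where the ambiguity comes from.

```yaml
# base/kustomization.yaml (hypothetical upstream component)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: example-
resources:
- deployment.yaml        # declares a Deployment named "controller"
# Building this base emits the Deployment as "example-controller".
---
# overlay/kustomization.yaml (hypothetical distribution overlay)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../base
patchesStrategicMerge:
- patch.yaml
---
# overlay/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  # The patch author has to decide between the name written in the
  # base file ("controller") and the emitted name ("example-controller").
  name: controller
spec:
  replicas: 2
```

The base file and the built output disagree on the resource name, so anyone writing a patch has to know which of the two a given patch mechanism expects. That is the "sometimes we need the prefix and sometimes not" situation reported above.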

Bobgy commented 3 years ago

@yanniszark we identified the root cause of https://github.com/kubeflow/kubeflow/issues/5813. It should affect any distribution using profile plugins (GCP and AWS), so I think it's a blocking issue we need to resolve before the release.

davidspek commented 3 years ago

The fix for the trademark issue described in my previous comment has been created in https://github.com/kubeflow/kubeflow/pull/5823.

Bobgy commented 3 years ago

Update for GCP, after resolving https://github.com/kubeflow/manifests/issues/1798#issuecomment-815437471, @zijianjoy and I got KFP and notebooks multi-user mode working on GCP. We are looking into other kubeflow applications.

karlschriek commented 3 years ago

@kubeflow/aws I would be happy to also test out the AWS distribution and help with any issues. The current code in https://github.com/kubeflow/manifests/tree/master/distributions/stacks/aws looks fairly old. Is there something more recent somewhere?

yanniszark commented 3 years ago

@kubeflow/cisco any update on the kfctl_istio_k8s distribution?

andreyvelich commented 3 years ago

> @kubeflow/cisco any update on the kfctl_istio_k8s distribution?

From our side we will use the default Dex + OIDC installation for Kubeflow 1.3. I think we can deprecate kfctl_istio_k8s and kfctl_istio_dex. cc @ramdootp @amsaha

nakfour commented 3 years ago

For OCP we ran into this issue: https://github.com/kubeflow/kubeflow/issues/5803# (we have a workaround, as described in the comment).

davidspek commented 3 years ago

@yanniszark @PatrickXYS Seems like there is an issue with AwsIamForServiceAccount plugin for the profile controller. https://github.com/kubeflow/kubeflow/issues/5812

yanniszark commented 3 years ago

@DavidSpek I took a look at the issue and I believe it's a duplicate of https://github.com/kubeflow/kubeflow/issues/5813, which we have fixed.

nakfour commented 3 years ago

@yanniszark an update on the OCP distribution: we are about 80% done, and I don't think we will be done by Monday. I wanted to see if we can delay the KF 1.3 release, since it looks like most distributions on the list above are still not ready. If not, do we have a target date for the KF 1.3.1 tag so we can have our code tagged? Also, for the next release, I wonder if we could do a two-tier release: one for KF and one a couple of weeks later for distributions. Just a thought, since with all the issues and a lot of components to test, it takes longer. Thanks

rui-vas commented 3 years ago

[Green light] - Charmed Kubeflow distribution: We have no pending issues with the manifests and expect to release our distribution within our typical 2-week timeframe between the upstream release and the distribution release. @knkski and @DomFleischmann are leading this. cc @yanniszark @castrojo

moficodes commented 3 years ago

@yanniszark For IBM release for IKS I am about 95% done. Just a couple of minor changes and clean up. Should be done in a few hours.

PatrickXYS commented 3 years ago

@yanniszark From the AWS side, we're in pretty good shape, 90% done. We should be able to finish the PR today or over the weekend.

Bobgy commented 3 years ago

@yanniszark I'm sending a PR to update the KFP doc in the manifests root README to resolve some confusion: https://github.com/kubeflow/manifests/pull/1851

Also, I think we should update the KFP manifest version in the repo to 1.5.0-rc.3. Does the manifests WG want to do that, or should I create a PR? I'm curious whether you built any script to automate this.

yanniszark commented 3 years ago

Thanks @Bobgy! I merged the README PR, thanks for taking the time to create that one. For upgrading the kfp manifests to 1.5.0-rc.2, I'd love to but I think we are too close to the release to do that. A lot of distributions would need to rebase and redo their testing, pushing the release further. I think we should put it in 1.3.1, along with other important changes like upgrading cert-manager and Knative. What do you think?

nakfour commented 3 years ago

@yanniszark The OCP KF 1.3 distribution is ready, just pending review and merge of https://github.com/kubeflow/manifests/pull/1811

Also, the operator does not currently need any specific changes to get KF 1.3 installed.

PatrickXYS commented 3 years ago

@yanniszark The AWS EKS 1.3 manifest is ready. I'll also find someone to help review it: https://github.com/kubeflow/manifests/pull/1832

Bobgy commented 3 years ago

> Thanks @Bobgy! I merged the README PR, thanks for taking the time to create that one. For upgrading the kfp manifests to 1.5.0-rc.2, I'd love to but I think we are too close to the release to do that. A lot of distributions would need to rebase and redo their testing, pushing the release further. I think we should put it in 1.3.1, along with other important changes like upgrading cert-manager and Knative. What do you think?

UPDATE: I just released KFP 1.5.0. It's based on the same commit as 1.5.0-rc.3, so I'd suggest using KFP 1.5.0 as the final release version.

The difference between KFP 1.5.0-rc.2 and 1.5.0-rc.3 is very minimal, most of the commits are either components or sdk. The only real changes are: https://github.com/kubeflow/pipelines/pull/5446, https://github.com/kubeflow/pipelines/pull/5408, https://github.com/kubeflow/pipelines/pull/5424 (5424 is an important bug fix).

So I'd say it's basically a drop-in replacement for KFP 1.5.0-rc.2, and we do not need to worry about re-integration.

@yanniszark sorry, I keep getting confused about our timeline. From previous emails, I thought the current release would be 1.3.0-rc.1. When is 1.3.0 planned?

Bobgy commented 3 years ago

> @Bobgy also suggested bringing up that RStudio is licensed under the AGPL, so distributions might need to check whether they can include the example image with RStudio. The above mitigation will also address this problem, but it is something distributions might want to look into separately.

in https://github.com/kubeflow/manifests/issues/1798#issuecomment-814748504

I've been consulting with Google lawyers about RStudio being AGPL and haven't got a conclusive answer yet. The GCP distribution might consider disabling RStudio support altogether. We are also confirming whether https://github.com/kubeflow/kubeflow/tree/master/components/example-notebook-servers/rstudio should be considered AGPL as well, which might break kubeflow/kubeflow's license declaration of Apache 2.0.

moficodes commented 3 years ago

IBM Distribution is ready https://github.com/kubeflow/manifests/pull/1823 Waiting for review and merge

yanniszark commented 3 years ago

Thanks @bobgy! I took a look at kubeflow/pipelines#5424 and it seems that it's very similar to a recent bug we fixed in KFP. I viewed the manifests diff, deployed them and tested them and all seems good with no incompatibilities. Thus, I will make a PR to include 1.5.0, since it has an important bugfix. Please take a look at: https://github.com/kubeflow/manifests/pull/1859

> @yanniszark sorry I keep getting confused of our timeline, from previous emails, I thought current release will be 1.3.0-rc.1. When is 1.3.0 planned?

We released 1.3.0-rc.1 over the weekend, as per the email I had sent to the list saying that I would move forward with an rc1. I didn't send a separate email after the actual cut, which it now seems I should have, so perhaps this is why there was confusion. The plan is to cut 1.3.0 today.

berndverst commented 3 years ago

There is insufficient documentation explaining which manifests or overlays must be used for multi-user Kubeflow and which must be used for single-user Kubeflow.

Additionally, dependencies between components from apps and components from common aren't clear.

Further, the example provided by @yanniszark (https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml) deploys both OIDC auth service and a Dex Istio Overlay. Why? I thought people use either Dex or OIDC Auth Service (https://github.com/arrikto/oidc-authservice)? Is there a dependency between these? Can I safely remove the Dex overlay and OIDC Auth Service should still work?

I really need these things documented and explained properly before I (or anyone else contributing in their free time) can complete the Azure distribution.

https://github.com/kubeflow/manifests/issues/1873

davidspek commented 3 years ago

@berndverst Indeed, there is no information regarding single-user deployments. Regarding the OIDC authservice and Dex, they are both necessary. The OIDC authservice is the OIDC client, while Dex is the OIDC provider. Dex could be replaced with another OIDC provider (Keycloak or AWS Cognito, for example) by changing the OIDC authservice configuration.
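As a rough sketch of the client/provider relationship described above, the kustomization below keeps the OIDC authservice and points it at an external provider in place of Dex. The two resource paths follow the layout of the example kustomization discussed in this thread. The ConfigMap name `oidc-authservice-parameters`, its `istio-system` namespace, and the keys `OIDC_PROVIDER` and `OIDC_AUTH_URL` are assumptions that should be checked against the manifests version in use. The `idp.example.com` URLs are purely hypothetical.

```yaml
# Hypothetical overlay placed next to example/kustomization.yaml.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
# OIDC client: always required, whichever provider is used.
- ../common/oidc-authservice/base
# OIDC provider shipped by default. It is commented out here only
# because this sketch assumes an external provider replaces Dex.
# - ../common/dex/overlays/istio

configMapGenerator:
# Assumed name of the ConfigMap that holds the authservice settings.
- name: oidc-authservice-parameters
  # Assumed to match the base; may need adjusting or removing
  # depending on the Kustomize version.
  namespace: istio-system
  # Merge into the existing ConfigMap rather than replacing it.
  behavior: merge
  literals:
  # Hypothetical issuer and authorization endpoint of the external provider.
  - OIDC_PROVIDER=https://idp.example.com/realms/kubeflow
  - OIDC_AUTH_URL=https://idp.example.com/realms/kubeflow/protocol/openid-connect/auth
```

With the default setup, both resource lines stay enabled and the `configMapGenerator` override is unnecessary, because the authservice is already configured to talk to Dex.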

zijianjoy commented 3 years ago

Update: Kubeflow v1.3 on Google Cloud is available. Documentation is also updated: https://www.kubeflow.org/docs/distributions/gke/deploy/overview/

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue has been closed due to inactivity.