kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

Improve repo structure and delete outdated manifests #1554

Closed thesuperzapper closed 3 years ago

thesuperzapper commented 4 years ago

This repository is extremely messy and is a horrible user experience. The mess has caused us real issues, as we botched the 1.1 release due to a lack of visibility of what is even in the manifests (See: #1553).

When paired with the almost complete lack of docs relating to Kubeflow 1.1, this leaves Kubeflow nearly impossible to install; just take a look at the recent issues (across all the repos) and our Slack and you will see what I mean.

My proposal is to create a new manifests repository and make some changes as we do. Specifically, we should:

  1. separate each of the 'distributions' into its own repository (similar to gcp-blueprints),
  2. only have the actual Kubeflow manifests in the core manifests repo (that is, no Istio, cert-manager, or Knative, as they are ostensibly part of a distribution, not Kubeflow),
  3. actually keep the docs up to date

EDIT: see this comment below for my proposal of a 'Generic' distribution (to replace most of the current ones)

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| --- | --- |
| kind/feature | 0.60 |
| platform/gcp | 0.92 |


jlewi commented 4 years ago

I think this is relevant to the discussion in kubeflow/community#402; in particular, clarifying the responsibilities between application owners, the proposed deployments WG, and the platform owners.

I think the lack of structure and the accumulation of code in kubeflow/manifests are an outcome of a lack of clear ownership/responsibilities.

thesuperzapper commented 4 years ago

/priority p1

thesuperzapper commented 4 years ago

I also believe we no longer need the concept of a 'distribution' (in the form we currently have it), as it was primarily created to support GCP deploying a full K8S stack with Kubeflow on top.

If you look at Slack/GitHub-Issues you will see most users are installing Kubeflow on existing clusters of all types: GKE, AKS, EKS, native-k8s. We need to focus on supporting them!

To facilitate this, I propose we:

  1. Do the above migration to a new kubeflow/manifests for all ACTUAL Kubeflow manifests
    • INCLUDING: pipelines, jupyter-web-app, etc
    • NOT INCLUDING: authentication (like dex), istio, argo-workflows, cert-manager, knative, etc
  2. Consolidate all distributions (that are willing) into a 'Generic' one:
    • this distribution would:
      • assume the user at least already has a K8S cluster running
      • not depend on any vendor specific feature
      • live in its own git repo (as proposed above)
      • include everything you would need to deploy Kubeflow on an empty cluster: auth (dex), istio, cert-manager, argo-workflows, knative, mysql, etc
        • NOTE: we would let users pick the pieces they need (potentially with kfctl)
        • NOTE: we would clearly document which versions/configs of these tools are supported (if they don't choose to install them with Kubeflow)
      • get its actual Kubeflow manifests from the new kubeflow/manifests repo (as proposed above)
      • be intended for GitOps use (but not tied to any specific GitOps tool)
    • I believe this is good for EVERYBODY:
      • as currently, all distributions are pretty much the same
      • vendors can still validate this distribution for their customers
      • the overhead for releasing a new Kubeflow version is reduced to a single place
      • it means that Kubeflow is an OPEN STANDARD (which can be run anywhere)
      • users will actually be able to install/upgrade Kubeflow!
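
For illustration only, here is a rough sketch of what the top-level kustomization.yaml of such a 'Generic' distribution repo could look like, with each piece as a folder the user can simply remove if their cluster already provides it. All folder names below are hypothetical, not an existing layout:

```yaml
# Hypothetical top-level kustomization.yaml for the proposed 'Generic' distribution.
# Folder names are illustrative; the point is that each prerequisite is a
# separate resource a user can delete if they already run it.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# third-party prerequisites (remove any your cluster already provides)
- third-party/cert-manager
- third-party/istio
- third-party/dex
- third-party/knative
# actual Kubeflow components, pulled in from the new kubeflow/manifests repo
- kubeflow/jupyter-web-app
- kubeflow/pipelines
```
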
mameshini commented 4 years ago

Agile Stacks' distribution of Kubeflow is available here: https://github.com/agilestacks/stack-ml-eks It is open source and it can be merged with the Kubeflow repository in the future. We have taken a GitOps path and customized the Kubeflow manifests to get more control over all the components that are part of the Kubernetes platform: dex, istio, letsencrypt, cert-manager, mysql, knative, ceph, csi plugin. The above example is available for AWS EKS, and we are working on similar distributions for GKE, AKS, and on-prem.

thesuperzapper commented 4 years ago

Another option (instead of Part 1) is to discontinue the manifests repo entirely and store the manifests for each component in the same repository as the component code.

However, I see a few issues with that approach:

alfsuse commented 4 years ago

What if we split the non-Kubeflow components out and group them into a sort of "extension/plugin" repo? This way we would have the following:

This way the experience would result in a simpler and lighter installation for the core components, plus a specific customized part based on requirements. Resources/configs would be part of the extension configuration, as patches or overrides to the core components.

erikerlandson commented 4 years ago

A bit of a nit, but what is the benefit of a per-release repo compared to just branching/tagging for releases?

Jeffwan commented 4 years ago

I don't think we need to refactor the entire repo. This repo has been used since 0.6, and we have had many breaking changes in each version. If users have problems, we can address them separately. The major reason it looks messy recently is that the components are growing and the v3 stack migration is underway.

It seems all the problems you describe can be resolved by having a KFDef manifest with minimal components? From the user's point of view, they are not supposed to look at all the components; the entry point should always be the KFDef or other manifest files.
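
As a sketch of that idea (not an officially published config), a minimal KFDef could list only a handful of applications; the repo ref and paths below are illustrative placeholders:

```yaml
# Hypothetical minimal KFDef: only the entry point the user looks at.
# Application paths and the repo archive ref are illustrative placeholders.
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: kubeflow-minimal
  namespace: kubeflow
spec:
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v1.1-branch.tar.gz
  applications:
  - name: namespaces
    kustomizeConfig:
      repoRef:
        name: manifests
        path: namespaces/base
  - name: jupyter-web-app
    kustomizeConfig:
      repoRef:
        name: manifests
        path: jupyter/jupyter-web-app/upstream/base
  version: v1.1-branch
```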

jzhoucliqr commented 4 years ago

I totally agree that we need better support for existing clusters. From what I can tell, most of the errors happen when there are conflicts with components already installed on the existing cluster, like cert-manager. In this case, the admin who installs Kubeflow does need to look at all the components to do customizations.

On the other hand, maybe another reason we only see issues with installs into existing clusters is that creating a brand-new cluster and installing onto it mostly goes smoothly?

connorlwilkes commented 4 years ago

On this topic, is there a clear list of which components are core, which are optional, which of those are managed by Kubeflow, and, for the ones that are not, which versions of the external components are compatible with which version of Kubeflow?

rmgogogo commented 4 years ago

/cc @Bobgy

thesuperzapper commented 4 years ago

I am seeing some confusion about why we need to create a new repo, rather than just use this existing one.

It's because this new repo will be completely different in both code and layout. It will ONLY HAVE Kubeflow component manifests, which are organised cleanly into folders. Distributions like GCP/AWS would live in their own repos, but the Kubeflow project would only be responsible for the 'Generic distribution'.

Given these changes are so significant, if we want to keep current releases (Kubeflow 1.0 and Kubeflow 1.1) available, we need a new repo.


The main outcomes we can achieve with this refactor:

  1. The release process is massively simplified:
    • we only depend on the 'Generic distribution' to be ready for release
    • non-Kubeflow components don't get in the way (as they live in the 'distribution' repos)
  2. We can make patch releases:
    • currently the release-model is too cumbersome for 1.1.1 type releases
  3. Kubeflow component maintainers can safely update their manifests:
    • currently, it's not clear how to safely upgrade YAML because everything is such a spider-web of dependencies (See: #1553)
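
To make the release point concrete, here is a sketch of how a distribution repo could consume the refactored kubeflow/manifests as kustomize remote bases, so a 1.1.1-style patch release is just a `ref` bump (URLs and paths are hypothetical):

```yaml
# Hypothetical distribution-side kustomization.yaml consuming the new repo
# as remote bases; bumping `ref` is all a patch release would require here.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- github.com/kubeflow/manifests/apps/jupyter/jupyter-web-app/base?ref=v1.1.1
- github.com/kubeflow/manifests/apps/pipelines/base?ref=v1.1.1
```
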
veggiemonk commented 4 years ago

These are all good points from @thesuperzapper. I agree :+1:

It's best for the project to focus on KubeFlow rather than half the CNCF Landscape + KubeFlow.

Also, my experience trying to install Kubeflow on GKE was a total disaster, as I did not need half of what was in the manifests, and for the other half I'm still trying to figure out why I need it.

As far as I understood, Kubeflow is an architecture that gathers other projects as plugins.

It would be nice to let people understand which project's manifest is tied to which feature, instead of asking people to install a custom build tool (kfctl) that spits out 40+k lines of YAML, hoping that they will suddenly understand what everything is doing.

I love the pick-and-choose approach, but if I don't understand what I'm choosing, is the goal reached?

I would be so happy to have a way to deploy Kubeflow on a Kubernetes cluster with Istio already installed, because then I would understand what I am installing.

alfsuse commented 4 years ago

Regarding @veggiemonk's and @thesuperzapper's comments: when users install KF over an existing cluster, they may have specific requirements, such as particular Istio or Dex versions, so they need to customize the manifests ahead of time to add/remove/change things. From a UX perspective that is not great, because doing it that way greatly raises the risk of human error.

I personally like @Jeffwan's approach here:

> It seems all the problems you describe can be resolved by having a KFDef manifest with minimal components? From the user's point of view, they are not supposed to look at all the components; the entry point should always be the KFDef or other manifest files.

So I think we need to define a list of:

swiftdiaries commented 4 years ago

To give some context for the approach with kfctl and KFDef: initially we had a ksonnet-based deployment that configured Kubeflow in an entirely transparent manner. By that I mean it was multi-step, and you could modify and customize individual components in any way a user wanted to.

There was some friction with the tooling itself and how it used ksonnet/jsonnet rather than helm / plain yaml (which was what the k8s community was used to). And so kfctl.sh was introduced - a bash script to automate some of that into three broad steps:

That change was meant to paper over some of the difficulties in understanding ksonnet/jsonnet and to abstract away the tool's complexity with a layer on top. Over time the bash script accrued some debt, and it was decided to move it to a Go binary. Around the same time, ksonnet was archived. And our move to kustomize wasn't easy, because the feature set wasn't a 1-to-1 mapping. So we had to build some custom logic into the kfctl Go binary and structure the manifests repo in a way that wasn't "standard". As with any project, we had deadlines to meet and were dealing with a deprecated/archived project, and so we cut corners to deliver a "working" implementation.

Between these changes, we lost the ability to configure individual applications, and kfctl with manifests became the only way to deploy Kubeflow.

After this, we introduced Istio as part of the stack to move away from Ambassador as the API Gateway. This meant that we now had two "versions/flavors" of Kubeflow: one without Istio and one with Istio, to transition over to Istio. This introduced the concept of kustomize overlays. But we also needed to order the install now, because unless Istio and its CRDs are installed first, the applications using Istio CRDs (Virtual Services) would fail. This was done with the help of the application controller and meta-controller (I think). But this meant that we needed two overlays: one for Istio-related virtual services and one for ordering the applications with the Application CRD. But kustomize at the time didn't have a feature to merge multiple overlays and order applications. So we introduced the concept of a KFDef to flesh out which overlays need to be merged together and to order the application installs (Istio first, tf-operator next, and so on...).
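
For readers unfamiliar with that pattern, here is a rough sketch of the base-plus-istio-overlay layout described above (file and directory names are illustrative, not the exact ones used in this repo):

```yaml
# base/kustomization.yaml -- the application on its own
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
---
# overlays/istio/kustomization.yaml -- the same application, plus the
# Istio VirtualService layered on top of the base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
- virtual-service.yaml
```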

Now, notice how we started out with something to paper over complexity (simply a bash script that calls the individual CLIs: ksonnet, kubectl...), replaced that with a maintainable Go binary, but ended up adding more complexity. This was not intentional, but driven by the need of the hour and by being put on the spot with difficult choices. Maybe we could've made better choices, but here we are.

Anywhoo, my point is that ANY project, OSS or otherwise, accrues debt over time for whatever reason. And I'm seeing a rising number of rants in comments and issues; these are NOT helpful at all. Be kind. Offer something constructive. This is an open community, so open a PR, propose a design. For example, turning "this repo is a mess" into "re-architecting manifests to remove tech debt" or something similar would go a long way. The maintainers did their best at the time, taking the time to contribute and build this project into what it is today.

swiftdiaries commented 4 years ago

Now, to address the issue, I see two requests:

I think the work done with kustomize v3 is a good starting point for us to build towards manifests which are configurable.

And also, there will always be people who want a simple installation process. I think in the past we went for a "low bar, high ceiling" approach, where (most) people can install with a simple set of commands, and a set of people (power users) can customize / modify / extend. We want to enable all users to be successful with Kubeflow.

[high ceiling] In order to work with existing installations of Istio+Dex, we need to make applications more modular. [low bar] We need to make the default install experience more reliable.

jlewi commented 4 years ago

> I also believe we no longer need the concept of a 'distribution' (in the form we currently have it), as it was primarily created to support GCP deploying a full K8S stack with Kubeflow on top.

The concept of a "distribution" arises because a one size fits all (or even most) approach won't work.

One of the problems we encountered early on is that trying to create a configuration that works every where leads to a suboptimal experience.

A simple example: not every K8s cluster has a default storage class. So how should we configure Jupyter by default? Should Jupyter use the default storage class to provide durable storage for notebooks? That leads to a non-functional deployment on clusters without a default storage class. Or do we use ephemeral storage, which leads to data loss if the pod is preempted?
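
Expressed as manifests (illustrative only), the trade-off looks like this:

```yaml
# Option A: a PVC with no storageClassName relies on the cluster's default
# StorageClass; on clusters without one, the claim stays Pending and the
# notebook never starts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-workspace
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
# Option B: an ephemeral volume always schedules, but notebook data is lost
# when the pod is deleted or preempted (shown as a pod-spec fragment).
# volumes:
# - name: workspace
#   emptyDir: {}
```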

Multiple distributions are necessary in order to prevent regression to the lowest common denominator.

Distributions allow us to embrace the diversity of use cases and opinions by trying to make it easy for folks to create opinionated distributions of Kubeflow, rather than trying to achieve consensus.

Even for individual applications there is a growing number of ways to configure them; e.g.

Likewise, everyone's existing clusters will be configured differently.

It would be great if Kubernetes had a mechanism for application dependency management but it doesn't (or at least not one considered to be a standard).

Creating such a dependency management system is outside the scope of Kubeflow.

Distributions are a way for folks to try to create an opinionated solution to that problem.

As an example, IBM is creating a distribution for Kubeflow based on OLM in part because OLM

With respect to upgrades I would suggest looking at kubeflow/kfctl#304 for some helpful context about some of the tech debt @swiftdiaries is referring to that is impeding upgrades.

Likewise kubeflow/manifests#1007 provides context about vars and the v3 migration. vars originated as a way to handle necessary customizations for different Kubernetes installs; e.g. the ClusterDomain.
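
For context, this is roughly what the kustomize v3 `vars` pattern looks like for a per-cluster value such as the cluster domain (the ConfigMap and field names here are illustrative, not the exact ones in this repo):

```yaml
# Sketch of the kustomize v3 `vars` mechanism for per-cluster customization.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generatorOptions:
  disableNameSuffixHash: true
configMapGenerator:
- name: parameters
  literals:
  - clusterDomain=cluster.local
vars:
- name: clusterDomain
  objref:
    kind: ConfigMap
    name: parameters
    apiVersion: v1
  fieldref:
    fieldpath: data.clusterDomain
configurations:
- params.yaml  # declares which manifest fields may reference $(clusterDomain)
```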

Per @swiftdiaries's suggestion above a great way to clean up tech debt would be to figure out if we can start removing the old legacy manifests.

Regarding outdated images (kubeflow/manifests#1553), this looks to me like a process issue; in particular, ensuring application owners keep their manifests up to date. I think this is something we should try to address as part of the wg-deployments charter (kubeflow/community#402) by clarifying responsibilities and expectations.

thesuperzapper commented 4 years ago

@jlewi

Just a few points:

Finally, do you really think that the current structure of this repo is acceptable? I think starting again is the best option, with clear requirements for basic things like preventing spider webs of Kustomize scripts.

thesuperzapper commented 4 years ago

@jlewi Also, regarding outdated images (see: #1553), at least part of that issue was clearly caused by the fact that component owners are overwhelmed by the current state of this repo.

jlewi commented 4 years ago

Repositories and projects need OWNERs. Any new project/repo would require WG sponsorship. The most likely WG would be the deployments WG. So any progress on a new repo is blocked on formation of that WG.

Of course anyone is free to fork or create their own repo outside the Kubeflow org.

In the interim, we can make incremental progress towards overhauling the structure of this repo through a series of incremental refactorings.

@thesuperzapper given you have expressed an interest in notebooks the jupyter manifests seem like an obvious starting point.

yanniszark commented 4 years ago

@thesuperzapper thanks for putting this together!

I agree that the current manifests repo leaves a lot to be desired. We have managed to make it work by creating an end-to-end pure GitOps process building on top of this repository, but not depending on it, and this is how we are now building both MiniKF and our multi-node, enterprise deployments. But note that this was a significant effort that we had to undertake internally.

With the deployment working group just formed, I believe we should use this momentum and tackle this as the first project of wg-deployment.

In my experience, some major pain points of the current manifests are:

This will also allow us to clean up a lot of the old code and practices and end up with a much more focused repo. So my proposal is to:

  1. Create wg-deployment by merging (https://github.com/kubeflow/community/pull/402)
  2. Create the meetings for wg-deployments
  3. Discuss this as our first project

PatrickXYS commented 3 years ago

Let's fix this before 1.3.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

thesuperzapper commented 3 years ago

I would say after https://github.com/kubeflow/manifests/issues/1735 was merged, this issue is mostly resolved. /close

google-oss-robot commented 3 years ago

@thesuperzapper: Closing this issue.

In response to [this](https://github.com/kubeflow/manifests/issues/1554#issuecomment-859311723):

> I would say after https://github.com/kubeflow/manifests/issues/1735 was merged, this issue is mostly resolved.
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.