kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Unified training operator working progress #138

Open Jeffwan opened 3 years ago

Jeffwan commented 3 years ago

@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.

WG-Training leads have already agreed to move forward. This issue tracks implementation progress. The target for the alpha release of the new unified operator is Kubeflow 1.4.

Configuration and deployment

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Kustomize package | Required | Done | |
| Application CR | Required | Not Done | |
| Images listed in kustomization.yaml | Required | Not Done | |
| Upgradeability | Required | Not Done | |
| Separate cluster scoped and namespace scoped resources | Recommended | Not Done | N/A |
| Kustomize package should be deployable on its own | Recommended | Done | Need to coordinate with 1.4 release |

Custom Resources

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Version stability | Required | Not Done | |
| Backward compatibility | Required | Not Done | |
| Supports status subresource | Required | Done | All jobs have status to reflect the real status |
| CRD schema validation | Required | Not Done | |
| Training operators follow kubeflow/common conventions | Required | Done | https://github.com/kubeflow/tf-operator/pull/1296 https://github.com/kubeflow/tf-operator/pull/1295 https://github.com/kubeflow/tf-operator/pull/1294 https://github.com/kubeflow/tf-operator/pull/1293 |

Observability

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Liveness/Readiness signals | Required | Not Done | |
| Prometheus metrics and Graphs | Required | Not Done | |
| Job Events | Required | Not Done | |
| JSON logging | Recommended | Not Done | |
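
For readers unfamiliar with the "JSON logging" item, the sketch below shows the general idea: each log record is emitted as one JSON object per line so that log collectors can parse fields instead of free text. It is a language-agnostic illustration written in Python with the standard library only, not the operator's actual logging code. The field names (`ts`, `level`, `logger`, `msg`) and the `job` / `namespace` extras are assumptions chosen for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Hypothetical structured fields passed via `extra=`;
        # the names are illustrative, not taken from the operator.
        for key in ("job", "namespace"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "training-operator") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("job created", extra={"job": "mnist", "namespace": "kubeflow"})
```

Keeping one object per line means the output stays greppable while remaining machine-parseable.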

CI/CD

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| E2E tests | Required | Not Done | |
| Scalability / load testing | Required | Not Done | |
| Continuous building of docker images | Recommended | Not Done | https://github.com/kubeflow/testing/pull/951 |
| Continuous updating of Kustomize manifests | Recommended | Not Done | This is not valid anymore - kubeflow/manifests will fetch repo's kustomize manifest |

Docs

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| API Reference docs | Required | Not Done | |
| Application docs | Required | Not Done | |

Owners/Maintenance

| Description | Category | Explanation | Status | Issue |
| --- | --- | --- | --- | --- |
| Healthy number of committers and commits | Required | Committers are listed as approvers in owners files. Number to be determined by TOC based on size and scope of application | Not Done | |
| At least 2 different organizations are committers | Required | | Not Done | |

Adoption

| Description | Category | Explanation | Status | Issue |
| --- | --- | --- | --- | --- |
| List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md | Not Done | |
Jeffwan commented 3 years ago

Things to figure out.

  1. code repo process, project name -> confirm with Bobby.
  2. tech stack? Kubebuilder version, Kubernetes version etc
  3. integration environments - Prow or GitHub Actions, where to hold the images? Andrey
  4. API version management & clientset generation
  5. Development cycle
Jeffwan commented 3 years ago

An update on the above items. @zw0610 @kubeflow/wg-training-leads

  1. code repo process, project name -> confirm with Boggy.

Reuse tf-operator and rename it to kubeflow/training-operator, pending confirmation with Boggy.

All issues, commits, followers, and stars will be transferred to the new repo.

  2. tech stack? Kubebuilder version, Kubernetes version etc

Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2

  3. integration environments - Prow or GitHub Actions, where to hold the images?

Reuse our Prow test jobs in the all-in-one-operator branch. Use AWS public images and CD for the short term.

  4. API version management & clientset generation

Start from the v1 API, since we plan to reuse most of the existing specs in phase 1. Client generation will be postponed until we see other repos that want to use it.

  5. Development cycle

Use a separate tf-operator development branch (July 16) -> when all features are ready, merge back to master (2 weeks of review by training leads) -> clean up the code base (1 week) -> rename the repo (1 month, in time to catch the 1.4 release)

We plan to have an alpha RC release by the Training & AutoML summit (July 16).

andreyvelich commented 3 years ago

Thank you for driving this @Jeffwan!

Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2

Is there any limitation that requires us to use Kubernetes 1.19? Can we just jump to 1.20 or even to the latest 1.21 version?

Client generation will be postponed until we see other repos that want to use it.

Does it mean that we also drop SDK support? Or are we talking only about clientsets, listers, and informers?

Jeffwan commented 3 years ago

Is there any limitation that requires us to use Kubernetes 1.19? Can we just jump to 1.20 or even to the latest 1.21 version?

Yeah, this is flexible. The current repo uses a lower version, so we plan to start with 1.19 and then jump to 1.21 once we merge back to master. That way, if someone uses a lower version, we can keep a tag or release for those users.

Does it mean that we also drop SDK support? Or are we talking only about clientsets, listers, and informers?

Yeah, you are right. The Python SDK will still be supported. I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them?

andreyvelich commented 3 years ago

Sounds good @Jeffwan.

Yeah, you are right. The Python SDK will still be supported. I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them?

No, we only use the TFJob APIs (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L28) to validate TFJob, etc. This can also be dropped on our side, since it isn't necessary. cc @kubeflow/wg-automl-leads

johnugeorge commented 3 years ago

@Jeffwan Great. Can we merge the code in phases, since review will be easier that way?

Jeffwan commented 3 years ago

@johnugeorge Sure. I will cc all training leads on PRs coming into the feature branch.