kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Move tensorflow/k8s to kubeflow/tf-operator #350

Closed: jlewi closed this issue 6 years ago

jlewi commented 6 years ago

Now that we have the Kubeflow GitHub organization, should we move this repo there?

Reasons for moving the repo

Reasons for not moving

ConnorDoyle commented 6 years ago

+1 to move

Propose naming the repo kubeflow/tf-operator. The rationale for basing the name on the operator rather than the CRD it controls is that you could imagine more TF-related resources in the future.

yupbank commented 6 years ago

+1 to move

ScorpioCPH commented 6 years ago

+1 And how about naming it kubeflow/tf-controller :)

gaocegege commented 6 years ago

+1, and +1 for the name kubeflow/tf-controller

k82cn commented 6 years ago

+1 to move :).

kubeflow/tf-controller seems better for k8s's style :).

jlewi commented 6 years ago

Per tensorflow/k8s#300, kubeflow/tf-operator seems better than kubeflow/tf-controller.

gaocegege commented 6 years ago

tf-operator SGTM, too.

pineking commented 6 years ago

+1

@jlewi what's the status of this repo (tensorflow/k8s) after the move to kubeflow/kubeflow? Frozen? Deleted?

gaocegege commented 6 years ago

@pineking This repo would be moved to kubeflow/tf-operator, not kubeflow/kubeflow, I think. It will be transferred, so when you visit github.com/tensorflow/k8s you will be redirected to kubeflow/tf-operator.

gaocegege commented 6 years ago

I think there are some tasks we should do:

And the move will create conflicts with most of the open PRs, so I think it may be better to squash-merge the mergeable PRs ASAP.

jlewi commented 6 years ago

Does it matter whether we do the rename before or after we move the repo?

I suggest we don't update the name in kubernetes/test-infra until after the move so that our tests continue to work.

gaocegege commented 6 years ago

@jlewi SGTM, I think it doesn't matter 😄

jimexist commented 6 years ago

🚀 no churn no gain, early churn << later churn.

jlewi commented 6 years ago

I've started working on #205 (replacing Airflow with Argo for CI). My suggestion is we wait until we've migrated off Airflow to Argo for testing to do the move. Debugging tests is so painful with Airflow that I'd rather migrate to Argo first as I think this will make dealing with the churn and breakages much easier.

jlewi commented 6 years ago

I submitted #205.

I'll plan on moving this repo to kubeflow/tf-operator early this week

@wbuchwalter @DjangoPeng @gaocegege @ScorpioCPH @Jimexist @zjj2wry as committers to tensorflow/k8s can you explicitly LGTM/+1 this change?

DjangoPeng commented 6 years ago

+1 for moving it to kubeflow/tf-controller.

gaocegege commented 6 years ago

+1 Thanks for your work 😄

> I'll plan on moving this repo to kubeflow/tf-operator early this week

There is one thing: we do not have the role committer in our governance proposal.

> @wbuchwalter @DjangoPeng @gaocegege @ScorpioCPH @Jimexist @zjj2wry as committers to tensorflow/k8s can you explicitly LGTM/+1 this change?

jlewi commented 6 years ago

@DjangoPeng @gaocegege Seems to be some disagreement over the name.

Based on the feedback in this issue, kubeflow/tf-controller seems more popular, but based on tensorflow/k8s#300, kubeflow/tf-operator seems like the more accurate terminology.

Specifically, this comment says that an operator is a controller that is application-specific, whereas controllers are not application-specific.

In our case, we are very much application-specific, so operator seems like the better terminology (albeit it's just a convention).

Technically, all operators are controllers, but I think we should follow CoreOS and use operator to indicate we are application (TF) specific.

@DjangoPeng @gaocegege what do you think?
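
To make the operator-vs-controller distinction concrete, here is a minimal Go sketch with hypothetical, simplified types (not the actual tf-operator API). The loop shape is what any controller does, observe desired state and drive actual state toward it; the TensorFlow-specific body, which knows a job needs parameter servers and workers rather than just generic pods, is what makes it an operator.

```go
// Minimal sketch of "an operator is an application-specific controller".
// TFJob and reconcile are illustrative stand-ins, not the real tf-operator API.
package main

import "fmt"

// TFJob stands in for the CRD this repo manages.
type TFJob struct {
	Name    string
	Workers int
	PSCount int
}

// reconcile follows the generic controller contract: compare desired state
// with actual state and converge them. What makes this an *operator* is that
// the body encodes TensorFlow knowledge: a distributed TF job needs parameter
// servers and workers, not just "some pods".
func reconcile(job TFJob) {
	for i := 0; i < job.PSCount; i++ {
		fmt.Printf("ensure pod %s-ps-%d exists\n", job.Name, i)
	}
	for i := 0; i < job.Workers; i++ {
		fmt.Printf("ensure pod %s-worker-%d exists\n", job.Name, i)
	}
}

func main() {
	reconcile(TFJob{Name: "mnist", Workers: 2, PSCount: 1})
}
```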

gaocegege commented 6 years ago

I agree with the name tf-operator

DjangoPeng commented 6 years ago

@jlewi @gaocegege Operator is more powerful than controller in my mind. If we move it to kubeflow/tf-operator, we'd better clearly define the scope of tf-operator. For example, it can manage the whole lifecycle of distributed tensorflow jobs and provide many termination policies such as restart, restore, stop, etc.

BTW: considering the development of Kubeflow, there may be an mxnet-operator, a caffe-operator, and a pytorch-operator in the future. So we'd better manage the corresponding CRDs in a unified repo, such as kubeflow/clientset. I'm not sure whether now is the time to do that.
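
As a rough illustration of the termination-policy idea above, here is a hedged Go sketch of how a per-replica restart decision might look. RestartPolicy, its values, and onReplicaExit are illustrative names only, not the actual tf-operator API.

```go
// Sketch of a termination policy: decide what to do when a replica exits.
// All names here are hypothetical, for illustration only.
package main

import "fmt"

type RestartPolicy string

const (
	RestartAlways    RestartPolicy = "Always"    // always recreate the replica
	RestartOnFailure RestartPolicy = "OnFailure" // recreate only on non-zero exit
	RestartNever     RestartPolicy = "Never"     // leave the replica terminated
)

// onReplicaExit returns the action the operator would take when a replica's
// pod exits with the given code.
func onReplicaExit(policy RestartPolicy, exitCode int) string {
	switch {
	case policy == RestartAlways:
		return "restart replica"
	case policy == RestartOnFailure && exitCode != 0:
		return "restart replica"
	default:
		return "mark replica terminated"
	}
}

func main() {
	fmt.Println(onReplicaExit(RestartOnFailure, 1)) // restart replica
	fmt.Println(onReplicaExit(RestartNever, 1))     // mark replica terminated
}
```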

gaocegege commented 6 years ago

@DjangoPeng

Yeah, I think this repo is an operator, since it does manage the whole lifecycle of distributed TF jobs. I agree that it would be better to place all the clientsets and CRDs in one repo, but it is not a high priority, IMO.

jlewi commented 6 years ago

It seems like we're ok with kubeflow/tf-operator for now. I'll add this to the community meeting agenda just to ensure there's no final objections.

I agree with @DjangoPeng that if we add more operators in the future we might need to rethink code organization but I say we cross that bridge when we come to it.

I opened #179 to gauge interest in PyTorch. Anecdotally PyTorch is the deep learning framework that seems most popular after TF.

gaocegege commented 6 years ago

@jlewi The issue should be https://github.com/kubeflow/kubeflow/issues/179 😄

And FYI, there is a third party MXNet operator inspired by this repo: https://github.com/deepinsight/mxnet-operator. But it follows our old architecture, and I think we should refactor it to be event-driven or reimplement it if we want to support MXNet.

/cc @loadwiki
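
For reference, the event-driven shape suggested here, client-go informers feeding a work queue instead of polling cluster state, looks roughly like the sketch below. It watches Pods only to keep the example small; a real operator would also watch its own CRD, and error handling plus the actual reconcile logic are elided.

```go
// Sketch of an event-driven controller: informer events enqueue keys,
// and a worker loop pops them and reconciles. Assumes in-cluster config.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	config, _ := rest.InClusterConfig() // error handling elided
	clientset, _ := kubernetes.NewForConfig(config)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Event-driven: react to watch events instead of polling cluster state.
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		fmt.Println("reconcile", key) // real logic would sync job state here
		queue.Done(key)
	}
}
```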

DjangoPeng commented 6 years ago

Cool!

jlewi commented 6 years ago

Decision at the morning meeting was to move the repo to kubeflow/tf-operator. I'd like to get an E2E test that verifies Kubeflow is working (kubeflow/kubeflow#207) before we move things, to try to avoid getting into a state where things are broken at head.

jlewi commented 6 years ago

@gaocegege I moved the repository; it is now kubeflow/tf-operator. Can you take on putting together a PR to update all the references in the code?

gaocegege commented 6 years ago

@jlewi I will do it.

gaocegege commented 6 years ago

PTAL at #386. Also, I think we should set up Travis and Prow for this repo.

jlewi commented 6 years ago

kubernetes/test-infra#6779 defines the Prow jobs for the new repo. That should hopefully get merged, at which point we can figure out what's broken and needs to be fixed.

jlewi commented 6 years ago

https://prow.k8s.io/?repo=kubeflow%2Ftf-operator

It looks like the Prow job was triggered correctly for kubeflow/tf-operator, but the results weren't written back to the PR.

jlewi commented 6 years ago

The k8s-ci-robot has write access on the repo, which matches what we have for kubeflow/kubeflow.

jlewi commented 6 years ago

Looks like the prow job can't pull the image

Log not found: response has status "400 Bad Request" and body "{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"4fea41a7-1041-11e8-b419-0a580a6c0122-0\" in pod \"4fea41a7-1041-11e8-b419-0a580a6c0122\" is waiting to start: image can't be pulled","reason":"BadRequest","code":400}

gaocegege commented 6 years ago

It seems that there is only one thing to do:

jlewi commented 6 years ago

@gaocegege What name are you referring to?

gaocegege commented 6 years ago

https://github.com/kubernetes/test-infra/blob/master/prow/plugins.yaml#L13 @jlewi

jlewi commented 6 years ago

We have already enabled it for the org: https://github.com/kubernetes/test-infra/blob/master/prow/plugins.yaml#L23

So I'm going to mark this bug as closed. It would still be good to clean up the reference to tensorflow/k8s in https://github.com/kubernetes/test-infra/blob/master/prow/plugins.yaml#L13.