
[Request] Repository for Arena #164

Closed: cheyang closed this issue 5 years ago

cheyang commented 6 years ago

Move repository from https://github.com/AliyunContainerService/arena to KubeFlow community.

/assign @jlewi

jlewi commented 6 years ago

/cc @gaocegege @kkasravi

gaocegege commented 6 years ago

Personally, I love the tool. I haven't tried it, but I looked through the code. I think it improves usability at the CLI level. We have had some discussions about a Kubeflow CLI, and arena is what I want a Kubeflow CLI to be. Thus I think we could accept the contribution from @cheyang and make it a core project in the Kubeflow community. I have communicated with @cheyang and he will continue to contribute to the project.

What I am worried about is the copyright. @cheyang Will you just move the repo to the Kubeflow org, or can you also transfer the code copyright and ownership to the Kubeflow community? And will your company allow the transfer?

cheyang commented 6 years ago

Thanks for the response, @gaocegege. We are in the process of getting the copyright transfer approved. It looks good so far. Once it's approved, we can transfer the project to the Kubeflow community.

gaocegege commented 6 years ago

Then LGTM.

jlewi commented 6 years ago

This is really cool and has important functionality like tooling to help fetch logs.

I took a look at the repo and I have a couple of high-level questions:

  1. What is the scope of Arena?
  2. How do we align Arena with Kubeflow?

Regarding #1, the overview says

Arena is a command-line interface for the data scientists to run and monitor the machine learning training jobs

But the Roadmap is much broader; for example it mentions multi-tenancy and training history management. How do these functions overlap with the idea of a CLI? Why wouldn't we tackle these problems in the broader context of Kubeflow rather than in the context of a CLI?

Regarding 2.

How are we going to align Arena with Kubeflow? One of the core principles of Kubeflow is that we don't introduce new patterns and tools for things already covered by Kubernetes. We also don't want to create Kubeflow specific solutions to problems that should be solved by Kubernetes. Do the Arena contributors agree with this principle?

Here are some examples

  1. Managing nodes, e.g. arena nodes, seems like core Kubernetes functionality handled by kubectl; do we need to build a custom container for this?
  2. One of the problems Arena is trying to solve is making it easy to go from source to job. Source-to-image is a general problem in Kubernetes; there are numerous tools evolving to solve it, and we have discussed trying to leverage them:
    • kubeflow/tf-operator#136 Support draft/skaffold for packaging
    • kubeflow/kubeflow#1240 Better Jupyter integration for TFJob
    • kubeflow/kubeflow#465 GitOps and dev tooling

Related discussions:

  • Configurable Dev Tooling: https://github.com/kubeflow/community/pull/54
  • Discussions of multi-tenancy and ACLs: https://github.com/kubeflow/community/pull/124
  • Scope TFOperator to namespace: https://github.com/kubeflow/tf-operator/issues/759
  • Create a historical record of objects: https://github.com/kubeflow/community/pull/46

cheyang commented 6 years ago

Thanks very much for sharing your ideas and a lot of helpful information with us! Before answering your questions, I want to clarify our goal and principles:

As part of Kubeflow, the goal of Arena is to bring Kubeflow to data scientists without the challenge of learning Kubernetes.

As for the principles of Arena, the first one is the key: we are trying to solve the customer's requirements and issues.

Here are our answers to the questions:

  1. What is the scope of Arena?

Arena is just a command-line facade over Kubeflow capabilities. We want to leverage the existing capabilities of Kubeflow to support historical records and multi-tenancy instead of building them ourselves. But if no one picks up these features, we will volunteer to solve them, not in Arena of course, but in Kubeflow. Because our customers are asking for them, we should make them happen.

  2. How do we align Arena with Kubeflow?

We agree with the core principle of Kubeflow that we should not introduce new patterns and tools for things already covered by Kubernetes, and we will align with this. The way to check this is that we will open feature proposals and issues before checking in code. Do you think that's fine?

For the specific questions:

Managing nodes, e.g. arena nodes, seems like core Kubernetes functionality handled by kubectl; do we need to build a custom container for this?

arena top node just helps the user identify the number of allocated GPUs and which pods have GPUs. We didn't find existing Kubernetes features that meet this requirement. I think checking GPU-related metrics is a strong requirement for data scientists but not for general Kubernetes users. GPU is not a common resource like CPU and memory, because it is exposed through the device plugin mechanism.
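
For context, approximating this with stock kubectl and jq is possible but clunky; it is roughly what a user would otherwise have to run by hand (a sketch, assuming the NVIDIA device plugin exposes GPUs as nvidia.com/gpu; the node name is illustrative):

    # Allocatable GPUs per node
    kubectl get nodes -o json \
      | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // 0) GPU(s)"'

    # Requested vs. allocatable resources (including GPUs) on one node
    kubectl describe node gpu-node-1 | grep -A 10 "Allocated resources"

    # Pods that request GPUs
    kubectl get pods --all-namespaces -o json \
      | jq -r '.items[]
          | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null))
          | "\(.metadata.namespace)/\(.metadata.name)"'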

One of the problems Arena is trying to solve is making it easy to go from source to job.

Not exactly. The data scientists don't want to put the training code inside the Docker image. The reason is that the base machine learning image (TensorFlow, CUDA, NCCL) is always changing, and they want to try different base images for experiments (to check performance). They really don't want to take care of this part. In a word, they hate building and maintaining Docker images. It's quite different from the DevOps scenario.

jlewi commented 6 years ago

Bring KubeFlow to data scientists without the challenge of learning kubernetes.

This is a great goal.

If this is the goal, why emphasize a CLI-based approach? The requests I hear most from data scientists are:

  1. Give me a notebook-centric experience
    • In particular, make it possible to train and deploy a model from a notebook
  2. Give me a UI

What do you think about moving Arena in one of the directions listed above?

My conjecture is that we can deliver a more data-scientist-friendly experience by focusing on notebooks, e.g. by making Arena:

  1. A high-level Python library for working with Kubeflow that hides Kubernetes details
  2. Creating Jupyter UI plugins to allow data scientists to visualize output/monitoring data
  3. Extending the existing TFJob UI, e.g. to allow users to upload code and automatically kick off a job

I think if we build a good Python library, it could provide a native notebook experience and then be used as the basis for a CLI.

/cc @wbuchwalter @kkasravi @pdmack

cheyang commented 6 years ago

/cc @wsxiaozhang @denverdino @cuericlee

cheyang commented 6 years ago

If this is the goal, why emphasize a CLI-based approach?

It's based on our experience of customer engagements. We have many AI customers from Internet companies, research institutes, banks, etc. We have found they are familiar with and prefer a CLI in a Linux terminal rather than a notebook. I think it may be a habit of our customers' data scientists. We tried to promote JupyterHub/notebooks to our customers, but they preferred CLI solutions through a Linux terminal, because that's what they are doing today. That's why we delivered Arena. I also see that Caicloud has gotten the same feedback from their customers.

What do you think about moving Arena in one of the directions listed above?

We don't have such a plan now because most of our customers are fine with the CLI solution, but if more customers ask for a UI, we will provide one for them.

kkasravi commented 6 years ago

We've gotten data scientist feedback that they also like CLIs with the ability to customize their CLI using Python. We should, if possible in this discussion, qualify areas where a UI may be preferred vs. iterative development such as training a model. In the latter case, we've been told that data scientists want to automate aspects of their workflows. We've also gotten feedback from some data scientists that working in a terminal is preferred over a notebook in a browser due to the higher latency of typing in a browser.

kkasravi commented 6 years ago

BTW, we spent some time on an earlier effort that was also based on spf13/cobra but decoupled the command from its execution, which was done with serverless functions that could be written in Python. For this we used kubeless. We spent some time making the command set extensible, so you could add and remove commands. Looking at your codebase, it looks like many similar ideas are implemented using the Kubernetes clientset, which kubeless utilizes under the covers. @jlewi is @kunmingg planning on extending gcp-click-to-deploy so other commands would be sent to the bootstrapper for execution? I know there is an active effort to unify kfctl.sh and gcp-click-to-deploy, but I wasn't sure if this extended beyond the deployment of components into the areas that arena has focused on.

jlewi commented 6 years ago

I see a lot of value in CLIs. The question I have is how Arena will evolve compared to generic CLIs in K8s.

Let's take an example:

arena submit tf --name=tf-dist-git \
    --gpus=1 \
    --workers=2 \
    --workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
    --syncMode=git \
    --syncSource=https://github.com/cheyang/tensorflow-sample-code.git \
    --ps=1 \
    --psImage=tensorflow/tensorflow:1.5.0-devel \
    --tensorboard \
    "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --logdir /training_logs"

It looks like this is doing two things:

  1. It's providing a wrapper around this helm package for TFJob
  2. It's injecting code into the image using an init container to run git (the pattern is sketched below)
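
For reference, the injection in point 2 is a standard Kubernetes pattern rather than anything Arena-specific: an init container clones the repo into a shared emptyDir volume that the training container mounts, so the code never has to be baked into the image. A minimal sketch of the pattern (not Arena's actual implementation; image tags and paths are illustrative):

    # git-clone-demo.yaml -- apply with: kubectl apply -f git-clone-demo.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: git-clone-demo
    spec:
      restartPolicy: Never
      volumes:
      - name: code
        emptyDir: {}
      initContainers:
      - name: clone-source
        image: alpine/git    # image entrypoint is `git`
        args: ["clone", "https://github.com/cheyang/tensorflow-sample-code.git", "/code/tensorflow-sample-code"]
        volumeMounts:
        - name: code
          mountPath: /code
      containers:
      - name: trainer
        image: tensorflow/tensorflow:1.5.0-devel
        command: ["python", "/code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py"]
        volumeMounts:
        - name: code
          mountPath: /code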

Isn't this the sort of workflow that tools like draft and skaffold are targeting?

It looks like the chart is trying to turn the entire YAML spec into a set of parameters.

Why not just publish a helm chart and use helm as the CLI? Why wrap helm in another, custom CLI?
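
For concreteness, using Helm itself as the CLI might look roughly like this (Helm 2 syntax of that era; the chart name and value keys are hypothetical, not Arena's actual chart):

    # hypothetical chart and value names
    helm install --name tf-dist-git kubeflow-charts/tfjob \
      --set worker.replicas=2 \
      --set worker.image=tensorflow/tensorflow:1.5.0-devel-gpu \
      --set worker.gpus=1 \
      --set ps.replicas=1 \
      --set ps.image=tensorflow/tensorflow:1.5.0-devel \
      --set git.source=https://github.com/cheyang/tensorflow-sample-code.git \
      --set tensorboard.enabled=true

    # the same tool then covers inspection and teardown
    helm status tf-dist-git
    helm delete tf-dist-git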

Rather than create really complex templates as in the TFJob chart, why not just create a set of example templates/charts that cover different use cases and encourage people to create templates specific to their needs?

Creating generic templates is really hard. I think we'll just end up adding more and more parameters to cover more use cases and eventually rewriting the API. Looking at the helm charts, it seems like that's what's happening. So instead of telling people to look at the APIs for our CRDs (TFJob, MPIJob, etc.) to figure out how to set something (e.g. an environment variable), they need to look at the chart and reverse engineer the templates. How is that better than just pointing folks at the Container Spec in the K8s docs?

@kkasravi, @kunmingg isn't working on extending the bootstrapper to perform other commands.

Ref: kubeflow/kubeflow#465 (dev tooling), kubeflow/tf-operator#136 (Draft for packaging)

jlewi commented 6 years ago

It would be good to discuss this at one of the community meetings. Unfortunately, I'm not sure I will be able to attend tomorrow's meeting, and in two weeks (the next meeting for the Asia Pacific time zone) I will be on vacation.

cheyang commented 6 years ago

Thanks, it's not urgent. We can discuss it when you are back. Have a good vacation.

cheyang commented 6 years ago

In fact, we have already provided some helm charts for TensorFlow and Horovod in https://github.com/helm/charts/tree/master/stable, and we tried to help customers use charts to cover model development, training, and serving:

  • Notebook
  • distributed-tensorflow
  • horovod
  • serving

Our customers used them, but they thought the charts were too complicated. When there are too many choices and options, customers feel confused and don't want to know that many details. They also dislike using both helm and kubectl; they want a single CLI to handle their daily work.

In our experience, data scientists only care about three questions:

  1. where to get the data and the source code
  2. how to run the distributed training easily
  3. how to check the logs and tensorboard easily

Our wrapper is trying to answer the questions above while avoiding exposing details. Arena is a CLI facade with machine learning domain knowledge: it not only submits the training job but also manages the job's lifecycle; it can get the status of the job and check the logs directly. Ordinary users can use arena directly without understanding charts, so it's easy for them to get started.
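
For illustration, the day-to-day loop after submitting a job looks roughly like this (a sketch; subcommand names as documented in the Arena README, flags omitted):

    arena list              # all submitted training jobs and their status
    arena get tf-dist-git   # details of one job and its instances
    arena logs tf-dist-git  # training logs, no kubectl required
    arena top node          # GPU allocation per node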

If the advanced users need to add more features, they can modify the chart directly.

Similar solutions to ours are the Floyd CLI from FloydHub and https://polyaxon.com/.

jlewi commented 6 years ago

@kkasravi @wbuchwalter @gaocegege thoughts?

One question I have: when would we suggest that users use lower-level tools (e.g. directly writing YAML files and using kubectl) vs. Arena?

jlewi commented 6 years ago

What do folks think about just starting to incorporate Arena and seeing where it leads?

kkasravi commented 6 years ago

@jlewi +1

I think using the clientset API within a Go program needs to be explored, and the reasons are similar to why the bootstrapper uses a REST API from Go. I would suggest we look for ways to make the API extensible so that new or different methods can be bound within the spf13/cobra command set. One area I had prototyped was dynamically loading .so's:

        // Requires the "os" and "plugin" imports; os.Args[1] is the path to the .so to load
        p, err := plugin.Open(os.Args[1])
        if err != nil {
                panic(err)
        }
        sym, err := p.Lookup("CmdFoo") // look up the exported command symbol
        if err != nil {
                panic(err)
        }
        _ = sym // bind sym into the spf13/cobra command set here

but adopting something similar to the kubectl plugin architecture may be more extensible.
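
With kubectl's binary-name plugin mechanism, any executable named kubectl-<name> on the PATH becomes `kubectl <name>`. A minimal sketch (the plugin name and what it wraps are hypothetical):

    # A hypothetical wrapper that just forwards to a standalone arena binary
    printf '#!/usr/bin/env bash\nexec arena "$@"\n' > /usr/local/bin/kubectl-arena
    chmod +x /usr/local/bin/kubectl-arena

    kubectl arena list   # kubectl discovers the plugin and dispatches to it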

gaocegege commented 6 years ago

I think for newcomers or entry-level data scientists/ML engineers, we should provide a CLI/simplified API to help them run their jobs easily, because some users do not understand the concepts of Kubernetes, do not know how to use kubectl to create resources on Kubernetes, and do not want to learn. That's why I suggested building a unified API layer here: https://docs.google.com/document/d/1RkNL6XY7rR4eaW1TuM-loMuX9Dm5pFi5wpaFnnrH5LM/edit?usp=sharing

As for advanced users, we should keep the Kubernetes way. That way, they can still do low-level configuration of their training jobs.

wsxiaozhang commented 6 years ago

@jlewi, good question about when we suggest users use kubectl or arena. I think it depends on the user's role. In our observation, most AI organizations have one data science team and one engineering/operations team. Data scientists focus on algorithms and data processing, and repeatedly submit training jobs to the cluster. Engineers/operators take care of the infrastructure and the K8s cluster. Most data scientists don't, and needn't, care about K8s or even Docker; all K8s-related details are transparent to them. Arena is created for them. Meanwhile, people can still manage everything via lower-level tools like kubectl/helm if they want to customize their jobs.

jlewi commented 6 years ago

I went ahead and created the arena repository. Please follow these directions to continue setting it up: https://github.com/kubeflow/community/blob/master/repository-setup.md

I've created an initial OWNERS file with @cheyang so he can approve changes including adding additional approvers and reviewers.

I created a new repo rather than transferring the existing repo, because I'd like a record of the CLA being signed as part of code submission.

We can use https://github-issue-mover.appspot.com/ to move issues if desired.

jlewi commented 6 years ago

@wsxiaozhang my question is more about when we tell users to switch from submitting jobs via arena/CLI to writing YAML files. For example, are there modifications (adding volumes, setting resource requests, environment variables) which arena will explicitly not support?

We had originally tried (using ksonnet prototype parameters) to make it easy for users to customize TFJob and TFServing just by setting parameters.

In practice, we found that this led to very complex prototypes that were hard to understand. As a result, we've been moving more in the direction of treating "prototypes" as examples that people copy and then modify.

As an example of the complexity you can look at the TFServing prototype https://github.com/kubeflow/kubeflow/blob/master/kubeflow/tf-serving/tf-serving.libsonnet

We wanted to make it easy for people to load their model from different object stores (e.g. GCS or S3). Each of these requires setting different environment variables and volume mounts; some of which might need to be customized by the user.

This leads to an ever-growing number of parameters the user can set (e.g. https://github.com/kubeflow/kubeflow/blob/4864eed319f5c562426de9e25f8c7bfaf52029c2/kubeflow/tf-serving/prototypes/tf-serving-all-features.jsonnet), many of which aren't relevant to the user (e.g. none of the S3 parameters are relevant if you aren't running on S3).

The complexity will increase as we try to support more ways of running Kubernetes. For example, at least in the past Azure and GCP used different names for GPU resources.

Looking at the arena command above, it already has 10 parameters. At what point is it more convenient and better for reproducibility to start checking in YAML files containing the parameters for each run?
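
For comparison, a rough sketch of what "checking in the YAML" might look like for approximately the same run as the arena command above (TFJob schema of the v1alpha2 era; values are illustrative, and the code is assumed to be available in the image or via a git-clone init container as sketched earlier):

    # tf-dist-git.yaml -- apply with: kubectl apply -f tf-dist-git.yaml
    apiVersion: kubeflow.org/v1alpha2
    kind: TFJob
    metadata:
      name: tf-dist-git
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
              - name: tensorflow    # tf-operator expects this container name
                image: tensorflow/tensorflow:1.5.0-devel
                command: ["python", "code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py", "--logdir", "/training_logs"]
        Worker:
          replicas: 2
          restartPolicy: Never
          template:
            spec:
              containers:
              - name: tensorflow
                image: tensorflow/tensorflow:1.5.0-devel-gpu
                command: ["python", "code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py", "--logdir", "/training_logs"]
                resources:
                  limits:
                    nvidia.com/gpu: 1

The trade-off under discussion is visible here: the file is more verbose than the one-line command, but every additional knob (volumes, environment variables, resource requests) is just the plain Kubernetes/TFJob API rather than another chart or CLI parameter.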

jlewi commented 6 years ago

I think a good model for a CLI to submit jobs is kubectl run.

kubectl run is very convenient for a certain set of use cases where you largely just need to specify 3 parameters (name, image, command) and creating a YAML file would be unnecessarily cumbersome.
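
For example (image and script path are illustrative):

    # name, image, and command are essentially the whole interface
    kubectl run mnist-test --image=tensorflow/tensorflow:1.5.0 --restart=Never \
      -- python /app/main.py --logdir /tmp/logs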

I don't think of kubectl run as creating a simpler API, since it's using the same API (Pod) as if you created the object yourself.

If we find the CLI moving in the direction of defining a substantially different job API than the underlying operators, we should pause and think about the path forward.

cheyang commented 6 years ago

I created a new repo rather than transferring the existing repo, because I'd like a record of the CLA being signed as part of code submission.

@jlewi Is it possible to transfer the existing repo? We'd love to keep the existing PRs, forks, and stars. We can require the CLA to be signed for new PRs. Thanks.

jlewi commented 6 years ago

Transfer is complete: https://github.com/kubeflow/arena