Thank you for the great write up! This is going to be really cool, and will push Zeus to another level of open source software. I'm super excited about it!!
Some random comments and questions:
- Running recurrences may overlap (a new recurrence can begin before previous recurrences finish). Make sure your code/architecture doesn't assume that it doesn't.
- PyTorchJobs inside the cluster and users outside the cluster should both be able to reach the Zeus server endpoint. The former can just be done with the default ClusterIP service type. Regarding the latter, since we're using a custom deployment of K8s (not a managed deployment like EKS), we can just start with a NodePort service.
- GET /jobs/{job_id} and GET /trials are inconsistent in terms of query vs path parameter.
- rec_i and trial_i in the API are not readable/comprehensible.
- /trials/report_profiling sounds awkward since "profiling" is a present participle, not a noun. Either /trials/report or /trials/report_profiling_result look better to me.
- ZeusDataLoader needs a way to know that it's inside a KubeFlow cluster and that it should make a POST request to the ZeusServer at the end of training.

Thanks for the great comments! Based on the comments and our discussion today, I listed all the changes I plan to make. Could you please take another look? Thx!!
- We will use Tortoise ORM + PostgreSQL.
- rec_i => recurrence_number
- trial_i => trial_number
ZeusServer:
- We will change our design from executing each Job sequentially to executing the Recurrences within one Job concurrently. This allows the Recurrences within one Job to overlap with each other. It also means the period of time between two recurrences will only depend on the trigger condition (when either the elapsed time is satisfied or a drift is detected). Each Recurrence is still executed sequentially.
- ZeusServer creates an asyncio.Task for each Recurrence without waiting for its completion. After all Recurrences are created, the job will wait for all of them to finish, and then exit (sketched below).
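A minimal sketch of that per-Job launch loop; run_recurrence and next_trigger are hypothetical helpers, not existing Zeus APIs:

```python
import asyncio

# Sketch only: `run_recurrence` and `next_trigger` are hypothetical helpers.
async def run_job(job_id: str, num_recurrences: int) -> None:
    tasks: list[asyncio.Task] = []
    for i in range(num_recurrences):
        if i > 0:
            # Block until the trigger condition: either the recurrence
            # period has elapsed or a drift is detected.
            await next_trigger(job_id)
        # Launch the recurrence without waiting for its completion, so
        # recurrences of the same job may overlap.
        tasks.append(asyncio.create_task(run_recurrence(job_id, recurrence_number=i)))
    # After all recurrences are created, wait for all of them to finish.
    await asyncio.gather(*tasks)
```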
- POST /trials/report_profiling_result: endpoint for PytorchJobs to report profiling results.
- GET /jobs: endpoint for users to query Jobs.
GET /jobs will list all the jobs submitted by users. We will further enable users to filter the jobs based on query parameters; for example, GET /jobs?phase=completed will list all the completed jobs.
job_id will also be defined as a query parameter instead of a path parameter. For example, we will use GET /jobs/?job_id=000001 instead of GET /jobs/000001 to query the information of the job with id 000001.
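A sketch of this query-parameter style, assuming a FastAPI-like endpoint and a hypothetical query_jobs DB helper:

```python
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

# Sketch only: FastAPI and `query_jobs` are assumptions for illustration.
@app.get("/jobs")
async def get_jobs(job_id: Optional[str] = None, phase: Optional[str] = None):
    # GET /jobs                 -> list all jobs
    # GET /jobs?phase=completed -> list all completed jobs
    # GET /jobs/?job_id=000001  -> the job with id 000001
    filters = {k: v for k, v in {"job_id": job_id, "phase": phase}.items() if v is not None}
    return await query_jobs(**filters)  # hypothetical DB helper
```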
In K8S, a Service defines a set of pods (a group of one or more containers with shared resources and specifications) and how to access them. In our design, ZeusServer and PytorchJobs are pods on the same cluster. The communication between them can be achieved by querying the ClusterIP. However, for users outside the cluster to send GET or POST requests, we will need to define a NodePort service.
We will write a client library that sends requests to the ZeusServer API. We will then create a CLI wrapper around the client library. These add-ons will ease future usage of this integration.
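A rough sketch of what the client library could look like; the endpoint paths and default port are illustrative, not the final API:

```python
from typing import Optional

import requests

class ZeusClient:
    """Thin wrapper around the ZeusServer HTTP API (sketch; names are illustrative)."""

    def __init__(self, host: str, port: int = 30080):
        # From outside the cluster, host:port is the NodePort address;
        # from inside, it can be the ClusterIP service's DNS name.
        self.base_url = f"http://{host}:{port}"

    def list_jobs(self, phase: Optional[str] = None) -> list:
        params = {"phase": phase} if phase else {}
        resp = requests.get(f"{self.base_url}/jobs", params=params)
        resp.raise_for_status()
        return resp.json()
```

A CLI wrapper can then map subcommands (e.g. listing jobs filtered by phase) onto these methods.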
Currently, the training script (train.py) contains both data preprocessing and training. However, we might want to decouple these two phases. For example, for experimentation purposes, we might want to run training multiple times on the same data to try out different batch sizes. In this case, preprocessing the data multiple times would be an unnecessary overhead.
In our design, the ideal case is that we run data preprocessing once before each Recurrence, and do NOT repeat it in each Trial.
Down this line, we plan to support user-defined frequencies for each stage in the ML pipeline. A snapshot of our plan:
- Use Argo Workflows to support modeling workflows as DAGs with back edges.

Thanks! Looks great.
A comment I want to add about being recurrence-overlap aware:
I am thinking about two solutions to the problems you mentioned.
1. We constrain the number of running recurrences for each job. The condition to launch a new recurrence will be: 1) the elapsed time is satisfied or a drift is detected, and 2) there is a free slot for this new recurrence. This constrains the memory required while providing some sense of concurrency.
2. Setting a lower bound for the recurrence period is hard. One potential solution is to set it to the duration of one recurrence. Then it is equivalent to no overlapping, i.e. only one recurrence runs at a time and the whole job executes sequentially. This also makes sense to me, since overlapping does waste some BSO feedback. In this case, the time between two recurrences will be
$$\max\big(T_{\text{one recurrence}},\ \min(T_{\text{recurrence period}},\ T_{\text{drift detected}})\big)$$
I like 1 better. This can also act like a resource usage cap for users to ensure fairness when multiple users use the same cluster.
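A minimal sketch of option 1, using an asyncio.Semaphore as the slot pool; the cap value and the run_recurrence/next_trigger helpers are hypothetical:

```python
import asyncio

MAX_RUNNING_RECURRENCES = 2  # illustrative per-job cap

# Sketch only: `run_recurrence` and `next_trigger` are hypothetical helpers.
async def run_job_with_cap(job_id: str, num_recurrences: int) -> None:
    slots = asyncio.Semaphore(MAX_RUNNING_RECURRENCES)
    tasks = []

    async def run_one(i: int) -> None:
        try:
            await run_recurrence(job_id, recurrence_number=i)
        finally:
            slots.release()  # free the slot when this recurrence finishes

    for i in range(num_recurrences):
        if i > 0:
            await next_trigger(job_id)  # condition 1: period elapsed or drift detected
        await slots.acquire()           # condition 2: a free slot exists
        tasks.append(asyncio.create_task(run_one(i)))
    await asyncio.gather(*tasks)
```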
Just dumping a quick thought about MetricManager.
The current implementation of ZeusDataLoader keeps profiled metrics as class variables so that they can be shared between the train and eval dataloaders, but arguably this design is not the easiest to understand. Instead, maybe the FileBackedMetricManager class can keep JSON file(s) on the local filesystem. All reads and writes of throughput, power, etc. go to the file.
It'll be possible to create a write-back cache inside the metric manager and share the metric manager between the train and eval dataloaders by storing it as a class variable of ZeusDataLoader. Or, you may just not do this and have all reads/writes access the JSON file.
I will create a MetricManager that stores and manages power and train metrics. The MetricManager will be fully file-backed, reading from and writing to the local filesystem. If we are in the Kubeflow mode, besides writing to the local filesystem, MetricManager will POST the metrics to ZeusServer, which is responsible for storing them in the DB.
This design will abstract away the metric storage and reporting logic from the current ZeusDataLoader.
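A minimal sketch of the file-backed part, assuming one JSON file per job; the file layout and method names are illustrative:

```python
import json
from pathlib import Path

# Sketch only: the file layout and method names are illustrative.
class FileBackedMetricManager:
    def __init__(self, path: str):
        self.path = Path(path)
        if not self.path.exists():
            self.path.write_text("{}")

    def write(self, key: str, value) -> None:
        # All reads and writes go through the JSON file, so the train and
        # eval dataloaders can share state without class variables.
        metrics = json.loads(self.path.read_text())
        metrics[key] = value
        self.path.write_text(json.dumps(metrics))

    def read(self, key: str):
        return json.loads(self.path.read_text()).get(key)
```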
What's the status of this? @Rosie-m @jaywonchung
@Rosie-m will clean and push existing code in a branch and document progress.
I have updated the issue content based on the design changes we discussed and added a summary of the current progress. The existing code can be found in the kubeflow branch. @jaywonchung @mosharaf
Everything looks good! Thank you for your work @Rosie-m 👍
Motivation
Kubeflow is an open-source Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable ML workloads. This integration will enable developers in the industry to directly deploy Zeus onto their Kubeflow cluster, or serve as an example of integrating Zeus into their internal MLOps platforms. By facilitating the adoption of Zeus into the industry, we hope to encourage tech companies to try out Zeus and make their ML systems energy-efficient.
Brief Background
In Zeus, a job will sequentially recur a given number of times. In each recurrence, it will launch the training script until it converges (i.e., reaches a user-defined target metric) or reaches the upper limit of retries. Before each trial, we use the batch size optimizer (BSO), which runs Multi-Armed Bandits, to predict the next batch size to use. After each trial, we feed the result of this trial, including time, energy, cost, and whether the trial converged, back to the BSO to help future predictions. The one-time launch of the training script and the pre- and post-interaction with the BSO is referred to as a trial.

Main Challenges

Zeus relies on the local filesystem to store the training and profiling results. But since we are using Kubernetes to automatically launch the training script, and the Kubernetes control plane automatically handles scheduling the pods across the nodes in the cluster, we can no longer assume everything happens on the same node. Instead, we need to take another approach to keep the training and profiling results across runs.

Proposed Design
We envision building an end-to-end system that allows users to use Zeus transparently. Zeus has two key components: a just-in-time (JIT) online profiler and a batch size optimizer based on a multi-armed bandit. We will have a server that asynchronously serves the BSO for multiple jobs, and an extended JIT online profiler running on the client side that reports profiling and training results back to the server.
Overview
Zeus + Kubeflow will contain the following components:

Server-side:
- ZeusServer: A server that accepts user-submitted jobs, serves BSO, and handles user queries.
- Database (DB): Stores all the log data, including submitted jobs, launched trials, and profiling data (power, time, and so on). This is the single source of truth in our design.

Client-side:
- ZeusDataLoader: we will extend Zeus online profiling by incorporating a MetricManager to report results to the server.

An End-to-End View
The above figure shows an end-to-end view. We now explain how the components work together.
Job Creation and Trial Launch

For the first Trial of a Job (a newly submitted Job):
1. ZeusDataLoader.__init__ creates a new UUID as the job_id for this Job.
2. ZeusDataLoader.__init__ sends a POST request to ZeusServer with its job_id, asking for the batch size to use for this Trial.
3. ZeusServer validates the POST request to create a new job and inserts the new Job into DB.Jobs.
4. ZeusServer inserts the Trial into DB.Trials.
5. ZeusServer creates an asyncio.Task to serve BSO for the Job.
6. ZeusServer replies with the batch size to the user.
7. ZeusDataLoader.__init__ will then use the batch size to initialize the DataLoader and start training.

For subsequent Trials of the same Job:
1. ZeusDataLoader.__init__ sends a POST request to ZeusServer with the existing job_id, asking for the batch size to use for this Trial.
2. ZeusServer validates the POST request for a new Trial with an existing Job.
3. ZeusServer inserts the Trial into DB.Trials.
4. ZeusServer routes the request to the existing asyncio.Task of this Job.
5. ZeusServer replies with the batch size to the user.
6. ZeusDataLoader.__init__ will then use the batch size to initialize the DataLoader and start training.

NOTE: Each Job has one asyncio.Task that serves its BSO instance.
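A sketch of the client-side half of this flow; the endpoint path and payload shape are assumptions, not the final API:

```python
import uuid
from typing import Optional, Tuple

import requests

# Sketch only: the endpoint path and payload fields are illustrative.
def register_and_get_batch_size(server_url: str, job_id: Optional[str] = None) -> Tuple[str, int]:
    if job_id is None:
        job_id = str(uuid.uuid4())  # first Trial of a new Job
    resp = requests.post(f"{server_url}/jobs", json={"job_id": job_id})
    resp.raise_for_status()
    # ZeusServer inserts the Job/Trial into the DB, routes the request to
    # the Job's asyncio.Task, and replies with the batch size from BSO.
    return job_id, resp.json()["batch_size"]
```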
Training and Profiling

Client-side:
- A Trial runs ./train.py once.
- At the end of profiling, the Trial needs to report the ProfilingResult to ZeusServer. ZeusServer will then insert it into DB.Profiling.
- At exit, the Trial needs to report (energy, time, cost, num_epochs, reached) to ZeusServer. ZeusServer will then update the corresponding record of this Trial in DB.Trials.

How does ZeusDataLoader report profiling/training results to the server? ZeusDataLoader does a POST request to ZeusServer with the TrialResult (or ProfilingResult).

Server-side:
1. The API endpoint handler passes the TrialResult (or ProfilingResult) to the global ZeusServer class instance.
2. The ZeusServer instance routes the TrialResult (or ProfilingResult) to the asyncio.Task for that job through an asyncio.Queue channel.
3. The Task was await _job_trial_result_channel.get()-ing, and gets the TrialResult (or was await _job_profiling_result_channel.get()-ing, and gets the ProfilingResult).
4. The Task updates the TrialResult in DB.Trials (or inserts the ProfilingResult into DB.Profiling).
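A simplified sketch of the per-Job task and its channels; update_db_trials stands in for a hypothetical DBAPI function:

```python
import asyncio

# Sketch only: a simplified per-Job task; `update_db_trials` is a
# hypothetical DBAPI function.
class JobTask:
    def __init__(self) -> None:
        self._job_trial_result_channel: asyncio.Queue = asyncio.Queue()
        self._job_profiling_result_channel: asyncio.Queue = asyncio.Queue()

    async def serve_bso(self) -> None:
        while True:
            # The per-Job asyncio.Task blocks here until ZeusServer routes
            # a result into the channel.
            trial_result = await self._job_trial_result_channel.get()
            await update_db_trials(trial_result)
            # ... feed the result back to this Job's BSO instance ...
```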
Detailed Look into Each Component

Here, we will explain the details of each component and what each provides.
ZeusServer

ZeusServer contains the following sub-components:
- DBAPI: an INSERT/UPDATE interface that interacts with the DB for storing and querying information across jobs and trials.
- BatchSizeOptimizer (BSO): a component from Zeus that predicts the optimal batch size. BSO learns from the feedback (the results of each trial) and adjusts its internal states. BSO achieves this by implementing Multi-Armed Bandits (MAB) with Thompson sampling.
Database

The Database stores states across jobs and trials. It contains three tables: Jobs, Trials, and Profiling.

Jobs
- A Job is inserted by ZeusServer after a user submits it to ZeusServer.
- The phase of a job is changed from Running to Completed after num_recurrences recurrences have been done.

Trials
- Records all trials created.
- A Trial's record is updated by ZeusServer after its completion.
- This table contains the same information as train_json in the original Zeus.

Profiling
- For each job, records the following mappings:
  - For each profiled power limit (power_limit): power_limit -> train_avg_power and power_limit -> train_tput.
  - For the optimal power limit (opt_power_limit): opt_power_limit -> eval_avg_power and opt_power_limit -> eval_tput.
- Append-only: one record per Trial, inserted at the first end of epoch when profiling is done.
- This table contains the same information as power_json in the original Zeus.

(Example job: CIFAR100 with ShuffleNet)
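Since we will use Tortoise ORM + PostgreSQL, the tables could look roughly like the following; the field names are assumptions based on the descriptions above, not the final schema:

```python
from tortoise import fields
from tortoise.models import Model

# Sketch only: field names are assumptions, not the final schema.
class Job(Model):
    job_id = fields.UUIDField(pk=True)
    phase = fields.CharField(max_length=16, default="Running")  # -> "Completed"
    num_recurrences = fields.IntField()

class Trial(Model):
    id = fields.IntField(pk=True)
    job = fields.ForeignKeyField("models.Job", related_name="trials")
    trial_number = fields.IntField()
    batch_size = fields.IntField()
    time = fields.FloatField(null=True)
    energy = fields.FloatField(null=True)
    cost = fields.FloatField(null=True)
    num_epochs = fields.IntField(null=True)
    reached = fields.BooleanField(null=True)

class Profiling(Model):  # append-only
    id = fields.IntField(pk=True)
    job = fields.ForeignKeyField("models.Job", related_name="profiling")
    power_limit = fields.IntField()
    train_avg_power = fields.FloatField()
    train_tput = fields.FloatField()
```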
Extended ZeusDataLoader

We will extend ZeusDataLoader so that it can:
- Register the job and query the batch size at initialization.
  - Each job has a job_id. This is also the unique identifier in the DB.Jobs table on the server side.
  - The job is registered with ZeusServer. The BSO served on the server side will then generate the next batch size to use and reply to the client.
- Report results.
  - TrialResult: at exit, ZeusDataLoader will report the TrialResult, including time, energy, cost, the number of epochs completed, and whether the target metric was reached, to ZeusServer.
  - ProfilingResult: after profiling is done, ZeusDataLoader will report the ProfilingResult to ZeusServer.
  - We will add a MetricManager to ZeusDataLoader for the above two reporting purposes. MetricManager will decide whether to store the results in the local filesystem (the practice of the original Zeus) or send POST requests to ZeusServer (required in this integration); see the sketch below.
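A sketch of that decision, assuming a ZEUS_KUBEFLOW environment variable as the mode switch; the variable name and endpoint path are illustrative:

```python
import json
import os

import requests

# Sketch only: the ZEUS_KUBEFLOW variable and endpoint path are assumptions.
class MetricManager:
    def __init__(self, job_id: str, server_url: str, metrics_path: str):
        self.job_id = job_id
        self.server_url = server_url
        self.metrics_path = metrics_path
        self.kubeflow_mode = os.environ.get("ZEUS_KUBEFLOW", "0") == "1"

    def report(self, kind: str, payload: dict) -> None:
        # Always keep the original Zeus behavior: write to the local filesystem.
        with open(self.metrics_path, "w") as f:
            json.dump(payload, f)
        if self.kubeflow_mode:
            # In Kubeflow mode, additionally POST the result to ZeusServer.
            resp = requests.post(
                f"{self.server_url}/trials/report_{kind}_result",
                json={"job_id": self.job_id, **payload},
            )
            resp.raise_for_status()
```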
Communication

In Kubernetes, a Service defines a set of pods (a group of one or more containers with shared resources and specifications) and how to access them. In our design, ZeusServer and PytorchJobs are pods on the same cluster. The communication between them can be achieved by querying the ClusterIP. However, for users outside the cluster to send GET or POST requests, we will need to define a NodePort service.
Failure Handling

- If a Trial ends with reached == False (the target metric was not reached), the Job launches another Trial, up to the upper limit of retries.
Testing with Kubeflow/PytorchJob

For Kubeflow users, it is very convenient to create PytorchJobs to launch their training scripts (i.e., Trials). We need to do the same while testing our design. The following are the things we need to take care of when creating PytorchJobs:
- We create PytorchJobs with the Kubernetes API, which allows us to configure the command to run as well as environment variables. Same as Zeus, the job-specific parameters are specified in the command and the Zeus-specific parameters are specified as environment variables.
- The container needs the SYS_ADMIN capability (Zeus needs it to change GPU power limits).

Here is an example in the .yaml format.
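As a rough Python-API equivalent of such a manifest, a hedged sketch using the Kubernetes Python client; the image name, command, and env values are illustrative assumptions:

```python
from kubernetes import client, config

# Sketch only: image name, command, and env values are illustrative.
config.load_kube_config()

pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "zeus-trial"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "pytorch",
                            "image": "zeus:latest",  # illustrative image name
                            # Job-specific parameters go in the command ...
                            "command": ["python", "train.py", "--batch_size", "128"],
                            # ... and Zeus-specific parameters in env variables.
                            "env": [{"name": "ZEUS_TARGET_METRIC", "value": "0.50"}],
                            # SYS_ADMIN is required to change GPU power limits.
                            "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}},
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="pytorchjobs", body=pytorchjob,
)
```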
Future Directions

Selective repetitiveness with different frequencies

Currently, the training script (train.py) contains both data preprocessing and training. However, we might want to decouple these two phases. For example, for experimentation purposes, we might want to run training multiple times on the same data to try out different batch sizes. In this case, preprocessing the data multiple times would be an unnecessary overhead.
In our design, the ideal case is that we run data preprocessing once before each Recurrence, and do NOT repeat it in each Trial.
Down this line, we plan to support user-defined frequencies for each stage in the ML pipeline. A snapshot of our plan:
- Use Argo Workflows to support modeling workflows as DAGs with back edges.

Progress
Following is the list of components and the latest progress:

ZeusServer:
- Endpoints for communication between ZeusServer and clients: kube/server/main.py.
- Request/response and data models: kube/server/models.py.
- The ZeusServer class that manages the training jobs: kube/server/server.py. This singleton class contains all the functions that do the "real" work on the server side.
- DB schema: kube/db/schema.md.
- DB APIs that ZeusServer will use to store and query states: kube/server/dbapis.py. Basically, they send a DB query and return the result to ZeusServer.

ZeusDataLoader:
- Accept job_id from the user as a CLI argument. Note this is a job-specific parameter.
- In ZeusDataLoader.__init__, register the job (i.e., POST to ZeusServer) and receive the batch size before initializing the DataLoader.
- Report the ProfilingResult when profiling is done.
- Report the TrialResult when training is done.

Environment Setup
- SymbioticLab/Zeus.
- Docker as the container runtime.

Learning Materials