ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

[RFC] Kubeflow Integration #11

Closed Rosie-m closed 2 months ago

Rosie-m commented 1 year ago

image-zeus-kube

Motivation

Kubeflow is an open-source, Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable ML workloads. This integration will enable developers in industry to deploy Zeus directly onto their Kubeflow clusters, or serve as an example of integrating Zeus into their internal MLOps platforms. By facilitating the adoption of Zeus in industry, we hope to encourage tech companies to try out Zeus and make their ML systems energy-efficient.

Brief Background

In Zeus, a job recurs sequentially a given number of times. In each recurrence, it launches the training script until the job converges (i.e., reaches a user-defined target metric) or hits the upper limit of retries.

Before each trial, the batch size optimizer (BSO), which runs a Multi-Armed Bandit, predicts the batch size to use for that trial. After each trial, we feed the result, including time, energy, cost, and whether the trial converged, back to the BSO to inform future predictions. A single launch of the training script, together with the pre- and post-interaction with the BSO, is referred to as a trial.
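To make the trial flow concrete, the sketch below shows roughly how one recurrence might interact with a bandit-based BSO. The `BatchSizeOptimizer` class, its `predict`/`observe` methods, and `run_training_script` are hypothetical names for illustration, not the actual Zeus API.

```python
import random
from collections import defaultdict

def run_training_script(batch_size: int) -> tuple[bool, float]:
    # Placeholder: launch train.py once with this batch size and return (converged, cost).
    return random.random() < 0.3, random.uniform(1e4, 2e5)

class BatchSizeOptimizer:
    """Illustrative multi-armed bandit over candidate batch sizes (not the real Zeus BSO)."""

    def __init__(self, batch_sizes: list[int]) -> None:
        # One Beta(success, failure) arm per candidate batch size.
        self.batch_sizes = batch_sizes
        self.success = defaultdict(lambda: 1.0)
        self.failure = defaultdict(lambda: 1.0)

    def predict(self) -> int:
        # Thompson sampling: sample each arm's posterior and pick the most promising batch size.
        samples = {bs: random.betavariate(self.success[bs], self.failure[bs]) for bs in self.batch_sizes}
        return max(samples, key=samples.get)

    def observe(self, batch_size: int, converged: bool, cost: float) -> None:
        # Post-trial feedback: update the arm for this batch size.
        if converged:
            self.success[batch_size] += 1.0
        else:
            self.failure[batch_size] += 1.0

# One recurrence: keep launching trials until convergence or the retry limit.
bso = BatchSizeOptimizer([32, 64, 128, 256, 512, 1024])
for retry in range(20):                          # max_retries
    bs = bso.predict()                           # pre-trial interaction with the BSO
    converged, cost = run_training_script(bs)    # one launch of the training script = one trial
    bso.observe(bs, converged, cost)             # post-trial feedback to the BSO
    if converged:
        break
```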

image-20230104145850817

Main Challenges

Proposed Design

We envision building an end-to-end system that allows users to use Zeus transparently. Zeus has two key components: a just-in-time (JIT) online profiler and a batch size optimizer (BSO) based on a multi-armed bandit. We will have a server that asynchronously serves the BSO for multiple jobs, and an extended JIT online profiler running on the client side that reports profiling and training results back to the server.

Overview

Zeus + Kubeflow will contain the following components:

An End-to-End View

image

The above figure shows an end-to-end view. We now explain how the components work together.

Job Creation and Trial Launch

image

NOTE: Each Job has one asyncio.Task that serves its BSO instance.
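As a rough illustration of the "one asyncio.Task per Job" idea, the sketch below spawns one long-lived task per Job that owns that Job's BSO instance. `serve_bso`, `register_job`, and the queue-based layout are hypothetical, and `BatchSizeOptimizer` refers to the illustrative class sketched earlier.

```python
import asyncio

# job_id -> queue of trial reports, and job_id -> the task serving that Job's BSO (hypothetical layout).
job_queues: dict[str, asyncio.Queue] = {}
bso_tasks: dict[str, asyncio.Task] = {}

async def serve_bso(job_id: str, queue: asyncio.Queue) -> None:
    """Long-lived task that owns one Job's BSO instance and consumes its trial reports."""
    bso = BatchSizeOptimizer([32, 64, 128, 256, 512, 1024])
    while True:
        report = await queue.get()   # waits until a PytorchJob reports a trial result
        bso.observe(report["batch_size"], report["converged"], report["cost"])

def register_job(job_id: str) -> None:
    """Called (from a running event loop) when a new Job is created: spawn exactly one serving task."""
    queue: asyncio.Queue = asyncio.Queue()
    job_queues[job_id] = queue
    bso_tasks[job_id] = asyncio.create_task(serve_bso(job_id, queue))
```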

Training and Profiling

image

Here, we will explain the details of each component and what they provide.

ZeusServer

ZeusServer contains the following sub-components:


Database

The database stores state across jobs and trials.

It contains three tables: Jobs, Trials, and Profiling.

Jobs

| job_id | user_id | seed | default_batch_size | min_batch_size | max_batch_size | eta_knob | beta_knob | target_metric | max_epochs | num_recurrences | max_retries | phase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | luoxim | 1 | 1024 | 8 | 4096 | 0.5 | 2.0 | 0.50 | 100 | 100 | 20 | Running |

Trials

| job_id | rec_i | trial_i | batch_size | time | energy | cost | num_epochs | reached | phase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | 1 | 1 | 1024 | 508.696199872531 | 117868.43460299837 | 135238.64728237884 | 28 | true | Running |

Profiling

| job_id | value_type | batch_size | phase | power_limit | value | rec_i | trial_i |
| --- | --- | --- | --- | --- | --- | --- | --- |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | power | 32 | train | 300000 | 131.93493277891338 | 6 | 1 |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | power | 32 | train | 275000 | 123.66380334160725 | 6 | 1 |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | tput | 32 | train | 300000 | 31.03646417467191 | 6 | 1 |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | tput | 32 | train | 275000 | 29.93935643421058 | 6 | 1 |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | power | 32 | eval | 175000 | 125.63629920513313 | 6 | 1 |
| d956a2b5-ce07-44e3-879d-9f257a3acb08 | tput | 32 | eval | 175000 | 114.86617394848754 | 6 | 1 |

(Example job CIFAR100 with ShuffleNet)

Extended ZeusDataLoader

We will extend ZeusDataLoader so that:

Communication

In Kubernetes, a Service defines a set of pods (a group of one or more containers with shared resources and specifications) and how to access them. In our design, ZeusServer and PytorchJobs are pods on the same cluster. The communication between them can be achieved by querying Cluster-IP. However, for users outside the cluster to send GET or POST requests, we will need to define a NodePort service.
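For example, a PytorchJob pod inside the cluster could reach ZeusServer through the Service's cluster-internal DNS name. The service name `zeus-server`, the namespace, port, and payload below are placeholders for illustration.

```python
import httpx

# Cluster-internal DNS name of a hypothetical ZeusServer Service: <service>.<namespace>.svc.cluster.local
ZEUS_SERVER_URL = "http://zeus-server.kubeflow.svc.cluster.local:8000"

# A PytorchJob pod reports one profiling result to ZeusServer over the ClusterIP Service.
response = httpx.post(
    f"{ZEUS_SERVER_URL}/trials/report_profiling_result",
    json={
        "job_id": "d956a2b5-ce07-44e3-879d-9f257a3acb08",
        "value_type": "power",
        "batch_size": 32,
        "phase": "train",
        "power_limit": 300000,
        "value": 131.93,
        "rec_i": 6,
        "trial_i": 1,
    },
)
response.raise_for_status()
```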

Failure Handling

Testing with Kubeflow/PytorchJob

For Kubeflow users, it is very convenient to create PytorchJobs to launch their training scripts (i.e., a Trial). We need to do the same when testing our design. The following are the things we need to take care of when creating PytorchJobs:

Here is an example in the .yaml format.

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: cifar100
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: symbioticlab/zeus:latest
              imagePullPolicy: Always
              command:
                - "python"
                - "/workspace/zeus/examples/cifar100/train.py"
                - "--zeus"
                - "--arch"
                - "shufflenetv2"
                - "--epochs"
                - "100"
                - "--batch_size"
                - "128"
              env:
                - name: ZEUS_TARGET_METRIC
                  value: "0.50"
                - name: ZEUS_LOG_DIR
                  value: "zeus_log"
                - name: ZEUS_JOB_ID
                  value: "zeus"
                - name: ZEUS_COST_THRESH
                  value: "inf"
                - name: ZEUS_ETA_KNOB
                  value: "0.5"
                - name: ZEUS_MONITOR_PATH
                  value: "/workspace/zeus/zeus_monitor/zeus_monitor"
                - name: ZEUS_PROFILE_PARAMS
                  value: "10,40"
                - name: ZEUS_USE_OPTIMAL_PL
                  value: "True"
              securityContext:
                capabilities:
                  add: ["SYS_ADMIN"]
              resources:
                limits:
                  nvidia.com/gpu: 1 # requesting 1 GPU
```

Future Directions

Selective repetitiveness with different frequencies

Currently, the training script (train.py) contains both data preprocessing and training. However, we might want to decouple these two phases. For example, for experimentation we might want to run training multiple times on the same data to try out different batch sizes; in that case, preprocessing the data multiple times would be unnecessary overhead.

In our design, the ideal case is to run data preprocessing once before each Recurrence and NOT repeat it in each Trial.

Further down the line, we plan to support user-defined frequencies for each stage in the ML pipeline. A snapshot of our plan:

Progress

Following is the list of components and the latest progress:

Environment Setup

  1. Git clone SymbioticLab/Zeus.
  2. Get k3s with Docker as the container runtime: `curl -sfL https://get.k3s.io | sh -s - --docker`
  3. Install Kubeflow manually.
  4. Install FastAPI: `pip install "fastapi[all]"`

Learning Materials

jaywonchung commented 1 year ago

Thank you for the great write up! This is going to be really cool, and will push Zeus to another level of open source software. I'm super excited about it!!

Some random comments and questions:

Rosie-m commented 1 year ago

Thanks for the great comments! Based on the comments and our discussion today, I listed all the changes I plan to make. Could you please take another look? Thx!!

Add the Choice of ORM and DB

We will use Tortoise ORM + PostgreSQL.
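As a reference point, here is a minimal sketch of what the Jobs and Trials tables could look like as Tortoise ORM models, using the renamed columns below; the field choices and the `models` app label are assumptions, not the final schema.

```python
from tortoise import fields
from tortoise.models import Model

class Job(Model):
    """One row per submitted job (mirrors the Jobs table above)."""
    job_id = fields.UUIDField(pk=True)
    user_id = fields.CharField(max_length=64)
    seed = fields.IntField()
    default_batch_size = fields.IntField()
    min_batch_size = fields.IntField()
    max_batch_size = fields.IntField()
    eta_knob = fields.FloatField()
    beta_knob = fields.FloatField()
    target_metric = fields.FloatField()
    max_epochs = fields.IntField()
    num_recurrences = fields.IntField()
    max_retries = fields.IntField()
    phase = fields.CharField(max_length=16)

class Trial(Model):
    """One row per trial; results are reported to the server after each trial."""
    id = fields.IntField(pk=True)
    job = fields.ForeignKeyField("models.Job", related_name="trials")
    recurrence_number = fields.IntField()
    trial_number = fields.IntField()
    batch_size = fields.IntField()
    time = fields.FloatField()
    energy = fields.FloatField()
    cost = fields.FloatField()
    num_epochs = fields.IntField()
    reached = fields.BooleanField()
    phase = fields.CharField(max_length=16)
```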

Update Naming

rec_i => recurrence_number

trial_i => trial_number

Update ZeusServer

Execution of a single Job

We will change our design from executing each Job sequentially to executing the Recurrences within one Job concurrently, allowing them to overlap with each other. The period of time between two recurrences then depends only on the trigger condition (either the elapsed time is satisfied or a drift is detected). Each Recurrence itself is still executed sequentially.

API Endpoints

POST /trials/report_profiling_result

Endpoint for PytorchJobs to report profiling results.

GET /jobs

Endpoint for users to query Jobs.

GET /jobs will list all the jobs submitted by users. We will further enable users to filter the jobs based on query parameters, for example, GET /jobs?phase=completed will list all the completed jobs.

Also, job_id will be defined as a query parameter instead of a path parameter. For example, we will use GET /jobs/?job_id=000001 instead of GET /jobs/000001 to query the information of the job with id 000001.
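A rough FastAPI sketch of these two endpoints; the response shapes and the placeholder DB helpers (`query_jobs_from_db`, `insert_profiling_row`) are illustrative only.

```python
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

async def query_jobs_from_db() -> list[dict]:
    # Placeholder for a Tortoise ORM query against the Jobs table.
    return []

async def insert_profiling_row(result: dict) -> None:
    # Placeholder for a Tortoise ORM insert into the Profiling table.
    pass

@app.get("/jobs")
async def list_jobs(job_id: Optional[str] = None, phase: Optional[str] = None):
    """List jobs, optionally filtered by query parameters, e.g. GET /jobs?phase=completed."""
    jobs = await query_jobs_from_db()
    if job_id is not None:
        jobs = [j for j in jobs if j["job_id"] == job_id]
    if phase is not None:
        jobs = [j for j in jobs if j["phase"].lower() == phase.lower()]
    return jobs

@app.post("/trials/report_profiling_result")
async def report_profiling_result(result: dict):
    """Endpoint for PytorchJobs to report one profiling result."""
    await insert_profiling_row(result)
    return {"status": "ok"}
```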

K8S Services

In K8S, a Service defines a set of pods (a group of one or more containers with shared resources and specifications) and how to access them. In our design, ZeusServer and PytorchJobs are pods on the same cluster. The communication between them can be achieved by querying Cluster-IP. However, for users outside the cluster to send GET or POST requests, we will need to define a NodePort service.

Update Roadmap

Client Library & CLI Wrapper

We will write a client library that sends requests to the ZeusServer API. We will then create a CLI wrapper around the client library. These add-ons will ease future usage of this integration.
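A very small sketch of what the client library plus CLI wrapper could look like; the module layout, function names, and default server URL are placeholders.

```python
import argparse

import httpx

def list_jobs(server_url: str, phase: str | None = None) -> list[dict]:
    """Client-library call wrapping GET /jobs (optionally filtered by ?phase=...)."""
    params = {"phase": phase} if phase else {}
    response = httpx.get(f"{server_url}/jobs", params=params)
    response.raise_for_status()
    return response.json()

def main() -> None:
    # Thin CLI wrapper around the client library.
    parser = argparse.ArgumentParser(prog="zeus-client")
    parser.add_argument("--server-url", default="http://localhost:8000")
    parser.add_argument("--phase", default=None, help="Filter jobs by phase, e.g. completed")
    args = parser.parse_args()
    for job in list_jobs(args.server_url, args.phase):
        print(job["job_id"], job.get("phase"))

if __name__ == "__main__":
    main()
```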

Add Future Directions

Selective repetitiveness with different frequencies

Currently, the training script (train.py) contains both data preprocessing and training. However, we might want to decouple these two phases. One example is that, for the experiment purpose, we might want to run training multiple times on the same data to try out different batch sizes. In this case, preprocessing data multiple times will be an unnecessary overhead.

In our design, the ideal case will be that we run data preprocessing once before each Recurrence, and do NOT repeat it in each Trial.

Down this line, we plan to support user-defined frequencies for each stage in the ML pipeline. A snapshot of our plan will be:

jaywonchung commented 1 year ago

Thanks! Looks great.

Comment I want to add for being recurrence overlap aware:

  1. Configured poorly (e.g. with recurrence period one second and no maximum recurrences set), this design will lead to an explosion of PyTorchJobs. Not sure what the best way to handle this situation is, yet.
  2. If we just keep a list of Tasks associated with each recurrence and if the job recurs infinitely, memory will keep increasing. We'll have to periodically traverse the list of Tasks and remove those that completed.
Rosie-m commented 1 year ago

I am thinking about two solutions to the problems you mentioned.

  1. We constrain the number of running recurrences for each job. The condition to launch a new recurrence will be: 1) the elapsed time is satisfied or a drift is detected, and 2) there is a free slot for the new recurrence. This bounds the memory required while still providing some concurrency (a rough sketch follows below).

  2. Setting a lower bound for the recurrence period is hard. One potential solution is to set it as the time for one recurrence. Then, it is equivalent to no overlapping, i.e. only one recurrence is running and the whole job will be executed sequentially. This also makes sense to me, since the overlapping does waste some BSO feedback. In this case, the time between two recurrences will be

$$\max{[ T(OneRecurrence), \min{[T(RecurrencePeriod), T(DriftDetected)]} ]}$$
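For solution 1 above, bounding the number of in-flight recurrences per Job could look roughly like the sketch below; `run_one_recurrence`, `wait_for_trigger`, and `max_concurrent_recurrences` are hypothetical placeholders.

```python
import asyncio

async def run_one_recurrence(job_id: str, rec_i: int) -> None:
    # Placeholder: run this recurrence's trials until convergence or the retry limit.
    await asyncio.sleep(1)

async def wait_for_trigger(job_id: str) -> None:
    # Placeholder: wait until the recurrence period elapses or a drift is detected.
    await asyncio.sleep(0.1)

async def run_job(job_id: str, num_recurrences: int, max_concurrent_recurrences: int) -> None:
    # A free "slot" is required before a new recurrence may start (solution 1).
    slots = asyncio.Semaphore(max_concurrent_recurrences)
    running: set[asyncio.Task] = set()

    async def one_recurrence(rec_i: int) -> None:
        try:
            await run_one_recurrence(job_id, rec_i)
        finally:
            slots.release()   # free the slot when the recurrence finishes

    for rec_i in range(num_recurrences):
        await wait_for_trigger(job_id)
        await slots.acquire()                           # blocks when no slot is free -> bounded concurrency
        task = asyncio.create_task(one_recurrence(rec_i))
        running.add(task)
        task.add_done_callback(running.discard)         # drop finished Tasks so the set does not grow forever

    await asyncio.gather(*running)

# asyncio.run(run_job("d956a2b5-ce07-44e3-879d-9f257a3acb08", num_recurrences=100, max_concurrent_recurrences=2))
```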

jaywonchung commented 1 year ago

I like 1 better. This can also act like a resource usage cap for users to ensure fairness when multiple users use the same cluster.

jaywonchung commented 1 year ago

Just dumping a quick thought about MetricManager.

The current implementation of ZeusDataLoader keeps profiled metrics as class variables so that they can be shared between the train and eval dataloaders, but arguably this design is not the easiest to understand. Instead, maybe the FileBackedMetricManager class can keep JSON file(s) on the local filesystem. All reads and writes of throughput, power, etc. go to the file.

It'll be possible to create a write-back cache inside the metric manager and share the metric manager between the train and eval dataloaders by storing it as a class variable of ZeusDataLoader. Or, you may just not do this and have all reads/writes access the JSON file.

Rosie-m commented 1 year ago

I will create a MetricManager that stores and manages power and training metrics. The MetricManager will be fully file-backed, reading from and writing to the local filesystem. In Kubeflow mode, besides writing to the local filesystem, the MetricManager will also POST the metrics to ZeusServer, which is responsible for storing them in the DB.

This design will abstract the metric storage and reporting logic away from the current ZeusDataLoader.
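A first sketch of such a file-backed MetricManager; the file layout, method names, and report URL are placeholders rather than a settled interface.

```python
import json
from pathlib import Path

import httpx

class MetricManager:
    """Keeps power/throughput metrics in a local JSON file; in Kubeflow mode it also POSTs them to ZeusServer."""

    def __init__(self, log_dir: str, job_id: str, server_url: str | None = None) -> None:
        self.job_id = job_id
        self.server_url = server_url              # set only in Kubeflow mode
        self.path = Path(log_dir) / f"{job_id}_metrics.json"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        if not self.path.exists():
            self.path.write_text("[]")

    def record(self, value_type: str, phase: str, batch_size: int, power_limit: int, value: float) -> None:
        entry = {
            "job_id": self.job_id,
            "value_type": value_type,
            "phase": phase,
            "batch_size": batch_size,
            "power_limit": power_limit,
            "value": value,
        }
        # All writes go through the local JSON file, so train and eval dataloaders share state via the file.
        entries = json.loads(self.path.read_text())
        entries.append(entry)
        self.path.write_text(json.dumps(entries))
        # In Kubeflow mode, also report the metric to ZeusServer, which stores it in the DB.
        if self.server_url is not None:
            httpx.post(f"{self.server_url}/trials/report_profiling_result", json=entry)

    def read_all(self) -> list[dict]:
        # Reads also go through the file rather than in-memory class variables.
        return json.loads(self.path.read_text())
```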

mosharaf commented 1 year ago

What's the status of this? @Rosie-m @jaywonchung

jaywonchung commented 1 year ago

@Rosie-m will clean and push existing code in a branch and document progress.

Rosie-m commented 1 year ago

I have updated the issue content based on the design changes we discussed and added a summary of the current progress. The existing code can be found in the kubeflow branch. @jaywonchung @mosharaf

jaywonchung commented 1 year ago

Everything looks good! Thank you for your work @Rosie-m 👍