llhuii / sedna

AI toolkit over KubeEdge
Apache License 2.0

Add pod template like support #2

Open llhuii opened 3 years ago

llhuii commented 3 years ago

pod-template-like support for workers:

current state

the current spec definition of a worker:

 type WorkerSpec struct {
    ScriptDir        string     `json:"scriptDir"`
    ScriptBootFile   string     `json:"scriptBootFile"`
    FrameworkType    string     `json:"frameworkType"`
    FrameworkVersion string     `json:"frameworkVersion"`
    Parameters       []ParaSpec `json:"parameters"`
 }

 // ParaSpec is a description of a parameter
 type ParaSpec struct {
    Key   string `json:"key"`
    Value string `json:"value"`
 }
  1. ScriptDir/ScriptBootFile is the entrypoint of the worker, either a local path or central storage (e.g. S3).
  2. FrameworkType/FrameworkVersion selects the base container image of the worker.
  3. Parameters specifies the environment variables of the worker (see the sketch below).
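
For illustration, a minimal sketch of a populated spec under the current design (all values below are hypothetical):

 // The GM resolves FrameworkType/FrameworkVersion to a base image, fetches
 // the code from ScriptDir, runs ScriptBootFile as the entrypoint, and
 // exports each Parameters entry as an environment variable.
 worker := WorkerSpec{
    ScriptDir:        "s3://models/joint-inference/edge",
    ScriptBootFile:   "inference.py",
    FrameworkType:    "tensorflow",
    FrameworkVersion: "1.15",
    Parameters: []ParaSpec{
        {Key: "nms_threshold", Value: "0.6"},
    },
 }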

pros

  1. simple for demos

cons

  1. doesn't support docker/container capabilities: code version management, distribution, etc.
  2. doesn't support k8s-pod features: resource limits, user-defined volumes, etc.
  3. needs central storage (e.g. S3) for the code when it isn't a local path.
  4. needs a new base image whenever the current one can't satisfy the user's requirements (user-defined code package dependencies, or a new framework); the GM configuration must then be re-edited and the GM restarted.

proposals: Add pod template support for workers

proposal 1: just pod template

This would deprecate the current ScriptDir-based spec.

 import v1 "k8s.io/api/core/v1"

 type WorkerSpec struct {
    v1.PodTemplateSpec `json:",inline"`
 }
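
With this, the GM no longer needs to assemble containers from ScriptDir/FrameworkType itself. A minimal sketch of the controller side (an assumption, not the actual GM code; podFromWorker is a hypothetical helper):

 import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 )

 // podFromWorker copies the PodSpec out of the embedded PodTemplateSpec and
 // pins the resulting worker pod to the requested node.
 func podFromWorker(name, nodeName string, w WorkerSpec) *v1.Pod {
    pod := &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: name, Labels: w.Labels},
        Spec:       w.Spec, // the user-provided pod spec, taken verbatim
    }
    pod.Spec.NodeName = nodeName
    return pod
 }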

examples and discussions

joint-inference-service

so in this proposal, the joint-inference example here would become:

apiVersion: sedna.io/v1alpha1
kind: JointInferenceService
metadata:
  name: example
spec:
  edgeWorker:
    model:
      name: "small-model"
    nodeName: "edge0"
    hardExampleMining:
      name: "IBT"
    workerSpec:
      containers:
        - name: edge-inference-worker  # k8s requires a container name
          image: edge-inference-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: nms_threshold
              value: "0.6"
          ports:  # user defined ports
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            requests:
              memory: 64Mi
              cpu: 100m
            limits:
              memory: 512Mi
          volumeMounts:
            - name: localvideo
              mountPath: /data/
      volumes:   # user defined volumes
        - name: localvideo
          emptyDir: {}

  cloudWorker:
    model:
      name: "big-model"
    nodeName: "solar-corona-cloud"
    workerSpec:
      containers:
        - name: cloud-inference-worker  # k8s requires a container name
          image: cloud-inference-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: nms_threshold
              value: "0.6"
          ports:  # user defined ports
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            limits:
              memory: 2Gi

things that need discussion for the joint inference service:

  1. where do the resource limits of the model go? shared with the container resource limits?
  2. where is the serving container-side port of the cloudWorker specified?
  3. is the cloudWorker's workerSpec needed at all? the user may only want to specify the big model.

federated-learning-job

so in this proposal, the federated-learning example here would become:

apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    nodeName: "cloud0"
    # where's the serving port of aggregator worker
    workerSpec:
      containers:
        - name: aggregator-worker  # k8s requires a container name
          image: aggregator-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: exit_round
              value: "0.3"
          ports:
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            requests:
              memory: 64Mi
              cpu: 100m
            limits:
              memory: 512Mi

  trainingWorkers:
    - nodeName: "edge1"
      dataset:
        name: "edge-1-surface-defect-detection-dataset"
      workerSpec:
        containers:
          - name: training-worker  # k8s requires a container name
            image: training-worker:latest
            imagePullPolicy: Always
            env:  # user defined environments
              - name: batch_size
                value: "0.3"
              - name: learning_rate
                value: "0.001"
              - name: epochs
                value: "1"
            resources:  # user defined resources
              requests:
                memory: 64Mi
                cpu: 100m
              limits:
                memory: 512Mi

    - nodeName: "edge2"
      dataset:
        name: "edge-2-surface-defect-detection-dataset"
      workerSpec:
        containers:
          - name: training-worker  # k8s requires a container name
            image: training-worker:latest
            imagePullPolicy: Always
            env:  # user defined environments
              - name: batch_size
                value: "0.3"
              - name: learning_rate
                value: "0.001"
              - name: epochs
                value: "1"
            resources:  # user defined resources
              requests:
                memory: 64Mi
                cpu: 100m
              limits:
                memory: 512Mi

incremental-learning-job

the common problem:

  1. finding a good way to write the OpenAPI schema of the CRD, since podSpec has a lot of fields.

deployment support

using the features of Deployment:

  1. replicated pods, to recover from pod failures (see the sketch after the struct)

alternative: using a ReplicaSet


 type DeploymentSpec struct {
   Replicas *int32 `json:"replicas,omitempty"`
   Template WorkerSpec `json:"template"`
   // etc.
 }
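
A sketch of the intended replica semantics (an assumption mirroring k8s Deployment behavior, not actual GM code; desiredReplicas is a hypothetical helper):

 // With a nil Replicas pointer the controller falls back to 1 replica, the
 // same default a k8s Deployment uses; the GM would then recreate workers
 // until the live pod count matches this value.
 func desiredReplicas(spec DeploymentSpec) int32 {
    if spec.Replicas == nil {
        return 1
    }
    return *spec.Replicas
 }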

daemonset support

use case:

  1. running a federated-learning training worker on every node of a group (see the sketch after the struct).

 type DaemonsetSpec struct {
   Selector *metav1.LabelSelector `json:"selector"`
   Template WorkerSpec `json:"template"`
   // etc.
 }
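
A minimal sketch of daemonset-like placement under this spec (an assumption, not GM code; createWorker is a hypothetical helper):

 import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
 )

 // oneWorkerPerNode lists the nodes matched by the selector and starts one
 // training worker on each, mirroring DaemonSet semantics.
 func oneWorkerPerNode(ctx context.Context, cs kubernetes.Interface, spec DaemonsetSpec) error {
    sel, err := metav1.LabelSelectorAsSelector(spec.Selector)
    if err != nil {
        return err
    }
    nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: sel.String()})
    if err != nil {
        return err
    }
    for _, node := range nodes.Items {
        createWorker(node.Name, spec.Template) // hypothetical worker-creation helper
    }
    return nil
 }
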
llhuii commented 3 years ago
  1. finding a good way to write the OpenAPI schema of the CRD, since podSpec has a lot of fields.

solved by the kubebuilder generation tool (controller-gen).
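
For reference, a sketch of the kubebuilder markers involved (the exact type layout is an assumption based on the examples above); controller-gen expands the embedded pod template into the full OpenAPI v3 schema automatically:

 // Typical generation command:
 //   controller-gen crd paths="./..." output:crd:artifacts:config=config/crd
 // +kubebuilder:object:root=true
 type JointInferenceService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    // Spec would embed the WorkerSpec pod templates shown earlier.
    Spec JointInferenceServiceSpec `json:"spec"`
 }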