type WorkerSpec struct {
    ScriptDir        string     `json:"scriptDir"`
    ScriptBootFile   string     `json:"scriptBootFile"`
    FrameworkType    string     `json:"frameworkType"`
    FrameworkVersion string     `json:"frameworkVersion"`
    Parameters       []ParaSpec `json:"parameters"`
}
// ParaSpec is a description of a parameter.
type ParaSpec struct {
    Key   string `json:"key"`
    Value string `json:"value"`
}
ScriptDir/ScriptBootFile is the entrypoint of the worker, either a local path or central storage (e.g. s3).
FrameworkType/FrameworkVersion specify the base container image of the worker.
Parameters specifies the environment variables of the worker.
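For concreteness, a minimal sketch of a worker under the current spec (the values are illustrative, not from a real deployment):

// A hypothetical edge-inference worker under the current spec.
worker := WorkerSpec{
    ScriptDir:        "s3://example-bucket/edge-inference/", // code pulled from central storage
    ScriptBootFile:   "inference.py",                        // entrypoint inside ScriptDir
    FrameworkType:    "tensorflow",                          // together with the version,
    FrameworkVersion: "1.15",                                // this picks the base image
    Parameters: []ParaSpec{
        {Key: "nms_threshold", Value: "0.6"}, // exported as an env of the worker
    },
}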
pros
simple; good enough for a demo
cons
doesn't support docker-container capabilities: code version management, distribution, etc.
doesn't support k8s-pod-like features: resource limits, user-defined volumes, etc.
needs central storage (e.g. s3) for the code if it is not a local path.
needs a new base image to be built whenever the current one can't satisfy the user's
requirements (user-defined code package dependencies, or a new framework); the GM
configuration then has to be re-edited and the GM restarted.
proposals: Add pod template support for workers
proposal 1: just pod template
And deprecate the current ScriptDir and related fields.
import v1 "k8s.io/api/core/v1"

type WorkerSpec struct {
    v1.PodTemplateSpec `json:",inline"`
}
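A sketch of what the GM side could then do with this (the helper name is made up; the point is that the user's template is copied verbatim, so images, resources and volumes all come from the user rather than from a GM-built base image):

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newWorkerPod is a hypothetical GM helper: it stamps a worker pod directly
// out of the user-provided pod template instead of composing a base image.
func newWorkerPod(name string, spec WorkerSpec) *v1.Pod {
    return &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:   name,
            Labels: spec.Labels, // labels come from the template's ObjectMeta
        },
        Spec: *spec.Spec.DeepCopy(), // the user's PodSpec is used as-is
    }
}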
examples and discussions
joint-inference-service
so in this proposal, the joint-inference example here would be:
apiVersion: sedna.io/v1alpha1
kind: JointInferenceService
metadata:
  name: example
spec:
  edgeWorker:
    model:
      name: "small-model"
    nodeName: "edge0"
    hardExampleMining:
      name: "IBT"
    workerSpec:
      containers:
        - image: edge-inference-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: nms_threshold
              value: "0.6"
          ports:  # user defined ports
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            requests:
              memory: 64Mi
              cpu: 100m
            limits:
              memory: 512Mi
          volumeMounts:
            - name: localvideo
              mountPath: /data/
      volumes:  # user defined volumes
        - name: localvideo
          emptyDir: {}
  cloudWorker:
    model:
      name: "big-model"
    nodeName: "solar-corona-cloud"
    workerSpec:
      containers:
        - image: cloud-inference-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: nms_threshold
              value: "0.6"
          ports:  # user defined ports
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            limits:
              memory: 2Gi
things to discuss for the joint inference service:
where are the resource limits of the model? are they shared with the container resource limits?
where is the serving container-side port of the cloudWorker?
is the cloudWorker's workerSpec needed? the user may only specify the big model.
federated-learning-job
so in this proposal, the federated-learning example here would be:
apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    nodeName: "cloud0"
    # where's the serving port of the aggregator worker?
    workerSpec:
      containers:
        - image: aggregator-worker:latest
          imagePullPolicy: Always
          env:  # user defined environments
            - name: exit_round
              value: "0.3"
          ports:
            - containerPort: 80
              protocol: TCP
          resources:  # user defined resources
            requests:
              memory: 64Mi
              cpu: 100m
            limits:
              memory: 512Mi
  trainingWorkers:
    - nodeName: "edge1"
      dataset:
        name: "edge-1-surface-defect-detection-dataset"
      workerSpec:
        containers:
          - image: training-worker:latest
            imagePullPolicy: Always
            env:  # user defined environments
              - name: batch_size
                value: "0.3"
              - name: learning_rate
                value: "0.001"
              - name: epochs
                value: "1"
            resources:  # user defined resources
              requests:
                memory: 64Mi
                cpu: 100m
              limits:
                memory: 512Mi
    - nodeName: "edge2"
      dataset:
        name: "edge-2-surface-defect-detection-dataset"
      workerSpec:
        containers:
          - image: training-worker:latest
            imagePullPolicy: Always
            env:  # user defined environments
              - name: batch_size
                value: "0.3"
              - name: learning_rate
                value: "0.001"
              - name: epochs
                value: "1"
            resources:  # user defined resources
              requests:
                memory: 64Mi
                cpu: 100m
              limits:
                memory: 512Mi
incremental-learning-job
the common problem:
finding a good way to write the openapi validation schema of the CRD, since podSpec has a lot of fields.
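One possible way out, assuming the CRD schema is generated from the Go types (e.g. with controller-gen) rather than written by hand: either let the generator emit the full PodTemplateSpec schema, or mark the field so unknown fields are preserved and the huge sub-schema is skipped. A sketch:

type WorkerSpec struct {
    // +kubebuilder:pruning:PreserveUnknownFields
    v1.PodTemplateSpec `json:",inline"`
}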
deployment support
using the features of deployment:
replicated pods, so the worker survives pod failures
alternative: using replicaSet
type DeploymentSpec struct {
    Replicas *int32     `json:"replicas,omitempty"`
    Template WorkerSpec `json:"template"`
    // etc.
}
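For illustration (the replica count is arbitrary, and workerSpec stands for a value of the proposal-1 WorkerSpec):

replicas := int32(3)
dep := DeploymentSpec{
    Replicas: &replicas,  // keep 3 worker pods; failed pods get replaced
    Template: workerSpec, // the pod-template-based WorkerSpec from proposal 1
}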
daemonset support
use case:
running the training worker of federated learning on every node of a group.
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DaemonsetSpec struct {
    Selector *metav1.LabelSelector `json:"selector"`
    Template WorkerSpec            `json:"template"`
    // etc.
}
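A sketch of how this could be used, assuming the semantics mirror the k8s DaemonSet (Selector matches the managed pods, while the node group itself is picked by the template's nodeSelector; all label keys here are made up):

ds := DaemonsetSpec{
    Selector: &metav1.LabelSelector{
        MatchLabels: map[string]string{"app": "fl-training-worker"},
    },
    // the template's PodSpec.NodeSelector would restrict the workers
    // to the desired node group, e.g. {"node-group": "factory-1"}
    Template: workerSpec,
}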