kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Dynamic roles which can technically support any potential frameworks #144

Open Jeffwan opened 3 years ago

Jeffwan commented 3 years ago

Every framework's implementation is pretty close, and I am thinking we actually don't need that many controllers/operators. If we can support custom roles, most popular frameworks can adapt to them.

The major challenge is letting the controller know how to construct the cluster spec environment. If there's a way to represent it in an annotation, label, etc., that might be feasible. I am also open to other options.

zw0610 commented 3 years ago

I think it's a great idea to support custom roles. In my experience, there are many situations where we need to further extend the definition of roles. Moreover, we should not limit the customization to the pod environment. Instead, it might be a good idea to let users 'decorate' the pod template for each customized role.

Without changing the architecture of the current Kubeflow operator design too much, I would suggest the following approach:

  1. We can add a DecoratePod(template *corev1.PodTemplate, rtype commonv1.ReplicaType) method to PodReconcilerInterface and let ConstructPod (ReconcilerPod -> CreatePod -> ConstructPod) call DecoratePod just before it returns.
  2. When launching the manager, the user can specify a customization server address per ReplicaType, e.g. /opt/kubeflow/tf-operator.v1 --decorator CWorker,10.1.2.9:8080,PSX,/var/psx.sock, and this information is registered in the manager.
  3. BasePodReconciler (which implements the base functionality of PodReconcilerInterface) does nothing to the template *corev1.PodTemplate if the user has not specified a decorator for the corresponding ReplicaType; otherwise it calls the registered decorator server to update the pod template.
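
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from kubeflow/common: PodTemplate and ReplicaType are simplified stand-ins for corev1.PodTemplate and commonv1.ReplicaType, ParseDecoratorFlag and the Decorator function type are hypothetical names, and the call to the decorator server is replaced by a local placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// ReplicaType is a stand-in for commonv1.ReplicaType.
type ReplicaType string

// PodTemplate is a simplified stand-in for the Kubernetes pod template type;
// only the environment is modeled in this sketch.
type PodTemplate struct {
	Env map[string]string
}

// Decorator (hypothetical) mutates a pod template for one replica type. In the
// proposal this would be a call to an external decorator server.
type Decorator func(tmpl *PodTemplate, rtype ReplicaType) error

// ParseDecoratorFlag (hypothetical) parses "Type1,addr1,Type2,addr2" into a
// map from replica type to decorator server address.
func ParseDecoratorFlag(value string) (map[ReplicaType]string, error) {
	out := map[ReplicaType]string{}
	if strings.TrimSpace(value) == "" {
		return out, nil
	}
	parts := strings.Split(value, ",")
	if len(parts)%2 != 0 {
		return nil, fmt.Errorf("decorator flag needs type,address pairs, got %d fields", len(parts))
	}
	for i := 0; i < len(parts); i += 2 {
		rtype, addr := strings.TrimSpace(parts[i]), strings.TrimSpace(parts[i+1])
		if rtype == "" || addr == "" {
			return nil, fmt.Errorf("empty replica type or address in pair %d", i/2)
		}
		out[ReplicaType(rtype)] = addr
	}
	return out, nil
}

// BasePodReconciler holds the decorators registered when the manager starts.
type BasePodReconciler struct {
	decorators map[ReplicaType]Decorator
}

// DecoratePod leaves the template untouched when no decorator is registered
// for rtype; otherwise it delegates to the registered decorator.
func (r *BasePodReconciler) DecoratePod(tmpl *PodTemplate, rtype ReplicaType) error {
	dec, ok := r.decorators[rtype]
	if !ok {
		return nil
	}
	return dec(tmpl, rtype)
}

func main() {
	addrs, err := ParseDecoratorFlag("CWorker,10.1.2.9:8080,PSX,/var/psx.sock")
	if err != nil {
		panic(err)
	}
	r := &BasePodReconciler{decorators: map[ReplicaType]Decorator{}}
	for rtype, addr := range addrs {
		addr := addr // capture the per-iteration value
		r.decorators[rtype] = func(tmpl *PodTemplate, rt ReplicaType) error {
			// Placeholder for the remote call to the decorator server at addr.
			tmpl.Env["DECORATED_BY"] = addr
			return nil
		}
	}
	tmpl := &PodTemplate{Env: map[string]string{}}
	if err := r.DecoratePod(tmpl, "CWorker"); err != nil {
		panic(err)
	}
	fmt.Println(tmpl.Env["DECORATED_BY"])
}
```

Because DecoratePod is a no-op for unregistered replica types, existing operators would keep their current behavior unless a decorator is explicitly configured.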

Jeffwan commented 3 years ago

Yeah. I am wondering how we can insert the "clusterSpec" environment for different frameworks.

{
"worker": ["worker0.example.com:2222","worker1.example.com:2222","worker2.example.com:2222"],
"ps": ["ps0.example.com:2222","ps1.example.com:2222"]
}

Different frameworks have different settings for this part. The easiest way is to have some predefined templates in the code.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
  labels:
    # CustomJob can use this label to decide how it injects the environment.
    # We could even put the topology format here to further simplify the
    # controller's work, but that would be error-prone.
    framework: tensorflow
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
....
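
A minimal sketch of the "predefined templates" idea, selected by the framework label, might look like the following. Everything here is an assumption for illustration: the function and type names are hypothetical, the "<job>-<role>-<index>:<port>" host naming is made up, and the "hostlist" template with its HOSTS variable is invented to show a second framework shape. The TensorFlow template emits TF_CONFIG with the "cluster" and "task" fields that TensorFlow reads.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// ClusterSpec maps a lower-cased role name to its replica addresses.
type ClusterSpec map[string][]string

// BuildClusterSpec derives addresses from replica counts. The
// "<job>-<role>-<index>:<port>" naming is an assumption of this sketch.
func BuildClusterSpec(job string, replicas map[string]int, port int) ClusterSpec {
	spec := ClusterSpec{}
	for role, n := range replicas {
		key := strings.ToLower(role)
		hosts := make([]string, 0, n)
		for i := 0; i < n; i++ {
			hosts = append(hosts, fmt.Sprintf("%s-%s-%d:%d", job, key, i, port))
		}
		spec[key] = hosts
	}
	return spec
}

// EnvTemplate renders the environment for one replica of one framework.
type EnvTemplate func(spec ClusterSpec, role string, index int) (map[string]string, error)

// tfTask and tfConfig mirror the JSON layout TensorFlow reads from TF_CONFIG.
type tfTask struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type tfConfig struct {
	Cluster ClusterSpec `json:"cluster"`
	Task    tfTask      `json:"task"`
}

func tensorflowEnv(spec ClusterSpec, role string, index int) (map[string]string, error) {
	cfg := tfConfig{Cluster: spec, Task: tfTask{Type: strings.ToLower(role), Index: index}}
	raw, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}
	return map[string]string{"TF_CONFIG": string(raw)}, nil
}

// hostListEnv is a made-up template for a framework that only wants a flat,
// sorted host list; HOSTS is a hypothetical variable name.
func hostListEnv(spec ClusterSpec, role string, index int) (map[string]string, error) {
	var all []string
	for _, hosts := range spec {
		all = append(all, hosts...)
	}
	sort.Strings(all)
	return map[string]string{"HOSTS": strings.Join(all, ",")}, nil
}

// templates is the predefined, in-code registry keyed by the framework label.
var templates = map[string]EnvTemplate{
	"tensorflow": tensorflowEnv,
	"hostlist":   hostListEnv,
}

// EnvForJob picks a template from the job's "framework" label.
func EnvForJob(labels map[string]string, spec ClusterSpec, role string, index int) (map[string]string, error) {
	framework, ok := labels["framework"]
	if !ok {
		return nil, fmt.Errorf("job has no %q label", "framework")
	}
	tmpl, ok := templates[framework]
	if !ok {
		return nil, fmt.Errorf("no environment template for framework %q", framework)
	}
	return tmpl(spec, role, index)
}

func main() {
	spec := BuildClusterSpec("multi-worker", map[string]int{"Worker": 2, "PS": 1}, 2222)
	env, err := EnvForJob(map[string]string{"framework": "tensorflow"}, spec, "Worker", 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(env["TF_CONFIG"])
}
```

Keeping the templates in code means a new framework needs a controller release, which is the trade-off against encoding the topology format in the label itself.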