kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0
51 stars 73 forks source link

PR 135 makes replicaType updates hard #154

Open Jeffwan opened 3 years ago

Jeffwan commented 3 years ago

In this PR, https://github.com/kubeflow/common/pull/135/files, it changes rtype to ReplicaType.

However, it brings some challenges in operator upgrade.

-       expectationPodsKey := expectation.GenExpectationPodsKey(jobKey, rtype)
+       expectationPodsKey := expectation.GenExpectationPodsKey(jobKey, apiv1.ReplicaType(rtype))

Using this line as an example, controller retrieve rtype from pod label. It's lower case at this moment.

https://github.com/kubeflow/tf-operator/blob/ff5aaf10bd8a665cbce72dd2cd1d2c0cd06f329e/pkg/apis/tensorflow/v1/types.go#L77

https://github.com/kubeflow/tf-operator/blob/ff5aaf10bd8a665cbce72dd2cd1d2c0cd06f329e/pkg/apis/pytorch/v1/pytorchjob_types.go#L62

apiv1.ReplicaType(rtype) is inaccurate and meaningless because we define Upper case role like Master in the past,
and this conversion give us a non-exist ReplicaType because it's lower case master.

we have some references like below.

https://github.com/kubeflow/tf-operator/blob/ff5aaf10bd8a665cbce72dd2cd1d2c0cd06f329e/pkg/controller.v1/pytorch/pytorch.go#L46

I feel like all the role should be lower case, we make sure we don't use ReplicaType for comparison. Instead, still use strings.toLower(string(rtype)) which means PR 135 still not helpful?

/cc @MartinForReal @gaocegege @kubeflow/wg-training-leads

terrytangyuan commented 3 years ago

This indeed feel like unnecessary and complicate the implementations. Kubeflow/common should aim to be easy for downstream operators to implement.

MartinForReal commented 3 years ago

Maybe more helper functions would resolve comparison issue? I think we can add some type name convention by defining ReplicaType type and add more util functions.

Jeffwan commented 3 years ago

Maybe more helper functions would resolve comparison issue? I think we can add some type name convention by defining ReplicaType type and add more util functions.

@MartinForReal

Yeah, I add something like https://github.com/kubeflow/tf-operator/pull/1388/files#diff-7600abe3663b532d64fec9dae52fcb3b972fb034c53a45fe0030fc61776d1612R56

Did you use kubeflow/common in operator other than kubeflow training operators?