intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0
38 stars 205 forks source link

Inconsistencies between different installation methods #1061

Open eero-t opened 2 years ago

eero-t commented 2 years ago

I noticed some confusing inconsistencies when checking things under deployments folder and different installation methods...

GPU plugin is installed on different nodes depending on used method:

$ git grep feature.node.kubernetes.io | grep gpu
gpu_plugin/overlays/nfd_labeled_nodes/add-nodeselector-intel-gpu.yaml:        feature.node.kubernetes.io/pci-0300_8086.present: "true"
nfd/overlays/node-feature-rules/node-feature-rules-openshift.yaml:        "intel.feature.node.kubernetes.io/gpu": "true"
nfd/overlays/node-feature-rules/node-feature-rules.yaml:        "intel.feature.node.kubernetes.io/gpu": "true"
operator/samples/deviceplugin_v1_gpudeviceplugin.yaml:    intel.feature.node.kubernetes.io/gpu: "true"

IMHO at least the last two first options (relying on NFD), should be the same, not differ:

Whereas inconsistencies in namespace usage between installation methods could mean that user ends accidentally with multiple instances of the same thing running in different namespaces:

$ cd deployments; git grep namespace: | grep -v -e { -e "'" -e % -e _test
dsa_plugin/overlays/dsa_initcontainer/dsa-config.yaml:  namespace: inteldeviceplugins-system
dsa_plugin/overlays/namespace_kube-system/add-namespace-kube-system.yaml:  namespace: kube-system
fpga_admissionwebhook/base/manager_webhook_patch.yaml:  namespace: system
fpga_admissionwebhook/certmanager/certificate.yaml:  namespace: system
fpga_admissionwebhook/certmanager/certificate.yaml:  namespace: system
fpga_admissionwebhook/default/kustomization.yaml:namespace: intelfpgawebhook-system
fpga_admissionwebhook/manager/manager.yaml:  namespace: system
fpga_admissionwebhook/rbac/role_binding.yaml:  namespace: system
fpga_admissionwebhook/webhook/kustomizeconfig.yaml:namespace:
fpga_admissionwebhook/webhook/manifests.yaml:      namespace: system
fpga_admissionwebhook/webhook/service.yaml:  namespace: system
fpga_plugin/base/intel-fpga-plugin-daemonset.yaml:  namespace: system
fpga_plugin/base/role_binding.yaml:  namespace: system
fpga_plugin/overlays/af/kustomization.yaml:namespace: intelfpgaplugin-system
fpga_plugin/overlays/region/kustomization.yaml:namespace: intelfpgaplugin-system
fpga_plugin/overlays/region/mode-region.yaml:  namespace: system
gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  namespace: default
gpu_plugin/overlays/namespace_kube-system/add-namespace-kube-system.yaml:  namespace: kube-system
iaa_plugin/overlays/iaa_initcontainer/iaa-config.yaml:  namespace: inteldeviceplugins-system
nfd/overlays/node-feature-discovery/node-feature-discovery-openshift.yaml:  namespace: openshift-nfd
operator/certmanager/certificate.yaml:  namespace: system
operator/certmanager/certificate.yaml:  namespace: system
operator/crd/bases/deviceplugin.intel.com_dlbdeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_dsadeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_fpgadeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_gpudeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_iaadeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_qatdeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_sgxdeviceplugins.yaml:                  namespace:
operator/crd/kustomizeconfig.yaml:namespace:
operator/crd/patches/webhook_in_fpgadeviceplugins.yaml:        namespace: system
operator/crd/patches/webhook_in_gpudeviceplugins.yaml:        namespace: system
operator/crd/patches/webhook_in_qatdeviceplugins.yaml:        namespace: system
operator/default/kustomization.yaml:namespace: inteldeviceplugins-system
operator/default/manager_auth_proxy_patch.yaml:  namespace: system
operator/default/manager_webhook_patch.yaml:  namespace: system
operator/device/dlb/dlb.yaml:  namespace: inteldeviceplugins-system
operator/device/dsa/dsa.yaml:  namespace: inteldeviceplugins-system
operator/device/fpga/fpga.yaml:  namespace: inteldeviceplugins-system
operator/device/gpu/gpu.yaml:  namespace: inteldeviceplugins-system
operator/device/qat/qat.yaml:  namespace: inteldeviceplugins-system
operator/device/sgx/sgx.yaml:  namespace: inteldeviceplugins-system
operator/manager/manager.yaml:  namespace: system
operator/manifests/bases/intel-device-plugins-operator.clusterserviceversion.yaml:  namespace: placeholder
operator/rbac/auth_proxy_role_binding.yaml:  namespace: system
operator/rbac/auth_proxy_service.yaml:  namespace: system
operator/rbac/leader_election_role_binding.yaml:  namespace: system
operator/rbac/role_binding.yaml:  namespace: system
operator/webhook/kustomizeconfig.yaml:namespace:
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/service.yaml:  namespace: system
sgx_admissionwebhook/base/manager_webhook_patch.yaml:  namespace: system
sgx_admissionwebhook/certmanager/certificate.yaml:  namespace: system
sgx_admissionwebhook/certmanager/certificate.yaml:  namespace: system
sgx_admissionwebhook/manager/manager.yaml:  namespace: system
sgx_admissionwebhook/overlays/default-with-certmanager/kustomization.yaml:namespace: intelsgxwebhook-system
sgx_admissionwebhook/webhook/kustomizeconfig.yaml:namespace:
sgx_admissionwebhook/webhook/manifests.yaml:      namespace: system
sgx_admissionwebhook/webhook/service.yaml:  namespace: system
sgx_plugin/overlays/epc-register/kustomization.yaml:namespace: kube-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
vpu_plugin/overlays/namespace_kube-system/add-namespace-kube-system.yaml:  namespace: kube-system

For example, when using operator, GPU plugin goes to inteldeviceplugins-system ns, when using GPU plugin README apply method, it goes to default ns, and when using gpu_plugin/overlays/namespace_kube-system overlay, it goes to kube-system ns. Some of the other plugins, use different namespaces, but that's not consistent either.

GPU plugin also uses different service account & roles depending on installation method (this is from repo root):

$ git grep gpu-manager
Makefile:       $(CONTROLLER_GEN) rbac:roleName=gpu-manager-role paths="./cmd/gpu_plugin/..." output:dir=deployments/operator/rbac
charts/operator/templates/operator.yaml:  name: inteldeviceplugins-gpu-manager-role
deployments/operator/rbac/gpu_manager_role.yaml:  name: gpu-manager-role
pkg/controllers/gpu/controller.go:      serviceAccountName = "gpu-manager-sa"
pkg/controllers/gpu/controller.go:                              Name:      "gpu-manager-sa",
pkg/controllers/gpu/controller.go:                              Name:      "gpu-manager-rolebinding",
pkg/controllers/gpu/controller.go:                                      Name:      "gpu-manager-sa",
pkg/controllers/gpu/controller.go:                              Name:     "inteldeviceplugins-gpu-manager-role",

$ git grep resource-reader
cmd/gpu_plugin/README.md:serviceaccount/resource-reader-sa created
cmd/gpu_plugin/README.md:clusterrole.rbac.authorization.k8s.io/resource-reader created
cmd/gpu_plugin/README.md:clusterrolebinding.rbac.authorization.k8s.io/resource-reader-rb created
deployments/gpu_plugin/overlays/fractional_resources/add-serviceaccount.yaml:      serviceAccountName: resource-reader-sa
deployments/gpu_plugin/overlays/fractional_resources/kustomization.yaml:  - resource-reader-sa.yaml
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  name: resource-reader-rb
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  name: resource-reader-sa
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  name: resource-reader
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role.yaml:  name: resource-reader
deployments/gpu_plugin/overlays/fractional_resources/resource-reader-sa.yaml:  name: resource-reader-sa
eero-t commented 2 years ago

Tuomas fixed the deployment rule inconsistency in https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1169 and Ukri is fixing overlay object names inconsistency in https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1172.

As to remaining namespace inconsistency...

According to @ukri, it's better that manually installed GPU plugin, and operator operating automatically on k8s GPU plugin components, are in separate namespaces. But at least the namespace overlay and install doc example should use the same namespace.

Namespace can be given as kubectl/kustomize command line option. Maybe it's best if namespace overlay is deprecated (+ eventually removed), and install doc uses e.g. "intel-gpu" as the namespace in kubectl example?

(I suggested "intel-gpu" namespace to indicate that pods in it are Intel related components unlike generic stuff in "kube-system", which is used by the current namespace overlay.)

tkatila commented 1 year ago

I grepped the namespaces from deployments again:

$ grep -sr namespace: * | grep "namespace: ." | grep -v -e " system" -e inteldeviceplugins-system
fpga_admissionwebhook/default/kustomization.yaml:namespace: intelfpgawebhook-system
fpga_plugin/overlays/af/kustomization.yaml:namespace: intelfpgaplugin-system
fpga_plugin/overlays/region/kustomization.yaml:namespace: intelfpgaplugin-system
gpu_plugin/overlays/fractional_resources/gpu-manager-rolebinding.yaml:  namespace: default
nfd/overlays/node-feature-discovery/node-feature-discovery-openshift.yaml:  namespace: openshift-nfd
operator/manifests/bases/intel-device-plugins-operator.clusterserviceversion.yaml:  namespace: placeholder
sgx_admissionwebhook/overlays/default-with-certmanager/kustomization.yaml:namespace: intelsgxwebhook-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
sgx_plugin/overlays/epc-register/kustomization.yaml:namespace: kube-system
xpumanager_sidecar/kustomization.yaml:namespace: monitoring

Some suggestions:

eero-t commented 1 year ago

Suggestions look fine to me. @mythi ?