sustainable-computing-io / kepler-operator

Kepler Operator
Apache License 2.0

Exporter pod in CrashLoopBackOff / Kepler fails to start #76

Closed AdrianHammond closed 1 year ago

AdrianHammond commented 1 year ago

I installed version 0.4.2 of the Kepler Community Operator. Before installing the operator, I applied the cluster prerequisite manifest.

I created a Kepler instance in a kepler namespace that I created. The exporter pods are running, but Kepler does not start, and the pods eventually go into CrashLoopBackOff.

Below is a snippet from the container logs:

E0629 08:52:41.444006 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: unknown (get pods)
W0629 08:52:42.371497 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:kepler:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0629 08:52:42.371583 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:kepler:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0

I found that the kepler-clusterrole ClusterRole was missing the "pods" resource:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kepler-clusterrole
  uid: 2868b3c1-5797-4bf2-b20a-f395e3fd280d
  resourceVersion: '525973'
  creationTimestamp: '2023-06-29T08:16:21Z'
  managedFields:
    - manager: Mozilla
      operation: Update
      apiVersion: rbac.authorization.k8s.io/v1
      time: '2023-06-29T08:45:05Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:rules': {}
rules:
  - verbs:
      - get
      - watch
      - list
    apiGroups:
      - ''
    resources:
      - nodes/metrics
      - nodes/proxy
      - nodes/stats

I added the pods resource to the ClusterRole rules, and Kepler now starts:

I0629 09:06:37.882420 1 bcc_attacher.go:171] Successfully load eBPF module with option: [-DMAP_SIZE=10240 -DNUM_CPUS=12 -DSET_GROUP_ID]
I0629 09:06:37.916850 1 exporter.go:226] Started Kepler in 1.546074588s
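
For reference, the workaround described above can be sketched as the following rules section of kepler-clusterrole. This is an illustration only: it assumes "pods" was appended to the existing rule, since the thread does not show the exact edit, and the fix shipped in later operator releases may be structured differently.

```yaml
# Sketch of the edited rules section of kepler-clusterrole.
# Assumption: "pods" is appended to the existing rule; the exact edit is not shown in this thread.
rules:
  - verbs:
      - get
      - watch
      - list
    apiGroups:
      - ''
    resources:
      - nodes/metrics
      - nodes/proxy
      - nodes/stats
      - pods  # added so kepler-sa can get, list, and watch pods at cluster scope
```

With get, watch, and list granted on pods in the core API group, the "pods is forbidden" errors from watcher.go in the logs above should no longer occur.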

husky-parul commented 1 year ago

Thanks for reporting this. FYI, if you are installing kepler-operator, you don't need to apply the MachineConfigs. The operator takes care of that.

AdrianHammond commented 1 year ago

Thanks for letting me know that. A.

husky-parul commented 1 year ago

Related PR: https://github.com/sustainable-computing-io/kepler-operator/pull/77
Related Issue: https://github.com/sustainable-computing-io/kepler-operator/issues/75

rootfs commented 1 year ago

With the -kernel-source-dir option, which uses pre-installed kernel sources, we no longer need to install MachineConfigs.

rootfs commented 1 year ago

@AdrianHammond the message `User "system:serviceaccount:kepler:kepler-sa" cannot list resource "pods" in API group ""` is caused by the changes in https://github.com/sustainable-computing-io/kepler/pull/635. The upcoming operator release will pick up those Kepler changes.

husky-parul commented 1 year ago

@AdrianHammond the changes are available in the latest and 0.5.0 tags.

AdrianHammond commented 1 year ago

Thanks @husky-parul, I just tested it and it works. Thank you.