ocp-power-automation / rsct-operator

Apache License 2.0

Remove hard dependency on coreos #14

Closed mkumatag closed 1 week ago

mkumatag commented 2 months ago

I see the following NodeSelector label. Can we remove it so that this code also works in non-OpenShift environments?

https://github.com/ocp-power-automation/rsct-operator/blob/d6a985ad4f41d73e99662e146bfce849713c4ea9/internal/controller/daemonset.go#L112

varad-ahirwadkar commented 1 month ago

I have removed the node selector node.openshift.io/os_id: rhcos. To verify the change, I deployed a Kubernetes cluster on a CentOS 9 node. The operator deployed successfully, but after I created the custom resource, the RSCT container did not reach the Running state because of a startup probe failure.

mkumatag commented 1 month ago

What exact error are we hitting?

varad-ahirwadkar commented 1 month ago

Describing the pod only shows: Startup probe failed

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Normal   Pulling    14m (x154 over 26h)     kubelet  Pulling image "quay.io/powercloud/rsct-ppc64le:latest"
  Warning  Unhealthy  4m34s (x9239 over 26h)  kubelet  Startup probe failed:

The rmcdomainstatus -s ctrmc -a IP command is not working inside the container; it prints no output:

# kubectl exec -it rsct-fmgxw /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
sh-4.4# rmcdomainstatus
Usage:
    rmcdomainstatus [-h host] [-a argument] -g group_name
    rmcdomainstatus [-h host] [-a argument] -s subsystem_name
    rmcdomainstatus [-h host] [-a argument] -p subsystem_pid
sh-4.4# rmcdomainstatus -s ctrmc -a IP
sh-4.4# rmcdomainstatus -s ctrmc

From the host:

# rmcdomainstatus -s ctrmc -a IP

Management Domain Status: Management Control Points
  I A  0xcfba059f52e66b9d  0001  10.20.27.38 (C)
  I A  0xf8d41a4d2fe3c5ef  0002  10.20.27.35 (C)
  I A  0xe394ab4ad6443688  0003  10.20.27.105 (C)

mkumatag commented 1 month ago

Can you please cross-check against the spec that is shipped here: https://github.com/ocp-power-automation/ocp4-playbooks/blob/cccf79fbe1f1938ae84ead9e4f71bd7093d2a4ec/playbooks/roles/ocp-customization/tasks/powervm_rmc.yaml#L29

varad-ahirwadkar commented 1 month ago

There is no startupProbe or livenessProbe in the spec at powervm_rmc.yaml#L29. Without these probes, the deployment succeeds on both OCP and Kubernetes clusters.

OCP cluster:

# oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
rsct-22bkc                                         1/1     Running   0          31m
rsct-b746x                                         1/1     Running   0          31m
rsct-operator-controller-manager-b6756b47f-2rf4h   2/2     Running   0          33m
rsct-qp7mv                                         1/1     Running   0          31m
rsct-vmw7x                                         1/1     Running   0          31m
rsct-x687z                                         1/1     Running   0          31m

k8s cluster (OS: CentOS Stream 9):

[root@rdr-varad-k8s ~]# kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
rsct-cchcf                                         1/1     Running   0          78s
rsct-operator-controller-manager-b6756b47f-s9pcc   2/2     Running   0          113s

mkumatag commented 1 month ago

Okay, then we need to debug further and understand why these commands aren't executing properly in the container. Use strace to check what the command calls and which call is failing.
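A minimal sketch of the strace step, assuming strace is installed in the RSCT container image and /tmp is writable there (neither is confirmed in this thread). failed_syscalls is a hypothetical helper, not part of RSCT or the operator; it reduces a trace file to the calls that returned -1 with an errno.

```shell
#!/bin/sh
# Hypothetical helper: print only the syscalls in a strace log that failed.
# strace records a failure as "... = -1 ENAME (description)".
failed_syscalls() {
    grep -E ' = -1 E[A-Z0-9]+ ' "$1"
}

# Intended use inside the pod (assumes strace is available in the image):
#   strace -f -o /tmp/rmcdomainstatus.trace rmcdomainstatus -s ctrmc -a IP
#   failed_syscalls /tmp/rmcdomainstatus.trace
```

Here -f follows child processes and -o writes the trace to a file, so the trace stays separate from the command's own (empty) output.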

varad-ahirwadkar commented 3 weeks ago

On CentOS or RHEL, RSCT is already running on the host, so it needs to be reset with /opt/rsct/install/bin/recfgct to switch it from the host to the container.

varad-ahirwadkar commented 2 weeks ago

I was able to run the RSCT process inside the pod after stopping the RSCT process on the CentOS host.

  1. Stop the RSCT process on the host

    # rmcdomainstatus -s ctrmc -a IP

    Management Domain Status: Management Control Points
      I A  0xf8d41a4d2fe3c5ef  0001  10.20.27.35 (C)
      I A  0xcfba059f52e66b9d  0002  10.20.27.38 (C)
      I A  0xe394ab4ad6443688  0003  10.20.27.105 (C)

    # /usr/sbin/rsct/bin/rmcctrl -z    (stops the RSCT daemons)

    # rmcdomainstatus -s ctrmc -a IP

    0513-036 The request could not be passed to the ctrmc subsystem. Start the subsystem and try your command again.


  2. Deploy RSCT operator

    # kubectl get pods

    NAME                                                READY   STATUS    RESTARTS   AGE
    rsct-g8hqb                                          1/1     Running   0          6m35s
    rsct-operator-controller-manager-5cfcc58887-jzl9p   2/2     Running   0          79s


  3. Check that RSCT is running in the pod

    # kubectl exec -it rsct-g8hqb /bin/sh

    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    sh-4.4# rmcdomainstatus -s ctrmc -a IP

    Management Domain Status: Management Control Points
      I A  0xe394ab4ad6443688  0003  10.20.27.105 (C)
      I A  0xf8d41a4d2fe3c5ef  0002  10.20.27.35 (C)
      I A  0xcfba059f52e66b9d  0001  10.20.27.38 (C)
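The host-side part of these steps could be scripted roughly as follows. This is only a sketch: rmc_domain_active and stop_host_rsct are hypothetical helper names, and the check assumes that an active ctrmc subsystem always prints a "Management Domain Status:" line, as in the outputs in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed when rmcdomainstatus output (passed as $1)
# shows an active management domain.
rmc_domain_active() {
    printf '%s\n' "$1" | grep -q '^Management Domain Status:'
}

# Hypothetical helper: stop the host RSCT daemons if ctrmc still answers,
# then confirm that the host no longer reports a management domain.
stop_host_rsct() {
    status=$(rmcdomainstatus -s ctrmc -a IP 2>&1)
    if rmc_domain_active "$status"; then
        /usr/sbin/rsct/bin/rmcctrl -z   # stops the RSCT daemons (step 1 above)
        status=$(rmcdomainstatus -s ctrmc -a IP 2>&1)
    fi
    # Return non-zero if the host subsystem is still active.
    ! rmc_domain_active "$status"
}
```

Running stop_host_rsct as root on the host before deploying the operator would cover step 1. The same rmc_domain_active check can be applied to the output of rmcdomainstatus run inside the pod to cover step 3.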