samba-in-kubernetes / samba-operator

An operator for Samba as a service on PVCs in Kubernetes

Add liveness (or readiness) probe for ctdb container. #135

Closed: gd closed this issue 2 years ago

gd commented 3 years ago

Sachin wants to research this (he will get in touch with John about it).

spuiuk commented 2 years ago

Notes:

# Look up existing probes
$ kubectl edit sts smbshare3
..

..
# No probes defined for ctdb container
      - args:
        - run
        - ctdbd
        - --setup=smb_ctdb
        - --setup=ctdb_config
        - --setup=ctdb_etc
        - --setup=ctdb_nodes
        env:
        - name: SAMBA_CONTAINER_ID
          value: smbshare3
        - name: SAMBACC_CONFIG
          value: /etc/container-config/config.json
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SAMBACC_CTDB
          value: ctdb-is-experimental
        image: quay.io/samba.org/samba-server:latest
        imagePullPolicy: Always
        name: ctdb
        resources: {}
..
# Both liveness and readiness probe defined for smbd container
      - args:
        - run
        - smbd
        - --setup=users
        - --setup=smb_ctdb
        env:
        - name: SAMBA_CONTAINER_ID
          value: smbshare3
        - name: SAMBACC_CONFIG
          value: /etc/container-config/config.json
        image: quay.io/samba.org/samba-server:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 445
          timeoutSeconds: 1
        name: samba
        ports:
        - containerPort: 445
          name: smb
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 445
          timeoutSeconds: 1
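
For reference, the smbd probes above only check that TCP port 445 accepts connections. ctdbd listens on its own TCP port (4379 by default), so a similar tcpSocket probe is conceivable for the ctdb container, but it would only confirm that the daemon is up, not that the node is healthy. A minimal sketch, assuming the default ctdb port:

        # Hypothetical: only verifies ctdbd accepts connections on its
        # default node port, not that the cluster node is healthy.
        livenessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 4379
          timeoutSeconds: 1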
# Log in to the ctdb container in one of the clustered pods
[sprabhu@fedora bin]$ kubectl exec -it smbshare3-0 -c ctdb -- /bin/bash

# Process list inside the ctdb container.
[root@smbshare3-0 /]# ps -fax
    PID TTY      STAT   TIME COMMAND
    533 pts/0    Ss     0:00 /bin/bash
    635 pts/0    R+     0:00  \_ ps -fax
     89 ?        Ss     0:00 /usr/sbin/smbd --foreground --log-stdout --no-process-group
    105 ?        S      0:00  \_ /usr/sbin/smbd --foreground --log-stdout --no-process-group
    106 ?        S      0:00  \_ /usr/sbin/smbd --foreground --log-stdout --no-process-group
     83 ?        Ss     0:00 /usr/bin/python3 /usr/local/bin/samba-container ctdb-manage-nodes --hostname=smbshare3-0 --take-node-number-from-hostname=after-last-dash
     39 ?        SLs    0:03 /usr/sbin/ctdbd --interactive
     45 ?        S      0:00  \_ /usr/libexec/ctdb/ctdb-eventd -P 39 -S 9
     81 ?        S      0:00  \_ /usr/sbin/ctdbd --interactive
     94 ?        S      0:00      \_ /usr/libexec/ctdb/ctdb_mutex_fcntl_helper /var/lib/ctdb/shared/RECOVERY
      1 ?        Ss     0:00 /pause
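
Since ctdbd runs inside this container, an exec probe can query it directly. A lightweight check would be ctdb ping, which only verifies that the daemon responds and exits non-zero otherwise; a sketch (untested here):

        # Hypothetical: passes as long as the local ctdbd answers.
        livenessProbe:
          exec:
            command: ["ctdb", "ping"]
          failureThreshold: 3
          periodSeconds: 10
          timeoutSeconds: 1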


spuiuk commented 2 years ago

Test 1

Set a readinessProbe for the ctdb container in the following manner:

        readinessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
          exec:
            command:
            - /bin/sh
            - -c
            - "ctdb nodestatus |grep 'OK (THIS NODE)'"

Exec into the ctdb container in one of the pods and administratively disable the node:

[sprabhu@fedora bin]$ kubectl exec -it smbshare3-0 -c ctdb -- /bin/bash
[root@smbshare3-0 /]# ctdb nodestatus
pnn:0 10.244.1.37      OK (THIS NODE)
[root@smbshare3-0 /]# ctdb disable
[root@smbshare3-0 /]# ctdb nodestatus
pnn:0 10.244.1.37      DISABLED (THIS NODE)

We see the following effect in the cluster:

[sprabhu@fedora tests]$ kubectl get pods -w 
NAME                               READY   STATUS    RESTARTS   AGE
samba-ad-server-86b7dd9856-m46sh   1/1     Running   0          43h
smbshare3-0                        3/3     Running   0          28m
smbshare3-1                        3/3     Running   0          31m
smbshare3-0                        2/3     Running   0          28m

smbshare3-0 goes from READY 3/3 to 2/3.

[sprabhu@fedora tests]$ kubectl describe pod smbshare3-0
..
  Warning  Unhealthy  43s (x120 over 45m)  kubelet            Readiness probe failed:
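
The node can be re-enabled to reverse this, after which the probe passes again and the pod returns to 3/3:

[root@smbshare3-0 /]# ctdb enable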

At this point, the Service in front of the share should have stopped routing requests to the pod. However, a failing readiness probe does not restart the container automatically; that requires a liveness probe instead.
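
A liveness probe built on the same check would have the kubelet restart the container when the node stays unhealthy. A sketch, reusing the command from the readiness probe above (a higher failureThreshold might be worth considering so that transient recovery states do not cause restart loops):

        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ctdb nodestatus | grep 'OK (THIS NODE)'"
          failureThreshold: 6
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1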

spuiuk commented 2 years ago

From the ctdb man page, the node status can be any of the following:

       OK
           This node is healthy and fully functional. It hosts public addresses to provide services.

       DISCONNECTED
           This node is not reachable by other nodes via the private network. It is not currently participating in the cluster. It does not host public
           addresses to provide services. It might be shut down.

       DISABLED
           This node has been administratively disabled. This node is partially functional and participates in the cluster. However, it does not host
           public addresses to provide services.

       UNHEALTHY
           A service provided by this node has failed a health check and should be investigated. This node is partially functional and participates in
           the cluster. However, it does not host public addresses to provide services. Unhealthy nodes should be investigated and may require an
           administrative action to rectify.

       BANNED
           CTDB is not behaving as designed on this node. For example, it may have failed too many recovery attempts. Such nodes are banned from
           participating in the cluster for a configurable time period before they attempt to rejoin the cluster. A banned node does not host public
           addresses to provide services. All banned nodes should be investigated and may require an administrative action to rectify.

       STOPPED
           This node has been administratively excluded from the cluster. A stopped node does not participate in the cluster and does not host public
           addresses to provide services. This state can be used while performing maintenance on a node.

       PARTIALLYONLINE
           A node that is partially online participates in a cluster like a healthy (OK) node. Some interfaces to serve public addresses are down, but at
           least one interface is up. See also ctdb ifaces.
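
Given these states, a readiness check should presumably treat OK (and arguably PARTIALLYONLINE, since such a node still hosts public addresses) as ready and everything else as not ready. A sketch of a slightly broader match than the test above; the exact nodestatus output format, and whether its exit status already encodes health, should be verified against the ctdb version shipped in the image:

        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ctdb nodestatus | grep 'THIS NODE' | grep -Eq 'OK|PARTIALLYONLINE'"
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1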