samba-in-kubernetes / samba-operator

An operator for Samba as a service on PVCs in Kubernetes

Add liveness (or readiness) probe for ctdb container. #135

Closed: gd closed this issue 2 years ago

gd commented 3 years ago

Sachin wants to research this (he will get in touch with John about it).

spuiuk commented 2 years ago

Notes:

# Look up existing probes
$ kubectl edit sts smbshare3
..

..
# No probes defined for ctdb container
      - args:
        - run
        - ctdbd
        - --setup=smb_ctdb
        - --setup=ctdb_config
        - --setup=ctdb_etc
        - --setup=ctdb_nodes
        env:
        - name: SAMBA_CONTAINER_ID
          value: smbshare3
        - name: SAMBACC_CONFIG
          value: /etc/container-config/config.json
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SAMBACC_CTDB
          value: ctdb-is-experimental
        image: quay.io/samba.org/samba-server:latest
        imagePullPolicy: Always
        name: ctdb
        resources: {}
..
# Both liveness and readiness probe defined for smbd container
      - args:
        - run
        - smbd
        - --setup=users
        - --setup=smb_ctdb
        env:
        - name: SAMBA_CONTAINER_ID
          value: smbshare3
        - name: SAMBACC_CONFIG
          value: /etc/container-config/config.json
        image: quay.io/samba.org/samba-server:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 445
          timeoutSeconds: 1
        name: samba
        ports:
        - containerPort: 445
          name: smb
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 445
          timeoutSeconds: 1
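
For reference, the smbd probes above only check that TCP port 445 accepts connections. ctdbd listens on its own TCP port (4379 by default), so a similar tcpSocket probe is conceivable for the ctdb container, but it would only confirm that the daemon is up, not that the node is healthy. A minimal sketch, assuming the default ctdb port:

        # Hypothetical: only verifies ctdbd accepts connections on its
        # default node port, not that the cluster node is healthy.
        livenessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 4379
          timeoutSeconds: 1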
# Log in to the ctdb container in one of the clustered pods
[sprabhu@fedora bin]$ kubectl exec -it smbshare3-0 -c ctdb -- /bin/bash

# Process list inside the ctdb container.
[root@smbshare3-0 /]# ps -fax
    PID TTY      STAT   TIME COMMAND
    533 pts/0    Ss     0:00 /bin/bash
    635 pts/0    R+     0:00  \_ ps -fax
     89 ?        Ss     0:00 /usr/sbin/smbd --foreground --log-stdout --no-process-group
    105 ?        S      0:00  \_ /usr/sbin/smbd --foreground --log-stdout --no-process-group
    106 ?        S      0:00  \_ /usr/sbin/smbd --foreground --log-stdout --no-process-group
     83 ?        Ss     0:00 /usr/bin/python3 /usr/local/bin/samba-container ctdb-manage-nodes --hostname=smbshare3-0 --take-node-number-from-hostname=after-last-dash
     39 ?        SLs    0:03 /usr/sbin/ctdbd --interactive
     45 ?        S      0:00  \_ /usr/libexec/ctdb/ctdb-eventd -P 39 -S 9
     81 ?        S      0:00  \_ /usr/sbin/ctdbd --interactive
     94 ?        S      0:00      \_ /usr/libexec/ctdb/ctdb_mutex_fcntl_helper /var/lib/ctdb/shared/RECOVERY
      1 ?        Ss     0:00 /pause
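
Since ctdbd runs inside this container, an exec probe can query it directly. A lightweight check would be ctdb ping, which only verifies that the daemon responds and exits non-zero otherwise; a sketch (untested here):

        # Hypothetical: passes as long as the local ctdbd answers.
        livenessProbe:
          exec:
            command: ["ctdb", "ping"]
          failureThreshold: 3
          periodSeconds: 10
          timeoutSeconds: 1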


spuiuk commented 2 years ago

Test 1

Set a readinessProbe for the ctdb container in the following manner:

        readinessProbe:
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
          exec:
            command:
            - /bin/sh
            - -c
            - "ctdb nodestatus |grep 'OK (THIS NODE)'"

Exec into the ctdb container in one of the pods and administratively disable the node:

[sprabhu@fedora bin]$ kubectl exec -it smbshare3-0 -c ctdb -- /bin/bash
[root@smbshare3-0 /]# ctdb nodestatus
pnn:0 10.244.1.37      OK (THIS NODE)
[root@smbshare3-0 /]# ctdb disable
[root@smbshare3-0 /]# ctdb nodestatus
pnn:0 10.244.1.37      DISABLED (THIS NODE)

We see the following effect in the cluster:

[sprabhu@fedora tests]$ kubectl get pods -w 
NAME                               READY   STATUS    RESTARTS   AGE
samba-ad-server-86b7dd9856-m46sh   1/1     Running   0          43h
smbshare3-0                        3/3     Running   0          28m
smbshare3-1                        3/3     Running   0          31m
smbshare3-0                        2/3     Running   0          28m

smbshare3-0 goes from READY 3/3 to 2/3.

[sprabhu@fedora tests]$ kubectl describe pod smbshare3-0
..
  Warning  Unhealthy  43s (x120 over 45m)  kubelet            Readiness probe failed:
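
The node can be re-enabled to reverse this, after which the probe passes again and the pod returns to 3/3:

[root@smbshare3-0 /]# ctdb enable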

At this point, the Service in front of the share should have stopped routing requests to the pod. However, a failing readiness probe does not restart the container automatically; that requires a liveness probe instead.
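
A liveness probe built on the same check would have the kubelet restart the container when the node stays unhealthy. A sketch, reusing the command from the readiness probe above (a higher failureThreshold might be worth considering so that transient recovery states do not cause restart loops):

        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ctdb nodestatus | grep 'OK (THIS NODE)'"
          failureThreshold: 6
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1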

spuiuk commented 2 years ago

From the ctdb man page, the node status can be any of the following:

       OK
           This node is healthy and fully functional. It hosts public addresses to provide services.

       DISCONNECTED
           This node is not reachable by other nodes via the private network. It is not currently participating in the cluster. It does not host public
           addresses to provide services. It might be shut down.

       DISABLED
           This node has been administratively disabled. This node is partially functional and participates in the cluster. However, it does not host
           public addresses to provide services.

       UNHEALTHY
           A service provided by this node has failed a health check and should be investigated. This node is partially functional and participates in
           the cluster. However, it does not host public addresses to provide services. Unhealthy nodes should be investigated and may require an
           administrative action to rectify.

       BANNED
           CTDB is not behaving as designed on this node. For example, it may have failed too many recovery attempts. Such nodes are banned from
           participating in the cluster for a configurable time period before they attempt to rejoin the cluster. A banned node does not host public
           addresses to provide services. All banned nodes should be investigated and may require an administrative action to rectify.

       STOPPED
           This node has been administratively excluded from the cluster. A stopped node does not participate in the cluster and does not host public
           addresses to provide services. This state can be used while performing maintenance on a node.

       PARTIALLYONLINE
           A node that is partially online participates in a cluster like a healthy (OK) node. Some interfaces to serve public addresses are down, but at
           least one interface is up. See also ctdb ifaces.
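
Given these states, a readiness check should presumably treat OK (and arguably PARTIALLYONLINE, since such a node still hosts public addresses) as ready and everything else as not ready. A sketch of a slightly broader match than the test above; the exact nodestatus output format, and whether its exit status already encodes health, should be verified against the ctdb version shipped in the image:

        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ctdb nodestatus | grep 'THIS NODE' | grep -Eq 'OK|PARTIALLYONLINE'"
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1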