stackabletech / agent

Stackable Agent - a kubelet written in Rust which uses systemd as its backend
Apache License 2.0

Remove systemd units without a corresponding pod #312

Closed: siegfriedweber closed this 3 years ago

siegfriedweber commented 3 years ago

Description

On startup, the systemd units in the system-stackable slice are compared to the pods assigned to this node. A systemd unit that matches its pod specification is kept, and the Stackable Agent takes ownership of it again at a later stage. A systemd unit is removed if it has no corresponding pod, if the corresponding pod is terminating, or if it differs from the pod specification. In the last case the Stackable Agent creates a new systemd unit afterwards.
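
The decision made for each unit can be sketched as a small shell function. This is only an illustration of the rules described here and of the reasons that appear in the log output, not the agent's actual Rust implementation. The function name `decide_unit_action`, its arguments, and the order in which the checks run are assumptions.

```shell
#!/bin/sh

# Hypothetical helper: decides what happens to a unit found at startup.
#   $1: pod state, one of "none", "terminating" or "present"
#   $2: unit content expected from the pod specification
#   $3: unit content actually found on disk
decide_unit_action() {
    pod_state=$1
    expected=$2
    actual=$3

    case "$pod_state" in
        none)
            echo "remove: no corresponding pod exists"
            ;;
        terminating)
            echo "remove: the corresponding pod is terminating"
            ;;
        present)
            # Only a unit that matches the pod specification is kept.
            if [ "$expected" = "$actual" ]; then
                echo "keep: a corresponding pod exists"
            else
                echo "remove: differs from the pod specification"
            fi
            ;;
        *)
            echo "unknown pod state: $pod_state" >&2
            return 1
            ;;
    esac
}
```

For example, `decide_unit_action present "$expected" "$actual"` prints a `keep` line only when both contents are identical. A removed unit whose pod still exists is recreated later by the agent's normal pod handling.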

Closes #180

Test

This change cannot be tested with the agent-integration-tests because the systemd units must be prepared before the Stackable Agent is started, which is not possible via the Kubernetes API. It must therefore be tested manually.

The following script can be used for manual testing:

#!/bin/sh

setup_stackable_repository() {
echo Setup Stackable repository

echo -n "
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: repositories.stable.stackable.de
spec:
  group: stable.stackable.de
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                repo_type:
                  type: string
                properties:
                  type: object
                  additionalProperties:
                    type: string
  scope: Namespaced
  names:
    plural: repositories
    singular: repository
    kind: Repository
    shortNames:
    - repo
" | kubectl apply -f -

echo -n "
apiVersion: stable.stackable.de/v1
kind: Repository
metadata:
  name: integration-test-repository
  namespace: default
spec:
  repo_type: StackableRepo
  properties:
    url: https://raw.githubusercontent.com/stackabletech/integration-test-repo/main/
" | kubectl apply -f -
}

setup_unit_with_pod() {
echo Setup unit with pod

echo -n "[Unit]
Description=default-cleanup-test-ok-noop-service
StartLimitIntervalSec=0

[Service]
Environment=\"KUBECONFIG=/root/.kube/config\"
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
RemainAfterExit=no
Restart=always
RestartSec=2
Slice=system-stackable.slice
StandardError=journal
StandardOutput=journal
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-ok-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-ok-noop-service.service
systemctl start default-cleanup-test-ok-noop-service.service

echo "
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-test-ok
spec:
  containers:
    - name: noop-service
      image: noop-service:1.0.0
      command:
        - noop-service-1.0.0/start.sh
  nodeName: localhost
  nodeSelector:
    kubernetes.io/arch: stackable-linux
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: stackable-linux
" | kubectl apply -f -
}

setup_unit_without_pod() {
echo Setup unit without pod

echo -n "[Unit]
Description=default-cleanup-test-no-pod-noop-service
StartLimitIntervalSec=0

[Service]
Environment=\"KUBECONFIG=/root/.kube/config\"
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
RemainAfterExit=no
Restart=always
RestartSec=2
Slice=system-stackable.slice
StandardError=journal
StandardOutput=journal
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-no-pod-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-no-pod-noop-service.service
systemctl start default-cleanup-test-no-pod-noop-service.service
}

setup_unit_with_unexpected_content() {
echo Setup unit with unexpected content

echo -n "[Unit]
Description=You did not expect this, did you?

[Service]
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
Slice=system-stackable.slice

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-unexpected-content-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-unexpected-content-noop-service.service
systemctl start default-cleanup-test-unexpected-content-noop-service.service

echo "
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-test-unexpected-content
spec:
  containers:
    - name: noop-service
      image: noop-service:1.0.0
      command:
        - noop-service-1.0.0/start.sh
  nodeName: localhost
  nodeSelector:
    kubernetes.io/arch: stackable-linux
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: stackable-linux
" | kubectl apply -f -
}

setup_unit_with_terminating_pod() {
echo Setup unit with terminating pod

echo -n "[Unit]
Description=default-cleanup-test-terminating-noop-service
StartLimitIntervalSec=0

[Service]
Environment=\"KUBECONFIG=/root/.kube/config\"
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
RemainAfterExit=no
Restart=always
RestartSec=2
Slice=system-stackable.slice
StandardError=journal
StandardOutput=journal
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-terminating-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-terminating-noop-service.service
systemctl start default-cleanup-test-terminating-noop-service.service

echo "
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-test-terminating
spec:
  containers:
    - name: noop-service
      image: noop-service:1.0.0
      command:
        - noop-service-1.0.0/start.sh
  nodeName: localhost
  nodeSelector:
    kubernetes.io/arch: stackable-linux
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: stackable-linux
" | kubectl apply -f -

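# Run the deletion in the background: without a running Stackable Agent the
# pod stays in the Terminating state and kubectl would block.
kubectl delete pod cleanup-test-terminating &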
}

setup_stackable_repository
setup_unit_with_pod
setup_unit_without_pod
setup_unit_with_unexpected_content
setup_unit_with_terminating_pod

The log output of the Stackable Agent should be:

[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-unexpected-content-noop-service.service] will be removed because it differs from the corresponding pod specification.
    expected content:
    [Unit]
    Description=default-cleanup-test-unexpected-content-noop-service
    StartLimitIntervalSec=0

    [Service]
    Environment="KUBECONFIG=/root/.kube/config"
    ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
    RemainAfterExit=no
    Restart=always
    RestartSec=2
    Slice=system-stackable.slice
    StandardError=journal
    StandardOutput=journal
    TimeoutStopSec=30

    [Install]
    WantedBy=multi-user.target

    actual content:
    [Unit]
    Description=You did not expect this, did you?

    [Service]
    ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
    Slice=system-stackable.slice

    [Install]
    WantedBy=multi-user.target
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-terminating-noop-service.service] will be removed because the corresponding pod is terminating.
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-ok-noop-service.service] will be kept because a corresponding pod exists.
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-no-pod-noop-service.service] will be removed because no corresponding pod exists.
[2021-09-24T11:26:39Z INFO  warp::server] TlsServer::run; addr=127.0.0.1:3000
[2021-09-24T11:26:39Z INFO  warp::server] listening on https://127.0.0.1:3000
[2021-09-24T11:26:39Z INFO  krator::runtime] Got a watch restart. Resyncing queue...
[2021-09-24T11:26:39Z INFO  krator::runtime] Finished resync of objects.
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Looking for package: noop-service:1.0.0 in known repositories
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Package noop-service:1.0.0 has already been downloaded to "/opt/stackable/packages/_download", continuing with installation
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Looking for package: noop-service:1.0.0 in known repositories
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Package noop-service:1.0.0 has already been downloaded to "/opt/stackable/packages/_download", continuing with installation
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::installing] Package noop-service:1.0.0 has already been installed
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::installing] Package noop-service:1.0.0 has already been installed
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::creating_service] Creating service unit for service default-cleanup-test-ok
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::creating_service] Creating service unit for service default-cleanup-test-unexpected-content
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::starting] Starting systemd unit [default-cleanup-test-unexpected-content-noop-service.service]
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::starting] Enabling systemd unit [default-cleanup-test-unexpected-content-noop-service.service]
[2021-09-24T11:26:40Z INFO  stackable_agent::provider::states::pod::terminated] Pod default-cleanup-test-terminating was terminated
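
To compare a real run against this expected output, the cleanup decisions can be extracted from the log. The following is a minimal sketch that relies on the message format shown above. The function name `summarize_cleanup` is hypothetical, and how the log is captured depends on how the agent is started.

```shell
#!/bin/sh

# Reads agent log lines on stdin and prints "<unit> kept" or
# "<unit> removed" for every decision logged by the cleanup module.
# All other log lines are ignored.
summarize_cleanup() {
    sed -n \
        -e 's/^.*provider::cleanup] The systemd unit \[\([^]]*\)] will be kept .*$/\1 kept/p' \
        -e 's/^.*provider::cleanup] The systemd unit \[\([^]]*\)] will be removed .*$/\1 removed/p'
}
```

Assuming the log was saved to a file such as `agent.log` (a placeholder name), `summarize_cleanup < agent.log` should list one kept unit (`ok`) and three removed units (`unexpected-content`, `terminating`, `no-pod`) for the test script above.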

Leftover

The implementation of SystemDUnit was adapted only as far as necessary. A complete refactoring is still required and will be done in #244.

siegfriedweber commented 3 years ago

Systemd units whose corresponding pod is terminating are now removed.

I extended the test script with the function setup_unit_with_terminating_pod.

State before starting the Stackable Agent:

# kubectl get pod cleanup-test-terminating
NAME                       READY   STATUS        RESTARTS   AGE
cleanup-test-terminating   0/1     Terminating   0          14s

# systemctl status default-cleanup-test-terminating-noop-service.service
● default-cleanup-test-terminating-noop-service.service - default-cleanup-test-terminating-noop-service
   Loaded: loaded (/usr/lib/systemd/system/default-cleanup-test-terminating-noop-service.service; enabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/default-cleanup-test-terminating-noop-service.service.d
           └─zzz-lxc-service.conf
   Active: active (running) since Fri 2021-09-24 11:52:10 UTC; 24s ago
 Main PID: 27865 (start.sh)
   CGroup: /system.slice/system-stackable.slice/default-cleanup-test-terminating-noop-service.service
           ├─27865 /bin/sh /opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
           └─27866 sleep 1d

Sep 24 11:52:10 centos7 systemd[1]: Started default-cleanup-test-terminating-noop-service.
Sep 24 11:52:10 centos7 start.sh[27865]: test-service started

Corresponding log output of the Stackable Agent:

[2021-09-24T11:54:43Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-terminating-noop-service.service] will be removed because the corresponding pod is terminating.
[2021-09-24T11:54:45Z INFO  stackable_agent::provider::states::pod::terminated] Pod default-cleanup-test-terminating was terminated

State after starting the Stackable Agent:

# kubectl get pod cleanup-test-terminating
Error from server (NotFound): pods "cleanup-test-terminating" not found

# systemctl status default-cleanup-test-terminating-noop-service.service
Unit default-cleanup-test-terminating-noop-service.service could not be found.
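
For repeated manual runs, the units and pods created by the test script have to be removed again. The following is a minimal teardown sketch, not part of the original script. The function names and the `DRY_RUN` switch are hypothetical, while the unit and pod names are the ones used in the test script.

```shell
#!/bin/sh

# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stops, disables and deletes one test unit. Errors are ignored because
# the Stackable Agent may already have removed the unit.
remove_test_unit() {
    unit=$1
    run systemctl stop "$unit" || true
    run systemctl disable "$unit" || true
    run rm -f "/lib/systemd/system/$unit"
}

cleanup_test_setup() {
    # No pod is created for the "no-pod" case, so only three pods are deleted.
    # --wait=false keeps kubectl from blocking if the agent is not running.
    run kubectl delete pod --ignore-not-found --wait=false \
        cleanup-test-ok cleanup-test-unexpected-content cleanup-test-terminating

    for name in ok no-pod unexpected-content terminating; do
        remove_test_unit "default-cleanup-test-$name-noop-service.service"
    done
    run systemctl daemon-reload
}
```

Calling `cleanup_test_setup` with `DRY_RUN=1` set first shows what would be executed. If the agent is not running, the deleted pods stay in the Terminating state until it is started again.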