openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

ETCD-636: Add etcd-backup-server sidecar #1325

Closed: Elbehery closed this 2 weeks ago

Elbehery commented 1 month ago

This is a rebased version of https://github.com/openshift/cluster-etcd-operator/pull/1301

openshift-ci-robot commented 1 month ago

@Elbehery: This pull request references ETCD-635 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1325):

> This is a rebased version of https://github.com/openshift/cluster-etcd-operator/pull/1301

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [Elbehery]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci-robot commented 1 month ago

@Elbehery: This pull request references ETCD-636 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1325):

> This is a rebased version of https://github.com/openshift/cluster-etcd-operator/pull/1301

Elbehery commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@Elbehery: This pull request references ETCD-636 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1325#issuecomment-2308403275):

> /jira refresh

Elbehery commented 1 month ago

/label tide/merge-method-squash

Elbehery commented 1 month ago

A new test has now been executed on an OCP 4.17-ec3 cluster with this PR applied on top.

Backup from master node etcd-ip-10-0-88-52.ec2.internal

sh-5.1# ls -l /var/backup/etcd/current-backup/
total 115720
-rw-------. 1 root root 118415392 Aug 24 15:40 snapshot_2024-08-24_154000.db
-rw-------. 1 root root     77293 Aug 24 15:40 static_kuberesources_2024-08-24_154000.tar.gz

Backup from master node etcd-ip-10-0-56-114.ec2.internal

sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 24 15:40 current-backup
sh-5.1# ls -l /var/backup/etcd/current-backup/
total 115604
-rw-------. 1 root root 118296608 Aug 24 15:40 snapshot_2024-08-24_154000.db
-rw-------. 1 root root     77291 Aug 24 15:40 static_kuberesources_2024-08-24_154000.tar.gz 

Backup from master node etcd-ip-10-0-48-53.ec2.internal

sh-5.1#  ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 24 15:40 current-backup
sh-5.1#  ls -l /var/backup/etcd/current-backup/
total 115556
-rw-------. 1 root root 118247456 Aug 24 15:40 snapshot_2024-08-24_154000.db
-rw-------. 1 root root     77281 Aug 24 15:40 static_kuberesources_2024-08-24_154000.tar.gz 
Elbehery commented 1 month ago

/retest-required

Elbehery commented 1 month ago

Tested the Backup Pruning functionality

CR used

apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
  annotations:
    default: "true"
spec:
  etcd:
    schedule: "*/5 * * * *"
    timeZone: "UTC"
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 3
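
For orientation, the `*/5 * * * *` schedule in this CR fires at every minute divisible by 5. The Python sketch below lists the backup folder names one would expect from such a schedule, assuming folders are named after the run time in the `YYYY-MM-DD_HHMMSS` form seen in the listings in this thread. The function name `next_backup_folders` and the constant `FOLDER_FORMAT` are hypothetical, and this is an illustration only, not the operator's scheduling code.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed from folder names such as 2024-08-26_003500 in this thread.
FOLDER_FORMAT = "%Y-%m-%d_%H%M%S"


def next_backup_folders(after: datetime, count: int, step_minutes: int = 5) -> List[str]:
    """Return folder names for the next `count` runs of "*/step_minutes * * * *".

    A "*/N" minute field matches every minute of the hour that is divisible
    by N, so we walk forward minute by minute, strictly after `after`.
    """
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    names: List[str] = []
    while len(names) < count:
        if t.minute % step_minutes == 0:
            names.append(t.strftime(FOLDER_FORMAT))
        t += timedelta(minutes=1)
    return names


if __name__ == "__main__":
    for name in next_backup_folders(datetime(2024, 8, 26, 0, 31), 4):
        print(name)
```

Starting from 00:31 UTC on 2024-08-26, the four names produced correspond to runs at 00:35, 00:40, 00:45 and 00:50.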

backups

sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 00:35 2024-08-26_003500
drwxr-xr-x. 2 root root 96 Aug 26 00:40 2024-08-26_004000
drwxr-xr-x. 2 root root 96 Aug 26 00:45 2024-08-26_004500
sh-5.1# 
sh-5.1# 
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 00:40 2024-08-26_004000
drwxr-xr-x. 2 root root 96 Aug 26 00:45 2024-08-26_004500
drwxr-xr-x. 2 root root 96 Aug 26 00:50 2024-08-26_005000

logs

I0826 00:50:00.822533       1 prune.go:217] found backup folders: [Name=[2024-08-26_003500] SizeBytes=[101444980] ModTime=[2024-08-26 00:35:01.438205464 +0000 UTC] Name=[2024-08-26_004000] SizeBytes=[116768116] ModTime=[2024-08-26 00:40:00.628250372 +0000 UTC] Name=[2024-08-26_004500] SizeBytes=[126713204] ModTime=[2024-08-26 00:45:00.882332755 +0000 UTC] Name=[2024-08-26_005000] SizeBytes=[63667572] ModTime=[2024-08-26 00:50:00.82139288 +0000 UTC]]
I0826 00:50:00.822568       1 prune.go:166] deleting [/var/backup/etcd/2024-08-26_003500]...
I0826 00:50:00.833594       1 prune.go:172] pruning successful
2024/08/26 00:50:00 [tasker] task [*/5 * * * *][#1] ran successfully
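
The listings and log above are consistent with count-based retention: with `maxNumberOfBackups: 3`, a fourth folder leads to deletion of the oldest one. A minimal Python sketch of that selection step follows, assuming folders are ranked by modification time. `BackupDir` and `select_for_pruning` are hypothetical names; this is not the code in `prune.go`.

```python
from datetime import datetime
from typing import List, NamedTuple


class BackupDir(NamedTuple):
    name: str
    mod_time: datetime


def select_for_pruning(dirs: List[BackupDir], max_backups: int) -> List[BackupDir]:
    """Return the oldest folders that exceed `max_backups`, oldest first."""
    ordered = sorted(dirs, key=lambda d: d.mod_time)
    excess = len(ordered) - max_backups
    return ordered[:excess] if excess > 0 else []


# Folder names and (truncated) modification times taken from the log line above.
found = [
    BackupDir("2024-08-26_003500", datetime(2024, 8, 26, 0, 35, 1, 438205)),
    BackupDir("2024-08-26_004000", datetime(2024, 8, 26, 0, 40, 0, 628250)),
    BackupDir("2024-08-26_004500", datetime(2024, 8, 26, 0, 45, 0, 882332)),
    BackupDir("2024-08-26_005000", datetime(2024, 8, 26, 0, 50, 0, 821392)),
]

if __name__ == "__main__":
    for d in select_for_pruning(found, max_backups=3):
        print("would delete", d.name)
```

With the four folders from the log and a limit of 3, only `2024-08-26_003500` is selected, which matches the `deleting [...]` line.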
Elbehery commented 1 month ago

/retest-required

Elbehery commented 1 month ago

This commit adds the ability to disable etcd-backup-server.

Tested successfully, as shown below:

Backups on etcd-ip-10-0-109-64.ec2.internal

oc rsh -c etcd-backup-server -n openshift-etcd pod/etcd-ip-10-0-109-64.ec2.internal 
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 15:10 2024-08-26_151000
drwxr-xr-x. 2 root root 96 Aug 26 15:15 2024-08-26_151500
drwxr-xr-x. 2 root root 96 Aug 26 15:20 2024-08-26_152000

Pruning on etcd-ip-10-0-109-64.ec2.internal

I0826 15:15:01.494205       1 prune.go:217] found backup folders: [Name=[2024-08-26_145500] SizeBytes=[97046143] ModTime=[2024-08-26 14:55:01.497759289 +0000 UTC] Name=[2024-08-26_150000] SizeBytes=[97046143] ModTime=[2024-08-26 15:00:00.773084381 +0000 UTC] Name=[2024-08-26_151000] SizeBytes=[103231103] ModTime=[2024-08-26 15:10:01.192713639 +0000 UTC] Name=[2024-08-26_151500] SizeBytes=[103681663] ModTime=[2024-08-26 15:15:01.493000739 +0000 UTC]]
I0826 15:15:01.494275       1 prune.go:166] deleting [/var/backup/etcd/2024-08-26_145500]...
I0826 15:15:01.494409       1 prune.go:172] pruning successful

Backups on etcd-ip-10-0-53-32.ec2.internal

oc rsh -c etcd-backup-server -n openshift-etcd pod/etcd-ip-10-0-53-32.ec2.internal
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 15:05 2024-08-26_150500
drwxr-xr-x. 2 root root 96 Aug 26 15:10 2024-08-26_151000
drwxr-xr-x. 2 root root 96 Aug 26 15:15 2024-08-26_151500

Pruning on etcd-ip-10-0-53-32.ec2.internal

I0826 15:20:01.120093       1 prune.go:217] found backup folders: [Name=[2024-08-26_150500] SizeBytes=[101723795] ModTime=[2024-08-26 15:05:01.445474695 +0000 UTC] Name=[2024-08-26_151000] SizeBytes=[103124627] ModTime=[2024-08-26 15:10:00.726808898 +0000 UTC] Name=[2024-08-26_151500] SizeBytes=[103575187] ModTime=[2024-08-26 15:15:00.982240516 +0000 UTC] Name=[2024-08-26_152000] SizeBytes=[103575187] ModTime=[2024-08-26 15:20:01.118677999 +0000 UTC]]
I0826 15:20:01.120121       1 prune.go:166] deleting [/var/backup/etcd/2024-08-26_150500]...
I0826 15:20:01.129402       1 prune.go:172] pruning successful

Backups on etcd-ip-10-0-99-75.ec2.internal

oc rsh -c etcd-backup-server -n openshift-etcd pod/etcd-ip-10-0-99-75.ec2.internal
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 15:05 2024-08-26_150500
drwxr-xr-x. 2 root root 96 Aug 26 15:10 2024-08-26_151000
drwxr-xr-x. 2 root root 96 Aug 26 15:15 2024-08-26_151500
sh-5.1# 

Pruning on etcd-ip-10-0-99-75.ec2.internal

I0826 15:25:01.497547       1 prune.go:217] found backup folders: [Name=[2024-08-26_151000] SizeBytes=[103202459] ModTime=[2024-08-26 15:10:00.812867119 +0000 UTC] Name=[2024-08-26_151500] SizeBytes=[103653019] ModTime=[2024-08-26 15:15:01.06948225 +0000 UTC] Name=[2024-08-26_152000] SizeBytes=[103653019] ModTime=[2024-08-26 15:20:01.31899995 +0000 UTC] Name=[2024-08-26_152500] SizeBytes=[103653019] ModTime=[2024-08-26 15:25:01.495566387 +0000 UTC]]
I0826 15:25:01.497589       1 prune.go:166] deleting [/var/backup/etcd/2024-08-26_151000]...
I0826 15:25:01.507569       1 prune.go:172] pruning successful
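
When comparing pruning across nodes, the `found backup folders` lines can be reduced to name and size pairs. The Python sketch below does this with a regular expression written against the log format shown above. `parse_found_folders` is a hypothetical helper, and `SAMPLE` is a shortened copy of the last log line (two of its four entries), so treat it as an illustration rather than a tool shipped with the operator.

```python
import re
from typing import List, Tuple

# Matches one "Name=[...] SizeBytes=[...]" entry as printed in the log lines above.
ENTRY = re.compile(r"Name=\[(?P<name>[^\]]+)\] SizeBytes=\[(?P<size>\d+)\]")


def parse_found_folders(line: str) -> List[Tuple[str, int]]:
    """Extract (folder name, size in bytes) pairs from a 'found backup folders' line."""
    return [(m.group("name"), int(m.group("size"))) for m in ENTRY.finditer(line)]


# Shortened from the etcd-ip-10-0-99-75 log line: first and last entries only.
SAMPLE = (
    "I0826 15:25:01.497547       1 prune.go:217] found backup folders: "
    "[Name=[2024-08-26_151000] SizeBytes=[103202459] "
    "ModTime=[2024-08-26 15:10:00.812867119 +0000 UTC] "
    "Name=[2024-08-26_152500] SizeBytes=[103653019] "
    "ModTime=[2024-08-26 15:25:01.495566387 +0000 UTC]]"
)

if __name__ == "__main__":
    for name, size in parse_found_folders(SAMPLE):
        print(name, size)
```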

Backup CR

apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
   name: testbackup
spec:
   etcd:
      schedule: "20 4 * * *"
      timeZone: "UTC"
      retentionPolicy:
         retentionType: RetentionNumber
         retentionNumber:
            maxNumberOfBackups: 5
      pvcName: etcd-backup-pvc

logs from etcd-ip-10-0-109-64.ec2.internal

oc logs -f pod/etcd-ip-10-0-109-64.ec2.internal -n openshift-etcd -c etcd-backup-server 
I0826 15:28:54.503080       1 backupserver.go:71] backup-server is disabled

backups from etcd-ip-10-0-109-64.ec2.internal

oc rsh -c etcd-backup-server -n openshift-etcd pod/etcd-ip-10-0-109-64.ec2.internal 
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 15:15 2024-08-26_151500
drwxr-xr-x. 2 root root 96 Aug 26 15:20 2024-08-26_152000
drwxr-xr-x. 2 root root 96 Aug 26 15:25 2024-08-26_152500

logs from etcd-ip-10-0-53-32.ec2.internal

oc logs -f pod/etcd-ip-10-0-53-32.ec2.internal -n openshift-etcd -c etcd-backup-server 
I0826 15:30:38.673710       1 backupserver.go:71] backup-server is disabled

backups from etcd-ip-10-0-53-32.ec2.internal

oc rsh -c etcd-backup-server -n openshift-etcd pod/etcd-ip-10-0-53-32.ec2.internal 
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 15:15 2024-08-26_151500
drwxr-xr-x. 2 root root 96 Aug 26 15:20 2024-08-26_152000
drwxr-xr-x. 2 root root 96 Aug 26 15:25 2024-08-26_152500

logs from etcd-ip-10-0-99-75.ec2.internal

oc logs -f etcd-ip-10-0-99-75.ec2.internal -n openshift-etcd -c etcd-backup-server  
I0826 15:27:00.988012       1 backupserver.go:71] backup-server is disabled

backups from etcd-ip-10-0-99-75.ec2.internal

oc rsh -c etcd-backup-server -n openshift-etcd pod/etcd-ip-10-0-99-75.ec2.internal  
sh-5.1# ls -l /var/backup/etcd/
total 0
drwxr-xr-x. 2 root root 96 Aug 26 15:15 2024-08-26_151500
drwxr-xr-x. 2 root root 96 Aug 26 15:20 2024-08-26_152000
drwxr-xr-x. 2 root root 96 Aug 26 15:25 2024-08-26_152500
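
To check the disabled state on every node without reading each log by hand, the `backupserver.go` lines can be parsed using the standard klog header layout (`Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg`). The Python sketch below is one way to do that. `parse_klog` and `is_backup_server_disabled` are hypothetical helper names, and the message text is taken verbatim from the logs above.

```python
import re
from typing import Dict, Iterable, Optional

# klog header: severity letter, month/day, time with microseconds, thread id, file:line.
KLOG = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_klog(line: str) -> Optional[Dict[str, str]]:
    """Split a klog-formatted line into its header fields and message."""
    m = KLOG.match(line.rstrip())
    return m.groupdict() if m else None


def is_backup_server_disabled(lines: Iterable[str]) -> bool:
    """True if any line is the 'backup-server is disabled' message from backupserver.go."""
    for line in lines:
        rec = parse_klog(line)
        if rec and rec["file"] == "backupserver.go" and rec["msg"] == "backup-server is disabled":
            return True
    return False


if __name__ == "__main__":
    log = ["I0826 15:28:54.503080       1 backupserver.go:71] backup-server is disabled"]
    print(is_backup_server_disabled(log))
```

The input could be the output of the `oc logs ... -c etcd-backup-server` commands shown above, collected once per node.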
Elbehery commented 1 month ago

/retest-required

Elbehery commented 1 month ago

/retest-required

Elbehery commented 1 month ago

/retest-required

Elbehery commented 1 month ago

/retest-required

Elbehery commented 1 month ago

/retest-required

openshift-ci-robot commented 4 weeks ago

@Elbehery: This pull request references ETCD-661 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1325):

> This is a rebased version of https://github.com/openshift/cluster-etcd-operator/pull/1301

openshift-ci-robot commented 4 weeks ago

@Elbehery: This pull request references ETCD-636 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1325):

> This is a rebased version of https://github.com/openshift/cluster-etcd-operator/pull/1301

Elbehery commented 4 weeks ago

this PR has been split into

openshift-merge-robot commented 3 weeks ago

PR needs rebase.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 3 weeks ago

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test verify` |
| ci/prow/unit | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test unit` |
| ci/prow/images | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test images` |
| ci/prow/verify-deps | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test verify-deps` |
| ci/prow/e2e-agnostic-ovn | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test e2e-agnostic-ovn` |
| ci/prow/e2e-aws-ovn-single-node | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test e2e-aws-ovn-single-node` |
| ci/prow/e2e-operator-fips | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | false | `/test e2e-operator-fips` |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | false | `/test e2e-metal-ovn-ha-cert-rotation-shutdown` |
| ci/prow/e2e-operator | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test e2e-operator` |
| ci/prow/e2e-aws-etcd-certrotation | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | false | `/test e2e-aws-etcd-certrotation` |
| ci/prow/e2e-agnostic-ovn-upgrade | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test e2e-agnostic-ovn-upgrade` |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | false | `/test e2e-metal-ovn-sno-cert-rotation-shutdown` |
| ci/prow/e2e-aws-etcd-recovery | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | false | `/test e2e-aws-etcd-recovery` |
| ci/prow/e2e-aws-ovn-etcd-scaling | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test e2e-aws-ovn-etcd-scaling` |
| ci/prow/e2e-aws-ovn-serial | 170732d25d9b3d7ce095aecd56578f4cbe21e2ce | link | true | `/test e2e-aws-ovn-serial` |

Full PR test history. Your PR dashboard.

Elbehery commented 2 weeks ago

Closing this in favor of:

- https://github.com/openshift/cluster-etcd-operator/pull/1304
- https://github.com/openshift/cluster-etcd-operator/pull/1305
- https://github.com/openshift/cluster-etcd-operator/pull/1306