vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

When executing velero backup delete, object data of ${bucket_name}/kopia/${namespace} cannot be deleted from s3 (minio) #7716

Open nemanja1209 opened 2 months ago

nemanja1209 commented 2 months ago

As #6916 was closed without a real solution, I'm opening it again.

Here is some additional information about my config:

Velero server version: 1.13

Backup storage location:

Spec:
  Access Mode:  ReadWrite
  Config:
    Insecure Skip TLS Verify:  true
    Region:                    minio
    s3ForcePathStyle:          true
    s3Url:                     https://my.domain.com:9000
  Default:                     true
  Object Storage:
    Bucket:  velero
  Provider:  aws

Backup storage repository:

Spec:
  Backup Storage Location:  default
  Maintenance Frequency:    1h0m0s
  Repository Type:          kopia
  Restic Identifier:        s3:https://my.domain.com:9000/velero/restic/prometheus
  Volume Namespace:         prometheus

There is a backup schedule job:

NAME              STATUS    CREATED                          SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
bckp-prometheus   Enabled   2024-04-15 15:09:14 +0200 CEST   0 */6 * * *   7h0m0s       2h ago        <none>     false

It creates a backup every 6 hours and the TTL is 7 hours. That means that for one hour after each run I have 2 backups, and after that period only one is left.
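
For reference, a schedule like this can be created with something along these lines (the --include-namespaces value and the --default-volumes-to-fs-backup flag are assumptions, based on the repository's volume namespace and the kopia uploader being used):

velero schedule create bckp-prometheus \
  --schedule="0 */6 * * *" \
  --ttl 7h0m0s \
  --include-namespaces prometheus \
  --default-volumes-to-fs-backup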

The output of velero backup get looks fine; there is only one backup (or two, if you run the command within an hour of the latest backup's creation):

NAME                             STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
bckp-prometheus-20240419060013   Completed   0        0          2024-04-19 08:00:31 +0200 CEST   4h        default            <none>

The PVC being backed up is around 30GB. After 5 days, MinIO showed around 58GB of used space, and when I browsed the bucket, files from the first backup were still there. Now, after 7 days, MinIO shows around 77GB of used space.

Currently, there are about 5300 files in the bucket, and probably more than half of them are from the first backup. The first backup seems to have taken about 15 minutes, because its first file was created at 20:00 and its last one at 20:16.

Over these 7 days, around 26 backups were executed (every 6 hours), and each one (except the first) took no more than 2 minutes (the difference between the first and last created file). Those files are also still present in the bucket.

So the remaining 25 backups account for roughly half of the files in the bucket, and the first backup for the other half (a rough estimate). It is as if the bucket were doing some kind of versioning (incremental backup). For this bucket, object locking (versioning) is disabled.

When I disable the schedule, delete all backups and wait a few days, the Kopia data is still in the bucket under the path ${bucket_name}/kopia/${namespace}. Everything under ${bucket_name}/backups/ is deleted as expected.
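
A quick way to see what is left behind, assuming an mc alias named minio that points at the MinIO endpoint (the alias name and paths are illustrative):

# total size of the leftover kopia data for the prometheus namespace
mc du minio/velero/kopia/prometheus/
# number of objects still present under that prefix
mc ls --recursive minio/velero/kopia/prometheus/ | wc -l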

thomasklosinsky commented 2 months ago

Same thing here with v1.13.2. Did a backup, deleted the backup, waited 3 days, no kopia maintenance was done, i.e. no files deleted... I just activated the debug log level and will send the corresponding logs tomorrow.

nemanja1209 commented 2 months ago

I guess there is a problem with the Kopia maintenance jobs that should be executed automatically in the background. The quick cycle interval is customized (for the test) to 2 minutes. When the timer goes off, it only displays next run: now and nothing happens. When I execute maintenance manually, everything works as expected: the timer is reset to 2 minutes and, when it goes off, it again shows next run: now. The same happens with the full cycle job. Under Recent Maintenance Runs there is only one record (from when the repository was initialized).

Owner: default@default
Quick Cycle:
  scheduled: true
  interval: 2m0s
  next run: now
Full Cycle:
  scheduled: true
  interval: 24h0m0s
  next run: 2024-04-26 17:58:39 CEST (in 8h22m58s)
Log Retention:
  max count:       10000
  max age of logs: 720h0m0s
  max total size:  1.1 GB
Object Lock Extension: disabled
Recent Maintenance Runs:
  cleanup-epoch-manager:
    2024-04-25 17:58:40 CEST (0s) SUCCESS
  cleanup-logs:
    2024-04-25 17:58:40 CEST (0s) SUCCESS
  full-rewrite-contents:
    2024-04-25 17:58:40 CEST (0s) SUCCESS
  snapshot-gc:
    2024-04-25 17:58:39 CEST (0s) SUCCESS

nemanja1209 commented 2 months ago

Workaround: I created a cron job that executes kopia maintenance run --full every 2 hours. It simply simulates the manual command execution that I have confirmed works as expected.
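
Roughly, the workaround looks like this on a host that has the kopia CLI installed and is already connected to the repository as the maintenance owner (schedule, user and paths are illustrative):

# /etc/cron.d/kopia-maintenance -- run full maintenance every 2 hours
# assumes the root user's kopia config (~/.config/kopia/repository.config) points at the velero repo
0 */2 * * * root /usr/local/bin/kopia maintenance run --full >> /var/log/kopia-maintenance.log 2>&1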

thomasklosinsky commented 2 months ago

how do you start the command manually?

kubectl exec -n velero velero-66b9bc65c7-tp7jt -- "kopia maintenance --full"

Defaulted container "velero" out of: velero, velero-velero-plugin-for-aws (init), velero-velero-plugin-for-csi (init)
error: Internal error occurred: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "e4648fa3b4fd7089d18e4809b644fd56b406cad3018cbdcb9a549ba6eb085279": OCI runtime exec failed: exec failed: unable to start container process: exec: "kopia maintenance --full": executable file not found in $PATH: unknown

thomasklosinsky commented 2 months ago

Seems as if kopia full maintenance is working here:

time="2024-05-06T12:24:33Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/backuptest-default-kopia-msvvv logSource="pkg/controller/backup_repository_controller.go:289" time="2024-05-06T12:24:35Z" level=info msg="Running full maintenance..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Running full maintenance..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Rewriting contents from short packs..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Total bytes rewritten 0 B" logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Not enough time has passed since previous successful Snapshot GC. Will try again next time." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Skipping blob deletion because not enough time has passed yet (59m59s left)." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Cleaned up 0 logs." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Cleaning up old index blobs which have already been compacted..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Finished full maintenance." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]" time="2024-05-06T12:24:35Z" level=info msg="Finished full maintenance." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"

Lyndon-Li commented 2 months ago

For data safety reasons, the Kopia repo keeps unused data for some time before fully deleting it. Therefore, you need to let maintenance keep running until it is safe for the repo to delete the unused data.

insider89 commented 1 month ago

I have the same issue: storage keeps growing indefinitely. Even after I delete the namespace on the cluster and delete the Velero backup, the kopia/{namespace} directory on s3 is still present after multiple days and new files are written there (even though the namespace no longer exists on the cluster). Old files aren't deleted either; storage keeps growing.

insider89 commented 1 month ago

The namespace staging-oleksandr-besu-fe6e was deleted more than 24h ago. The backuprepositories custom resource is still present. I've deleted all backups that could be related to the staging-oleksandr-besu-fe6e namespace. In the bucket I still see new files appear on every backup schedule run:

[Screenshot 2024-05-17 at 11 04 41]

My namespaces:

NAME                   STATUS   AGE
btp                    Active   64d
btp-platform           Active   34d
clustermanager         Active   422d
default                Active   422d
development            Active   64d
ingress                Active   422d
kube-node-lease        Active   422d
kube-public            Active   422d
kube-system            Active   422d
shared                 Active   422d
staging-besu1n1-a5d2   Active   22d
staging-ext1n1-46d0    Active   22d
velero                 Active   104d

My backuprepositories:

NAME                                              AGE   REPOSITORY TYPE
staging-besu1n1-a5d2-default-kopia-q9lgw          41h   kopia
staging-besu3n1-c06e-default-kopia-q8cg6          18h   kopia
staging-besu4n1-d840-default-kopia-p6mp8          17h   kopia
staging-besu5n1-a944-default-kopia-45lpx          16h   kopia
staging-ext1n1-46d0-default-kopia-djvg5           41h   kopia
staging-oleksandr-besu-fe6e-default-kopia-96bqm   40h   kopia

Velero log lines related to my namespace:

Defaulted container "velero" out of: velero, velero-plugin-for-aws (init)
time="2024-05-17T07:15:22Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/staging-oleksandr-besu-fe6e-default-kopia-96bqm logSource="pkg/controller/backup_repository_controller.go:290"
time="2024-05-17T07:47:32Z" level=info msg="Processing item" backup=velero/hourly-20240517074722 logSource="pkg/backup/backup.go:365" name=staging-oleksandr-besu-fe6e-default-kopia-96bqm namespace=velero progress= resource=backuprepositories.velero.io
time="2024-05-17T07:47:32Z" level=info msg="Backing up item" backup=velero/hourly-20240517074722 logSource="pkg/backup/item_backupper.go:179" name=staging-oleksandr-besu-fe6e-default-kopia-96bqm namespace=velero resource=backuprepositories.velero.io
time="2024-05-17T07:47:32Z" level=info msg="Backed up 1118 items out of an estimated total of 1129 (estimate will change throughout the backup)" backup=velero/hourly-20240517074722 logSource="pkg/backup/backup.go:405" name=staging-oleksandr-besu-fe6e-default-kopia-96bqm namespace=velero progress= resource=backuprepositories.velero.io
time="2024-05-17T07:50:23Z" level=info msg="invoking DeleteItemAction plugins" item=staging-oleksandr-besu-fe6e-default-kopia-96bqm logSource="internal/delete/delete_item_action_handler.go:116" namespace=velero

Why does it keep writing files to that folder in the bucket, even though the namespace doesn't exist? And why is it still not deleted? As I understand it, kopia runs full maintenance every 24h, which means we should not see this folder in the bucket at all?

nemanja1209 commented 1 month ago

I think you have to delete all backuprepositories if you want to stop new files from being written to the kopia repository. Also, my guess is that the files come from the maintenance job:

kubectl describe backuprepositories.velero.io mq-default-kopia-8tk9v 
Name:         mq-default-kopia-8tk9v
Namespace:    velero
Labels:       velero.io/repository-type=kopia
              velero.io/storage-location=default
              velero.io/volume-namespace=mq
Annotations:  <none>
API Version:  velero.io/v1
Kind:         BackupRepository
Metadata:
  Creation Timestamp:  2024-05-17T07:26:50Z
  Generate Name:       mq-default-kopia-
  Generation:          4
  Managed Fields:
    API Version:  velero.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:velero.io/repository-type:
          f:velero.io/storage-location:
          f:velero.io/volume-namespace:
      f:spec:
        .:
        f:backupStorageLocation:
        f:maintenanceFrequency:
        f:repositoryType:
        f:resticIdentifier:
        f:volumeNamespace:
      f:status:
        .:
        f:lastMaintenanceTime:
        f:message:
        f:phase:
    Manager:         velero-server
    Operation:       Update
    Time:            2024-05-17T08:27:15Z
  Resource Version:  3848152
  UID:               6013f45b-4d61-46e6-83aa-67a4ab507898
Spec:
  Backup Storage Location:  default
  Maintenance Frequency:    1h0m0s
  Repository Type:          kopia
  Restic Identifier:        s3:https://name.mydomain.com:9000/velero-dev/restic/mq
  Volume Namespace:        mq
Status:
  Last Maintenance Time:  2024-05-17T07:26:51Z

Can you paste the output of the following command (while backups are running and files are not being deleted): kopia maintenance info

Before that step, you have to connect to the kopia repository; command example: kopia repository connect s3 --endpoint name.mydomain.com:{portifneeded} --bucket bucket-name --access-key your-key --secret-access-key your-secret --disable-tls-verification --prefix kopia/namespace/ --password 'static-passw0rd'

The value for the last argument, --password, can be found in the velero namespace in the secret velero-repo-credentials.
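
For example, the password can be read with something like this (the data key is repository-password in current Velero releases; verify with kubectl describe secret if yours differs):

kubectl -n velero get secret velero-repo-credentials \
  -o jsonpath='{.data.repository-password}' | base64 -d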

insider89 commented 1 month ago

Hey @nemanja1209. I have dynamic namespaces in my environments, so I have a lot of backuprepositories all the time. I thought that when I delete the backups tied to a backuprepository, the backuprepository would be deleted automatically with the next kopia full maintenance run? If that's not true, it means I need to clean up already deleted namespaces manually all the time?

apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  creationTimestamp: "2024-05-15T15:06:34Z"
  generateName: staging-oleksandr-besu-fe6e-default-kopia-
  generation: 46
  labels:
    velero.io/repository-type: kopia
    velero.io/storage-location: default
    velero.io/volume-namespace: staging-oleksandr-besu-fe6e
  name: staging-oleksandr-besu-fe6e-default-kopia-96bqm
  namespace: velero
  resourceVersion: "198127217"
  uid: 5f125d9d-0d3d-46fd-8329-fef7c54e77d3
spec:
  backupStorageLocation: default
  maintenanceFrequency: 1h0m0s
  repositoryType: kopia
  resticIdentifier: s3:s3-eu-central-1.amazonaws.com/mybucket/restic/staging-oleksandr-besu-fe6e
  volumeNamespace: staging-oleksandr-besu-fe6e
status:
  lastMaintenanceTime: "2024-05-17T10:18:35Z"
  phase: Ready

> kopia maintenance info
Owner: default@default
Quick Cycle:
  scheduled: true
  interval: 1h0m0s
  next run: now
Full Cycle:
  scheduled: true
  interval: 24h0m0s
  next run: 2024-05-17 19:10:59 EEST (in 5h29m28s)
Log Retention:
  max count:       10000
  max age of logs: 720h0m0s
  max total size:  1.1 GB
Object Lock Extension: disabled
Recent Maintenance Runs:
  full-drop-deleted-content:
    2024-05-16 19:10:59 EEST (0s) SUCCESS
  full-rewrite-contents:
    2024-05-15 19:10:59 EEST (0s) SUCCESS
  snapshot-gc:
    2024-05-16 19:10:59 EEST (0s) SUCCESS
    2024-05-15 19:10:59 EEST (0s) SUCCESS
  cleanup-epoch-manager:
    2024-05-16 19:11:00 EEST (0s) SUCCESS
    2024-05-15 19:10:59 EEST (0s) SUCCESS
  cleanup-logs:
    2024-05-16 19:11:00 EEST (0s) SUCCESS
    2024-05-15 19:10:59 EEST (0s) SUCCESS
  full-delete-blobs:
    2024-05-16 19:11:00 EEST (0s) SUCCESS

BTW, can we adjust those values via the Velero Helm chart? I didn't find how to do it.

nemanja1209 commented 1 month ago

BackupRepository is a CRD. CRDs are never deleted automatically: deleting a CRD deletes all of that CRD's objects across all namespaces in the cluster, so Helm will not delete CRDs.

Also, my guess about the files not being deleted is that the maintenance job is not executing the way it should. After connecting to the repository you can try to execute this command manually (to get ahead of the maintenance schedule) and check whether the files from previous backups are still there.

kopia maintenance run --full --safety=none

The expected result is to have only files from backups that have not expired yet; all other files should be removed. In our environment this workaround works as expected; the only problem is that we had to create a cron job that executes this command on a schedule. As you can see,

Quick Cycle:
  scheduled: true
  interval: 1h0m0s
  next run: now

suggests that maintenance should run immediately, but it does not.

insider89 commented 1 month ago

@nemanja1209 any trick for how to run full maintenance?


kopia maintenance run --full --safety=none
ERROR maintenance must be run by designated user: default@default

I didn't find another way to set the owner.

nemanja1209 commented 1 month ago

Try this one:

kopia maintenance set --owner=me

Don't change anything else; me means that the current user will get permission.
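
Putting it together, the sequence is roughly (--safety=none skips Kopia's built-in grace periods, so use it deliberately):

kopia maintenance set --owner=me
kopia maintenance info                       # the Owner line should now show your current user@host
kopia maintenance run --full --safety=none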

insider89 commented 1 month ago

Thx. Now I can run full maintenance. The execution log:

Running full maintenance...
Looking for active contents...
Looking for unreferenced contents...
GC found 72 unused contents (18.6 MB)
GC found 0 unused contents that are too recent to delete (0 B)
GC found 0 in-use contents (0 B)
GC found 57 in-use system-contents (32.6 KB)
Rewriting contents from short packs...
Total bytes rewritten 18.6 MB
Found safe time to drop indexes: 2024-05-17 14:20:39.478219 +0300 EEST
Dropping contents deleted before 2024-05-17 14:20:39.478219 +0300 EEST
Looking for unreferenced blobs...
Deleted total 73 unreferenced blobs (18.9 MB)
Compacting an eligible uncompacted epoch...
Cleaning up no-longer-needed epoch markers...
Attempting to compact a range of epoch indexes ...
Cleaning up unneeded epoch markers...
Cleaning up old index blobs which have already been compacted...
Cleaned up 0 logs.
Finished full maintenance.

Indeed some files were deleted from the bucket, but I still see a lot of old _log.* files. As I understand it, it's not possible to clean them up with kopia?

[Screenshot 2024-05-17 at 14 23 26]

The only way to clean it up fully:

  1. Delete backup repository CRD
  2. Delete folder on s3 bucket(kopia/namespace ?)

And this process can be done only manually, am I right?

nemanja1209 commented 1 month ago

That is the only way as far as I know. I would highly appreciate it if someone could tell us a better way :)
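
For reference, the manual cleanup roughly amounts to the following (the namespace, MinIO alias and bucket names are illustrative; make sure no remaining backups reference the repository first):

# delete the leftover BackupRepository object(s) for the removed namespace
kubectl -n velero delete backuprepositories.velero.io \
  -l velero.io/volume-namespace=staging-oleksandr-besu-fe6e
# remove the leftover kopia data from the bucket (aws s3 rm --recursive works similarly for AWS)
mc rm --recursive --force minio/velero/kopia/staging-oleksandr-besu-fe6e/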

WRKT commented 1 month ago

Hello, I have a question: how do you run the kopia maintenance --full command if Velero was installed using the chart?

Like @thomasklosinsky, I tried with kubectl but I got this error: [image]

I'm facing the same issue: it's been a month, and the kopia repository in s3 keeps growing in size.

nemanja1209 commented 1 month ago

I did it with the Kopia CLI. First, I installed it.

[image]
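
(On a Debian/Ubuntu host the install is roughly the following; take the exact commands for your platform from kopia.io/docs/installation:)

curl -s https://kopia.io/signing-key | sudo gpg --dearmor -o /usr/share/keyrings/kopia-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/kopia-keyring.gpg] http://packages.kopia.io/apt/ stable main" \
  | sudo tee /etc/apt/sources.list.d/kopia.list
sudo apt update && sudo apt install kopia
kopia --version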

Then you need to connect to the repository and change the repository owner to me as described in the messages above.

kopia repository connect s3 --endpoint name.mydomain.com:{portifneeded} --bucket bucket-name --access-key your-key --secret-access-key your-secret --disable-tls-verification --prefix kopia/namespace/ --password 'static-passw0rd'

kopia maintenance set --owner=me

kopia maintenance run --full

WRKT commented 1 month ago

So in my scenario, let's say I have a pod that manages all Velero operations and can interact with the Velero API; call it "toolbox". I should install the kopia CLI in this pod too so I can connect to my repo?

nemanja1209 commented 1 month ago

We have a similar situation; the only difference is that we have a VM instead of a pod. So I guess you can try it that way.

WRKT commented 1 month ago

Ok thanks! I'll try that! Thank you so much ;)

Can you confirm that in your case, when you ran the kopia maintenance command, it cleaned up the kopia/ folder in the s3 bucket?

nemanja1209 commented 1 month ago

Yes, but it depends on the arguments you pass. If you set safety=none ($ kopia maintenance run --full --safety=none) then it should clean the repository immediately. If you run just kopia maintenance run --full, Kopia has a mechanism that calculates whether it is safe to delete the unused objects, so you will need to wait some more time.

WRKT commented 1 month ago

OK thanks, I'll try it right away!

WRKT commented 1 month ago

I ran the command, and I got this error: [image]

Do you know how to fix it? It's my first time using kopia.

nemanja1209 commented 1 month ago

Please post your kopia repository connect command

WRKT commented 1 month ago

I messed up, I didn't use the correct endpoint, but now I get a different error. This is the command that I ran: kopia repository connect s3 --endpoint s3.eu-west-2.wasabisys.com --bucket xxxxxxxxxxx --access-key xxxxxxxxxxxxxx --secret-access-key xxxxxxxxxxxxxxxxxxxxxx --prefix=kopia/namespace/ --password 'XXXXXXXXXXX'

And below is the error: [image]

nemanja1209 commented 1 month ago

It could be a wrong prefix. Did you literally put the word namespace, or did you replace it with the real name of the namespace? It should be the path as it is in s3.

WRKT commented 1 month ago

I put the real name of the namespace; I just replaced it when pasting here as an example. So it should start like s3://bucket/prefix?

nemanja1209 commented 1 month ago

No, I asked about this part: prefix=kopia/namespace/. namespace should be changed to the real one.

WRKT commented 1 month ago

Yes, I changed it to the real one, like in my s3 bucket.

nemanja1209 commented 1 month ago

Can you check the path of the Kopia files in the bucket? This is an example: [IMG_4884]

So, the bucket name is testbucket and the prefix is kopia/prometheus/. Also, I think the "=" character should not be there.
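
For example, with that layout the connect command would look roughly like this (bucket, prefix and credentials are placeholders; the endpoint is the Wasabi one from above):

kopia repository connect s3 \
  --endpoint s3.eu-west-2.wasabisys.com \
  --bucket testbucket \
  --prefix kopia/prometheus/ \
  --access-key your-key \
  --secret-access-key your-secret \
  --password 'repo-password-from-velero-repo-credentials'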

WRKT commented 1 month ago

Ok, I'll try to remove it; I put it there following the kopia documentation. Thanks for the insight.