vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Configurable Kopia Maintenance Interval #8364

Open kaovilai opened 3 weeks ago

kaovilai commented 3 weeks ago

Describe the problem/challenge you have

We want the ability to configure the maintenance interval so that changes to storage take effect more quickly in some cases.

These can be configured in the repo-maintenance-job-configmap

cc: @shubham-pampattiwar @weshayutin

kaovilai commented 3 weeks ago

I can be assigned this issue

sseago commented 3 weeks ago

@kaovilai I think what we want here is just a bool entry -- "alwaysUseFullMaintenance" or something. Auto is the default (so the bool is false), which results in Kopia doing one full maintenance per day, with the rest being quick. When we set this to true, we'll want every maintenance run to be full. I don't think we ever want to always do quick -- that would mean data is never cleaned up.
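
A minimal sketch of that mapping in Go, for illustration only. The `MaintenanceMode` type, its constants, and `selectMode` are names made up for this sketch, not Velero or Kopia identifiers; only the flag name comes from the suggestion above.

```go
package main

import "fmt"

// MaintenanceMode names the modes discussed in this thread.
// Hypothetical type, local to this sketch.
type MaintenanceMode string

const (
	ModeAuto  MaintenanceMode = "auto"  // one full run per day, the rest quick
	ModeFull  MaintenanceMode = "full"  // every run is a full run
	ModeQuick MaintenanceMode = "quick" // never selected: data would never be cleaned up
)

// selectMode maps the proposed alwaysUseFullMaintenance bool to a mode.
// false keeps the default (auto); true forces full on every run.
func selectMode(alwaysUseFullMaintenance bool) MaintenanceMode {
	if alwaysUseFullMaintenance {
		return ModeFull
	}
	return ModeAuto
}

func main() {
	fmt.Println(selectMode(false))
	fmt.Println(selectMode(true))
}
```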

kaovilai commented 3 weeks ago

Sure. Bool entry if that works for everyone.

Lyndon-Li commented 3 weeks ago

Could we clarify the scenarios in which we want users to configure the maintenance mode?

Basically, we don't want users to change the mode, because full maintenance and quick maintenance are very different from each other: they are designed to alternate, with the quick one running more frequently. Changing the mode manually may have unexpected consequences:

Retaining data for a reasonable time is a Kopia policy that keeps the system working correctly, so manually changing the maintenance mode would not cause the data to be deleted earlier. We should therefore let users know that the repo maintains data at its own pace, which is what ensures data safety.

Lyndon-Li commented 3 weeks ago

Another point:

Therefore, it is neither safe nor necessary to add the maintenance mode to the Unified Repository. At present, we let the repo itself decide how to do maintenance, including the mode and frequency, and offload the maintenance work entirely to the repo.

kaovilai commented 3 weeks ago

> Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss.

We probably want this to occur more often at least for testing/debugging. And we've been getting customer cases where they are saying maintenance does not actually work for them so backup expires but nothing is getting deleted.

https://hackmd.io/12AKVvCnRlmyBksgXJls5Q

Lyndon-Li commented 3 weeks ago

> And we've been getting customer cases where they are saying maintenance does not actually work for them so backup expires but nothing is getting deleted

This may be the expected behavior; e.g., the data may still be referenced by other backups and should not be deleted. Otherwise, we need to treat it as a bug and find the root cause before making changes.

> We probably want this to occur more often at least for testing/debugging

For this purpose, if the debugging happens in users' production environments, changing anything about maintenance is still not recommended, since it may result in loss of users' data. If the testing/debugging happens in our dev environments, I think we can change the code locally. Moreover, as mentioned above, the margins of many subtasks need to be adjusted, so changing only the mode may not make it work as expected.

sseago commented 3 weeks ago

@Lyndon-Li "Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss."

There shouldn't be any risk here, since Kopia requires two separate full maintenance cycles at least four hours apart before it will remove any data. The concern is that with the default once-a-day full maintenance, a blob is removed 24 hours at the earliest, and up to 48 hours, after it is no longer referenced by a needed snapshot. We could reduce this window to 4+ hours if full maintenance ran more often. But even if you ran full maintenance constantly (which we wouldn't actually want), it shouldn't put the data at risk, because Kopia's built-in safety mechanisms require GC to mark a blob as safe to delete during two separate full maintenance cycles at least 4 hours apart.
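
To make the arithmetic concrete, here is a rough Go sketch of that window under a deliberately simplified model: the first full cycle after a blob becomes unreferenced marks it, and the next full cycle at least four hours later removes it. `deletionWindow` and `minGCSpacing` are hypothetical names for this sketch; the real Kopia safety parameters are more detailed than this.

```go
package main

import (
	"fmt"
	"time"
)

// minGCSpacing is the four-hour minimum gap between the marking cycle
// and the deleting cycle, as described in the comment above.
const minGCSpacing = 4 * time.Hour

// deletionWindow returns the earliest and latest delay between a blob
// becoming unreferenced and its removal, for a given full-maintenance
// interval, under the simplified two-cycle model.
func deletionWindow(fullInterval time.Duration) (earliest, latest time.Duration) {
	if fullInterval <= 0 {
		return 0, 0
	}
	// Gap between the marking cycle and the first later cycle that is
	// at least minGCSpacing away.
	gap := fullInterval
	for gap < minGCSpacing {
		gap += fullInterval
	}
	// Best case: the blob becomes unreferenced just before a full cycle.
	// Worst case: just after one, so it waits a whole interval to be marked.
	return gap, fullInterval + gap
}

func main() {
	for _, interval := range []time.Duration{24 * time.Hour, 12 * time.Hour, 6 * time.Hour} {
		earliest, latest := deletionWindow(interval)
		fmt.Printf("full every %v: removed %v to %v after the last reference\n", interval, earliest, latest)
	}
}
```

With a 24-hour interval this model gives the 24 to 48 hour window mentioned above; with 6 hours it gives 6 to 12 hours.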

sseago commented 3 weeks ago

I don't know if this is possible, but maybe there's a way to configure the kopia repo to do full maintenance more than once per day when velero runs maintenance with "mode=auto" -- that might be cleaner than a config to always run full, but I don't know whether that can be done. Then we could have behavior where full is done every 6 hours but quick every hour.

sseago commented 3 weeks ago

It looks like we probably can do that here: https://github.com/vmware-tanzu/velero/blob/db470a751b7a86c1f3e05d628a4694d84e6777ea/pkg/repository/udmrepo/kopialib/lib_repo.go#L595

        if overwriteFullMaintainInterval != time.Duration(0) {
            logger.Infof("Full maintenance interval change from %v to %v", p.FullCycle.Interval, overwriteFullMaintainInterval)
            p.FullCycle.Interval = overwriteFullMaintainInterval
        }

Maybe making this configurable is preferable to an "always use full maint" flag. Then we could recommend for users who want data to be deleted more quickly to set this to 6 or 12 hours instead of the default 24.
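
As a rough sketch of what a configurable value could look like, the Go snippet below parses a user-supplied duration string and applies it the same way as the override above. `cyclePolicy` is a stand-in for the Kopia parameters struct, and `applyFullInterval` and the `"6h"` option value are assumptions for illustration, not existing Velero configuration.

```go
package main

import (
	"fmt"
	"time"
)

// cyclePolicy stands in for the maintenance parameters touched by the
// snippet above; only the interval field is modelled here.
type cyclePolicy struct {
	Interval time.Duration
}

// applyFullInterval overrides the full-cycle interval from a configured
// duration string (hypothetical option). An empty or zero value keeps
// the repository default, mirroring the "!= time.Duration(0)" check.
func applyFullInterval(p *cyclePolicy, configured string) error {
	if configured == "" {
		return nil
	}
	d, err := time.ParseDuration(configured)
	if err != nil {
		return fmt.Errorf("invalid full maintenance interval %q: %w", configured, err)
	}
	if d < 0 {
		return fmt.Errorf("full maintenance interval %q must not be negative", configured)
	}
	if d != time.Duration(0) {
		p.Interval = d
	}
	return nil
}

func main() {
	p := &cyclePolicy{Interval: 24 * time.Hour}
	if err := applyFullInterval(p, "6h"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("full maintenance interval:", p.Interval)
}
```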

Lyndon-Li commented 3 weeks ago

> @Lyndon-Li "Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss."
>
> There shouldn't be any risk here, since Kopia requires two separate full maintenance cycles at least four hours apart before it will remove any data. The concern is that with the default once-a-day full maintenance, a blob is removed 24 hours at the earliest, and up to 48 hours, after it is no longer referenced by a needed snapshot. We could reduce this window to 4+ hours if full maintenance ran more often. But even if you ran full maintenance constantly (which we wouldn't actually want), it shouldn't put the data at risk, because Kopia's built-in safety mechanisms require GC to mark a blob as safe to delete during two separate full maintenance cycles at least 4 hours apart.

Yes. If you change the mode but not the margins, full maintenance has no additional effect and only consumes more resources; if you change the mode and also some margins, the data is put at risk.

Lyndon-Li commented 3 weeks ago

> It looks like we probably can do that here:
>
> https://github.com/vmware-tanzu/velero/blob/db470a751b7a86c1f3e05d628a4694d84e6777ea/pkg/repository/udmrepo/kopialib/lib_repo.go#L595
>
>       if overwriteFullMaintainInterval != time.Duration(0) {
>           logger.Infof("Full maintenance interval change from %v to %v", p.FullCycle.Interval, overwriteFullMaintainInterval)
>           p.FullCycle.Interval = overwriteFullMaintainInterval
>       }
>
> Maybe making this configurable is preferable to an "always use full maint" flag. Then we could recommend for users who want data to be deleted more quickly to set this to 6 or 12 hours instead of the default 24.

This looks more reasonable. The override value could be passed in udmrepo.RepoOptions through the backupRepository config. Besides, I have two more suggestions in this direction:

  1. It is neither necessary nor safe for users to set an arbitrary interval: p.FullCycle.Interval is also used and checked elsewhere in the Kopia code, so it should be kept within a reasonable range.
  2. We should keep the Unified Repo concepts even in the loose repo options, so we should avoid exposing Kopia parameters directly.
  3. Considering 1 and 2, I suggest we add a fastGC/eagerGC option. When this repo option is set, we override the full maintenance interval to 12/6 hours.
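
One possible shape for such a preset, sketched in Go. The mapping of fastGC to 12 hours and eagerGC to 6 hours is an assumption read from the order of "12/6" above, and `fullIntervalFor` is a hypothetical helper, not an existing Velero function. Returning zero for the unset case lets the existing non-zero check leave the repository default untouched.

```go
package main

import (
	"fmt"
	"time"
)

// fullIntervalFor translates a preset repo option into an override for
// the full maintenance interval. Zero means "no override".
func fullIntervalFor(option string) (time.Duration, error) {
	switch option {
	case "":
		return 0, nil // option not set: keep the repository default
	case "fastGC":
		return 12 * time.Hour, nil // assumed mapping
	case "eagerGC":
		return 6 * time.Hour, nil // assumed mapping
	default:
		return 0, fmt.Errorf("unknown maintenance option %q", option)
	}
}

func main() {
	for _, opt := range []string{"", "fastGC", "eagerGC"} {
		d, err := fullIntervalFor(opt)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("option %q -> override %v\n", opt, d)
	}
}
```

Because users pick a named preset rather than a duration, no Kopia parameter is exposed directly and values below the four-hour margin cannot be configured.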

sseago commented 3 weeks ago

@Lyndon-Li I think that's fine. 24/12/6 hour options should be sufficient. There's zero value in full maint more often than 4 hours, and exactly 4 hours could produce edge cases (i.e. last full maint marked this blob 3:59:58 ago and therefore it's too soon to delete now by 2 seconds), and 5 hours doesn't give you consistent day-to-day maint times. So 6 is realistically the smallest value that makes sense.
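
The reasoning about which values make sense can be checked with a few lines of Go: keep only whole-hour intervals that are longer than 4 hours and divide 24 evenly, so that maintenance lands at the same times every day. `sensibleIntervals` is a name made up for this sketch.

```go
package main

import "fmt"

// sensibleIntervals lists whole-hour intervals that are strictly longer
// than minExclusive hours and divide a 24-hour day evenly.
func sensibleIntervals(minExclusive int) []int {
	var out []int
	for h := 1; h <= 24; h++ {
		if h > minExclusive && 24%h == 0 {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	fmt.Println(sensibleIntervals(4)) // prints [6 8 12 24]
}
```

Six hours is the smallest value in that list, which matches the 24/12/6 presets.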

mpryc commented 2 weeks ago

@Lyndon-Li I really like your idea of having pre-set options. That makes it easy for the user to configure while preserving the underlying repo requirements (e.g. <4h doesn't make sense, so the user won't be able to set unacceptable parameters).