Open kaovilai opened 3 weeks ago
I can be assigned this issue
@kaovilai I think what we want here is just a bool entry -- "alwaysUseFullMaintenance" or something. Auto is the default (so bool is false), which results in Kopia doing one full maint per day, and the rest are quick. When we set this to true we'll want every maintenance to be full. I don't think we ever want to always do quick -- that would mean data is never cleaned up.
Sure. Bool entry if that works for everyone.
Could we clarify the scenarios in which we want users to configure the maintenance mode?
Basically, we don't want users to change the mode, because full maintenance and quick maintenance are very different from each other; they are designed to alternate, with the quick one running more frequently. Changing it manually may cause unexpected consequences:
Keeping data for a reasonable period is a Kopia policy that ensures the system works correctly, so manually changing the maintenance mode cannot cause data to be deleted earlier. Therefore, we should let users know that the repo maintains data at its own pace; this is to ensure data safety.
Another point:
Therefore, it is neither safe nor necessary to add the maintenance mode to the Unified Repository. At present, we let the repo itself decide how to do maintenance, including the mode and frequency, and offload the maintenance work entirely to the repo.
Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss.
We probably want this to occur more often at least for testing/debugging. And we've been getting customer cases where they say maintenance does not actually work for them: backups expire but nothing gets deleted.
And we've been getting customer cases where they say maintenance does not actually work for them: backups expire but nothing gets deleted
This may be the expected behavior, e.g., the data may be referenced by other backups and should not be deleted. Otherwise, we need to treat it as a bug and find the root cause before making changes.
We probably want this to occur more often at least for testing/debugging
For this purpose, if the debugging happens in users' production environments, changing anything about maintenance is still not recommended, since it may result in data loss for users; if the testing/debugging happens in our dev environments, I think we can change the code locally. Moreover, as mentioned above, the margins of many sub-tasks need to be adjusted, so changing only the mode may not make it work as expected.
@Lyndon-Li "Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss."
There shouldn't be any risk here, since kopia requires two separate full maintenance cycles at least four hours apart before it will remove any data. The concern is that with the default "once a day" full maintenance, deletion happens 24 hours at the earliest, and up to 48 hours, after a blob is no longer referenced by a needed snapshot. We could reduce this window to 4+ hours if full maintenance ran more often. But even if you ran full maintenance constantly (which we wouldn't actually want) it shouldn't put the data at risk because kopia's built-in safety mechanisms require GC to mark a blob as safe to delete during two separate full maint cycles at least 4 hours apart.
I don't know if this is possible, but maybe there's a way to configure the kopia repo to do full maintenance more than once per day when velero runs maintenance with "mode=auto" -- that might be cleaner than a config to always run full, but I don't know whether that can be done. Then we could have behavior where full is done every 6 hours but quick every hour.
It looks like we probably can do that here: https://github.com/vmware-tanzu/velero/blob/db470a751b7a86c1f3e05d628a4694d84e6777ea/pkg/repository/udmrepo/kopialib/lib_repo.go#L595
if overwriteFullMaintainInterval != time.Duration(0) {
    logger.Infof("Full maintenance interval change from %v to %v", p.FullCycle.Interval, overwriteFullMaintainInterval)
    p.FullCycle.Interval = overwriteFullMaintainInterval
}
Maybe making this configurable is preferable to an "always use full maint" flag. Then we could recommend for users who want data to be deleted more quickly to set this to 6 or 12 hours instead of the default 24.
@Lyndon-Li "Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss."
There shouldn't be any risk here, since kopia requires two separate full maintenance cycles at least four hours apart before it will remove any data. The concern is that with the default "once a day" full maintenance, deletion happens 24 hours at the earliest, and up to 48 hours, after a blob is no longer referenced by a needed snapshot. We could reduce this window to 4+ hours if full maintenance ran more often. But even if you ran full maintenance constantly (which we wouldn't actually want) it shouldn't put the data at risk because kopia's built-in safety mechanisms require GC to mark a blob as safe to delete during two separate full maint cycles at least 4 hours apart.
Yes, if you change the mode but not any margin, full maintenance has no effect other than consuming more resources; if you change the mode and also some margins, data is put at risk.
It looks like we probably can do that here:
if overwriteFullMaintainInterval != time.Duration(0) {
    logger.Infof("Full maintenance interval change from %v to %v", p.FullCycle.Interval, overwriteFullMaintainInterval)
    p.FullCycle.Interval = overwriteFullMaintainInterval
}
Maybe making this configurable is preferable to an "always use full maint" flag. Then we could recommend for users who want data to be deleted more quickly to set this to 6 or 12 hours instead of the default 24.
This looks more reasonable. The overwrite value could be set in udmrepo.RepoOptions through the backupRepository config.
Besides, I have two more suggestions in this direction:
p.FullCycle.Interval is also used and checked elsewhere in the Kopia code, so it should be kept within a reasonable range.
fastGC/eagerGC option. When this repo option is set, we overwrite the full maintenance interval to 12/6 hours.
@Lyndon-Li I think that's fine. 24/12/6 hour options should be sufficient. There's zero value in full maint more often than 4 hours, and exactly 4 hours could produce edge cases (i.e. last full maint marked this blob 3:59:58 ago and therefore it's too soon to delete now by 2 seconds), and 5 hours doesn't give you consistent day-to-day maint times. So 6 is realistically the smallest value that makes sense.
@Lyndon-Li I really like your idea to have pre-set options. That makes it easy for the user to configure while preserving the underlying repo requirements (e.g. <4h doesn't make sense, so the user can't set unacceptable parameters).
Describe the problem/challenge you have
We want the ability to configure the maintenance interval so that changes to storage take effect more quickly in some cases.
These can be configured in the repo-maintenance-job-configmap
Describe the solution you'd like
cc: @shubham-pampattiwar @weshayutin